CN103593436A - File merging method and device - Google Patents

Publication number: CN103593436A
Authority: CN (China)
Prior art keywords: data, file, node, key value, key
Legal status: Granted, active (as listed by Google Patents; an assumption, not a legal conclusion)
Application number: CN201310561317.8A
Other languages: Chinese (zh)
Other versions: CN103593436B (en)
Inventor: 包海龙
Current Assignee: Huawei Technologies Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Huawei Technologies Co Ltd
Application filed by: Huawei Technologies Co Ltd
Priority: CN201310561317.8A
Published as: CN103593436A (application), CN103593436B (grant)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G06F16/1824: Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183: Provision of network file services by network file servers, e.g. by using NFS, CIFS

Abstract

Embodiments of the invention provide a file merging method and device. The method comprises: partitioning the interval for which each node in a cluster system is responsible according to key value information of user data; for each node, determining that the node meets a first trigger condition and reading at least two first files from the node's disk into the node's cache; determining, from the key values of the user data stored in the first files, the target partition to which each key value belongs; merging the data that share the same key value; and storing the merged data for each key value in the corresponding target partition. Because each node's interval is partitioned and the data are merged once, data with the same key value end up in the same partition. At query time, the partition holding the data is determined from the key value, and only the files in that partition are scanned. Since each partition holds few files, data screening touches fewer files and read performance improves.

Description

File merging method and device
Technical field
Embodiments of the present invention relate to data communication technology, and in particular to a file merging method and device.
Background
As the Internet continues to develop, Internet applications keep growing in scale, and the database storage these applications rely on faces increasing challenges. Traditional relational databases struggle to meet the storage demands of massive data, and non-relational (NoSQL) databases emerged in response; examples include BigTable, developed by Google, and Facebook's Cassandra. A non-relational database is usually a distributed system whose stored data are spread across the nodes. Most current non-relational databases are implemented with consistent hashing: all hash values of a hash function form a ring joined end to end (the maximum value is connected to the minimum value), and each node in the non-relational database cluster is responsible for one part of the ring. The data to be stored are hashed as well, and the hash value identifies the node responsible for storing the data, which establishes the correspondence between stored data and nodes.
As for the physical storage structure on each node, a traditional relational database has fixed blocks on which data can be read and written repeatedly. To guarantee concurrent write performance, a non-relational database instead adopts a disk random write mode, with a logical data file on disk as the minimum data storage unit. Old data are not deleted; the latest data are identified by comparing timestamps. Many current non-relational databases use this persistence approach, which differs from that of relational databases. Fig. 1 is a schematic diagram of data storage in a non-relational database. As shown in Fig. 1, when data are written they first go into a memory table. When the memory table is full, the data to be written are flushed to disk as a file group, whose data format can be a sorted string table (SSTable, Sorted String Table). Each file group comprises a set of files that respectively store the user data, the index information of the file, the hash algorithm for the key values, and static statistics. As shown in Fig. 1, the data in the memory table (to be written) are being written to file group n. The data of one user may be scattered across several files. As the number of files grows, every read has to screen data from multiple files and compare the timestamps of identical records before it can decide which record to return to the client. In this scenario, the longer the database runs, the more data files there are, and the read performance of the whole database declines exponentially. Many non-relational database products have therefore proposed and implemented data merging: data with the same key value that are scattered across data files are compared and merged (MAP-REDUCE) into one large data file, and repeated merging reduces the number of files and thereby improves the read performance of the database.
In the prior art, even with file merging, the number of files stored on a node remains large, every read still has to screen data from multiple files, and the read performance of the whole database is still low.
Summary of the invention
Embodiments of the present invention provide a file merging method and device that can improve file read performance.
A first aspect of the present invention provides a file merging method, comprising:
Partitioning, according to key value information of user data, the interval for which each node in a cluster system is responsible, where the partitions of each node correspond one-to-one with key values of the user data;
For each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not yet been merged, each first file stores data corresponding to at least one user, and the data corresponding to each user have a different key value;
Determining, according to the key value of the data corresponding to each user in each first file, the target partition to which each key value belongs;
Merging, according to each key value, the data in the first files that have the same key value, and storing the merged data corresponding to each key value in the corresponding target partition.
In a first possible implementation of the first aspect of the present invention, determining the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file comprises:
Calculating, from the key value of the data corresponding to each user, the hash value corresponding to each key value;
Determining the target partition to which each key value belongs according to the hash value corresponding to that key value.
In a second possible implementation of the first aspect of the present invention, after the data in the first files that have the same key value are merged according to each key value and the merged data corresponding to each key value are stored in the corresponding target partition, the method further comprises:
Determining that the target partition meets a second trigger condition, and reading at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the data corresponding to each user have a different key value;
Merging, according to the key value of the data corresponding to each user, the data in the second files that have the same key value, and storing the merged data corresponding to each key value in a third file of the target partition.
With reference to the first aspect of the present invention or its first or second possible implementation, in a third possible implementation of the first aspect, after the data in the first files that have the same key value are merged according to each key value and the merged data corresponding to each key value are stored in the corresponding target partition, the method further comprises:
When a data query request is received, obtaining the key value of the data to be queried from the data query request;
Determining, according to the key value of the data to be queried, the partition to be queried in which those data are located, and scanning all files in that partition according to the key value to obtain the data corresponding to the key value.
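The query path above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout `(key, timestamp, payload)`, the in-memory `partition_files` mapping, and the `partition_of` callback are all hypothetical, and picking the record with the newest timestamp is an assumption taken from the background description of timestamp comparison.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical record layout: (key value, timestamp, payload).
Record = Tuple[str, int, str]


def query(
    key: str,
    partition_of: Callable[[str], int],
    partition_files: Dict[int, List[List[Record]]],
) -> Optional[str]:
    """Scan only the files of the partition that owns `key`."""
    partition_id = partition_of(key)  # e.g. derived from the hash of the key
    newest: Optional[Tuple[int, str]] = None
    # Every file in the partition is scanned; other partitions are never touched.
    for records in partition_files.get(partition_id, []):
        for rec_key, timestamp, payload in records:
            if rec_key == key and (newest is None or timestamp > newest[0]):
                newest = (timestamp, payload)
    return None if newest is None else newest[1]
```

The point of the sketch is that the cost of a read depends on the number of files in one partition rather than on the number of files on the whole node.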
In a fourth possible implementation of the first aspect of the present invention, determining the partition to be queried in which the data to be queried are located according to the key value of those data comprises:
Calculating the hash value corresponding to the key value of the data to be queried;
Determining the partition to be queried in which the data are located according to that hash value.
In a fifth possible implementation of the first aspect of the present invention, determining that the node meets the first trigger condition comprises:
Judging whether the number of first files stored on the node reaches a first preset file merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determining that the node meets the first trigger condition.
In a sixth possible implementation of the first aspect of the present invention, determining that the node meets the first trigger condition comprises:
Judging whether the number of first files stored on the node reaches the first preset file merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determining, according to the size of each first file, whether the first files meet a first merge condition;
If so, determining that the node meets the first trigger condition.
In a seventh possible implementation of the first aspect of the present invention, determining that the target partition meets the second trigger condition comprises:
Judging whether the number of second files stored in the target partition reaches a second preset file merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determining that the target partition meets the second trigger condition.
In an eighth possible implementation of the first aspect of the present invention, determining that the target partition meets the second trigger condition comprises:
Judging whether the number of second files stored in the target partition reaches the second preset file merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determining, according to the size of each second file, whether the second files meet a second merge condition;
If so, determining that the target partition meets the second trigger condition.
In a ninth possible implementation of the first aspect of the present invention, partitioning the interval for which each node in the cluster system is responsible according to the key value information of the user data comprises:
If the interval for which the cluster system is responsible is (min, min+2^127], splitting that interval into 2^N parts to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval for which the cluster system is responsible, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
For each node, if the interval for which the node is responsible is (r1, r2], partitioning the node according to the partition step S to obtain |r1 - r2|/S intervals, where each pair of adjacent boundaries, taken in order, forms a left-open, right-closed interval (r_n, r_(n+1)], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval for which the node is responsible, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
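The partitioning rule can be illustrated with a short sketch. The function names are hypothetical, and the handling of a node interval whose width is not an exact multiple of S (the last partition is simply shorter) is an assumption the text does not address.

```python
from typing import List, Tuple


def partition_step(partition_factor: int, ring_bits: int = 127) -> int:
    """Partition step S = 2^ring_bits / 2^N for partition factor N >= 1."""
    if partition_factor < 1:
        raise ValueError("partition factor N must be >= 1")
    return 2 ** ring_bits // 2 ** partition_factor


def node_partitions(r1: int, r2: int, step: int) -> List[Tuple[int, int]]:
    """Split the node interval (r1, r2] into left-open, right-closed
    sub-intervals of width `step`; each tuple (low, high) means (low, high]."""
    if r2 <= r1:
        raise ValueError("r2 must be greater than r1")
    partitions = []
    low = r1
    while low < r2:
        high = min(low + step, r2)  # assumption: a trailing remainder forms a shorter partition
        partitions.append((low, high))
        low = high
    return partitions


# Example: N = 2 on a 2^127 ring gives S = 2^125, so a node that owns the
# whole ring (0, 2^127] is split into four partitions.
step = partition_step(2)
for low, high in node_partitions(0, 2 ** 127, step):
    print(f"({low}, {high}]")
```

A larger partition factor N yields a smaller step and therefore more, finer partitions per node.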
A second aspect of the present invention provides a file merging device, comprising:
A partitioning module, configured to partition, according to key value information of user data, the interval for which each node in a cluster system is responsible, where the partitions of each node correspond one-to-one with key values of the user data;
A file read module, configured to, for each node, determine that the node meets a first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not yet been merged, each first file stores data corresponding to at least one user, and the data corresponding to each user have a different key value;
A partition determination module, configured to determine, according to the key value of the data corresponding to each user in each first file, the target partition to which each key value belongs;
A file merging module, configured to merge, according to each key value, the data in the first files that have the same key value, and to store the merged data corresponding to each key value in the corresponding target partition.
In a first possible implementation of the second aspect of the present invention, the partition determination module is specifically configured to:
Calculate, from the key value of the data corresponding to each user, the hash value corresponding to each key value;
Determine the target partition to which each key value belongs according to the hash value corresponding to that key value.
In a second possible implementation of the second aspect of the present invention, the file read module is further configured to:
Determine that the target partition meets a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the data corresponding to each user have a different key value;
The file merging module is further configured to merge, according to the key value of the data corresponding to each user, the data in the second files that have the same key value, and to store the merged data corresponding to each key value in a third file of the target partition.
With reference to the second aspect of the present invention or its first or second possible implementation, in a third possible implementation of the second aspect, the device further comprises a receiver module, a key value acquisition module, and a query module;
The receiver module is configured to receive a data query request;
The key value acquisition module is configured to obtain the key value of the data to be queried from the data query request when the receiver module receives the request;
The partition determination module is further configured to determine, according to the key value of the data to be queried, the partition to be queried in which those data are located;
The query module is configured to scan all files in the partition to be queried according to the key value of the data to be queried and obtain the data corresponding to that key value.
In a fourth possible implementation of the second aspect of the present invention, the partition determination module is specifically configured to:
Calculate the hash value corresponding to the key value of the data to be queried;
Determine the partition to be queried in which the data are located according to that hash value.
In a fifth possible implementation of the second aspect of the present invention, the file read module is specifically configured to:
Judge whether the number of first files stored on the node reaches a first preset file merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In a sixth possible implementation of the second aspect of the present invention, the file read module is specifically configured to:
Judge whether the number of first files stored on the node reaches the first preset file merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determine, according to the size of each first file, whether the first files meet a first merge condition;
If so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In a seventh possible implementation of the second aspect of the present invention, the file read module is specifically configured to:
Judge whether the number of second files stored in the target partition reaches a second preset file merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In an eighth possible implementation of the second aspect of the present invention, the file read module is specifically configured to:
Judge whether the number of second files stored in the target partition reaches the second preset file merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determine, according to the size of each second file, whether the second files meet a second merge condition;
If so, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In a ninth possible implementation of the second aspect of the present invention, the partitioning module is specifically configured to:
If the interval for which the cluster system is responsible is (min, min+2^127], split that interval into 2^N parts to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval for which the cluster system is responsible, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
For each node, if the interval for which the node is responsible is (r1, r2], partition the node according to the partition step S to obtain |r1 - r2|/S intervals, where each pair of adjacent boundaries, taken in order, forms a left-open, right-closed interval (r_n, r_(n+1)], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval for which the node is responsible, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
In the file merging method and device of the embodiments of the present invention, the interval for which a node is responsible is partitioned, and when the first trigger condition is met, one merge is triggered for the files on the node that have not yet been merged. After this one merge, user data with the same key value are stored in the same partition, which reduces the granularity of data storage. At query time, the partition holding the data is first determined from the key value, and the required data are then looked up by scanning the data files of that partition. Because the partition contains few files, data screening touches fewer files, which improves read performance.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of data storage in a non-relational database;
Fig. 2 is a diagram of a cluster distribution system;
Fig. 3 is a schematic diagram of file merging in the prior art;
Fig. 4 is a flowchart of Embodiment 1 of the file merging method of the present invention;
Fig. 5 is a flowchart of Embodiment 2 of the file merging method of the present invention;
Fig. 6 is a schematic diagram of node partitioning applicable to this embodiment;
Fig. 7 is a schematic structural diagram of Embodiment 1 of the file merging device of the present invention;
Fig. 8 is a schematic structural diagram of Embodiment 2 of the file merging device of the present invention;
Fig. 9 is a schematic structural diagram of Embodiment 3 of the file merging device of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Before the embodiments are introduced, the scenario to which they apply is briefly described. The embodiments of the present invention mainly apply to non-relational databases, which can be implemented with consistent hashing. In consistent hashing, all hash values of a hash function form a hash ring joined end to end: the head of the ring is the maximum value and the tail is the minimum value, that is, the maximum value is connected to the minimum value. All hash values on the ring make up the interval of the non-relational database cluster system, and each node in the cluster is responsible for part of the ring, so the hash value of any data stored on a node must fall within the interval for which that node is responsible. The data to be stored are hashed as well, and the hash value identifies the node responsible for storing the data, which establishes the correspondence between stored data and nodes.
Fig. 2 is a diagram of a cluster distribution system. As shown in Fig. 2, the cluster has four nodes. The four large circles distributed on the hash ring, which are the circles the dashed arrows point to, represent node 1, node 2, node 3, and node 4. Each node has a corresponding cache and disk, and the small circles between the large circles represent user data. The interval of hash values for which the cluster system is responsible is (0, 2^32]. In the clockwise direction, the interval between node 4 and node 1 is the interval for which node 1 is responsible, the interval between node 1 and node 2 is that of node 2, the interval between node 2 and node 3 is that of node 3, and the interval between node 3 and node 4 is that of node 4. The hash values in the intervals of different nodes are different. When data are received, the key value information of the user to whom the data belong is obtained first, the hash value corresponding to the key value is calculated, the node whose interval contains that hash value is identified, and the user data are stored on that node. Fig. 2 is only an example; a cluster system may have more nodes.
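The node lookup described for Fig. 2 can be sketched as follows. The node positions on the ring and the MD5-based hash are assumptions made purely for illustration; the text specifies neither the hash function nor where the four nodes sit on the ring.

```python
import bisect
import hashlib
from typing import List, Tuple

RING_SIZE = 2 ** 32  # hash values lie in (0, 2^32], as in the Fig. 2 example


def key_hash(key: str) -> int:
    """Map a key value into (0, 2^32]. MD5 is an illustrative choice only."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE + 1


def owner(tokens: List[Tuple[int, str]], hash_value: int) -> str:
    """Return the node whose interval (previous token, own token] contains
    `hash_value`. `tokens` is a list of (ring position, node name) sorted by position."""
    positions = [position for position, _ in tokens]
    index = bisect.bisect_left(positions, hash_value)
    # Hash values past the last token wrap around to the first node on the ring.
    return tokens[index % len(tokens)][1]


# Hypothetical placement of the four nodes of Fig. 2, evenly spaced.
ring = [
    (2 ** 30, "node 1"),
    (2 ** 31, "node 2"),
    (3 * 2 ** 30, "node 3"),
    (2 ** 32, "node 4"),
]
print(owner(ring, key_hash("user-A")))
```

With this placement, node 1 owns (0, 2^30], which corresponds to the interval between node 4 and node 1 in the clockwise direction.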
Fig. 4 is a flowchart of Embodiment 1 of the file merging method of the present invention. The method of this embodiment is performed by each node separately: each node merges the files on itself as described below. As shown in Fig. 4, the file merging method of this embodiment comprises the following steps:
Step 101: partition, according to key value information of user data, the interval for which each node in the cluster system is responsible.
Partitioning the interval for which a node is responsible means dividing that interval into smaller intervals according to some rule. Each interval obtained by the division is a partition, that is, a sub-interval of the original interval; in every embodiment of the present invention, a partition is a sub-interval of some interval. In the method of this embodiment, the interval for which each node is responsible is further divided into smaller partitions, each partition is responsible for a smaller interval, and the partitions correspond one-to-one with key values of the data.
Step 102: for each node, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not yet been merged, each first file stores data corresponding to at least one user, and the data corresponding to each user have a different key value.
Each node has a dedicated module responsible for file merging. The module determines whether the node meets the first trigger condition and triggers a file merge task when it does. Specifically, the module judges whether the number of first files stored on the node reaches the first preset file merge count M1, where M1 is a positive integer greater than or equal to 2; if so, the node meets the first trigger condition. For example, if the first preset file merge count is 4, a file merge task is triggered when the number of first files on the node reaches 4. A first file here is a file that has not yet been merged. When the node receives user data, it stores the data in files on the node; the node also holds some files that have already been merged. The merge task is triggered once the unmerged files on the node meet the first trigger condition.
Of course, the first trigger condition may also include other preset conditions. For example, once the number of first files on the node reaches the preset file merging number, whether each first file meets a first merging condition is further determined according to the size of each first file; if so, the node is determined to meet the first trigger condition. Determining whether each first file meets the first merging condition according to its size specifically means judging whether the size difference among the first files is within a preset threshold. If it is, the file merging task is triggered; if the first files to be merged differ greatly in size, no merging is performed. An example: suppose there are four first files, denoted 1, 2, 3 and 4, where file 1 is 100M, file 2 is 200M, file 3 is 300M and file 4 is 50M. After file 1 and file 2 are read, the mean of their sizes, 150M, is taken and multiplied by a maximum weighting factor and a minimum weighting factor. In general the maximum weighting factor is 1.5 and the minimum weighting factor is 0.5, so in this embodiment multiplying the mean by the minimum and maximum weighting factors gives 75M and 225M respectively. If the size of file 3 falls within the interval [75M, 225M], file 3 is of suitable size, meets the merging condition, and can be merged with files 1 and 2. In this embodiment file 3 is 300M, which is outside [75M, 225M], so it does not meet the merging condition. The size of file 4 is then checked in the same way. File 4 is 50M; although this does not fall within [75M, 225M], file 4 is itself very small and merging it consumes few resources. Small files therefore need not satisfy the above interval: a threshold can be set, for example 50M, and a file smaller than this threshold is merged directly. What is described here is only one example of judging whether the file size difference meets the preset threshold; other methods may of course be used and are not enumerated here.
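As an illustration only, the size check in the example above can be sketched in Python as follows. The function names, the parameter names and the use of megabytes as plain numbers are assumptions made for this sketch. The small-file threshold is treated as inclusive here, which is also an assumption, so that the 50M file of the example qualifies against a 50M threshold.

```python
def size_window(selected_sizes, min_factor=0.5, max_factor=1.5):
    """Return the (low, high) size window derived from the files chosen so far."""
    mean = sum(selected_sizes) / len(selected_sizes)
    return mean * min_factor, mean * max_factor


def can_merge(selected_sizes, candidate, small_threshold=50):
    """Decide whether a candidate file (size in MB) may join the merge."""
    # Assumption: the small-file threshold is inclusive, so that the 50M
    # file in the example qualifies against a 50M threshold.
    if candidate <= small_threshold:
        return True
    low, high = size_window(selected_sizes)
    return low <= candidate <= high


selected = [100, 200]                 # files 1 and 2
print(size_window(selected))          # (75.0, 225.0)
print(can_merge(selected, 300))       # file 3: False
print(can_merge(selected, 50))        # file 4: True
```

With these inputs the window is [75M, 225M], file 3 is rejected and file 4 is accepted through the small-file rule, matching the example.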
Step 103: determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file.
After the first files to be merged are read, the target partition to which each key value belongs is determined according to the key value of the data corresponding to each user in the first files; that is, it is judged which partition the key value of each user's data falls into. Specifically: first, the hash value corresponding to each key value is calculated from the key value of the data corresponding to each user; then, the target partition to which each key value belongs is determined from that hash value. The hash values corresponding to different key values may fall into different partitions.
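The two sub-steps (hash the key value, then locate the partition whose interval contains the hash value) might look like the following Python sketch. The patent does not name a hash algorithm; MD5, the function names and the partition boundaries used here are illustrative assumptions only.

```python
import hashlib
from bisect import bisect_left


def key_hash(key, lo, hi):
    """Map a key to an integer hash in (lo, hi]. MD5 is an illustrative choice."""
    digest = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return lo + 1 + digest % (hi - lo)


def target_partition(h, upper_bounds):
    """Index of the left-open, right-closed partition containing hash h.

    upper_bounds holds the sorted right endpoints of the partitions.
    """
    return bisect_left(upper_bounds, h)


bounds = [15, 25, 30]                  # partitions (10,15], (15,25], (25,30]
print(target_partition(15, bounds))    # 0
print(target_partition(16, bounds))    # 1
print(target_partition(key_hash("a1", 10, 30), bounds))
```

Using `bisect_left` on the right endpoints respects the left-open, right-closed convention: a hash equal to an endpoint belongs to the partition that ends there.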
Step 104: merge the data having the same key value in the first files, and store the merged data corresponding to each key value in the corresponding target partition.
In this step, the data having the same key value in the first files are merged. For example, if the data of user A are stored in file 1, file 2 and file 3, the data of user A are read from these three files; the key value of user A's data is the same in all three files. After being merged, the data of user A are stored in the target partition, which is determined by the hash value of the key value corresponding to user A's data. After file merging is complete, the merged data may be stored in a second file of the target partition, for example a second file of target partition A. Each file may correspond to one static statistics file, which stores information about that file, such as the time the data were written and the size of the file. In this embodiment, the correspondence between the second file and its target partition is saved in the static statistics file. When the node later restarts, the second file can be loaded directly into the partition file list of the target partition according to the static statistics file. The partition file list records information about all files in the target partition, for example the storage address on disk of each file in the target partition. When the node restarts, all files in the target partition are loaded into the partition file list of that partition according to the static statistics files. During a data query, the partition file list is found from the correspondence between the second file and its target partition in the static statistics file, and data are then read from disk according to the partition file list.
It should be noted that if the data of user A are stored only in file 1, and not in file 2 or file 3, then no merging actually takes place for the data of user A during the merging process: the data of user A are simply stored in the second file of the target partition corresponding to the key value. Data are merged only when at least two first files store data with the same key value.
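A minimal sketch of step 104, assuming each first file can be modelled as a dictionary from key value to a list of records and that a function `partition_of` (hypothetical) returns the target partition of a key value. It also shows the case noted above: a key value present in only one file is simply carried over.

```python
from collections import defaultdict


def merge_first_files(first_files, partition_of):
    """Merge records that share a key and group the result by target partition.

    first_files: list of dicts mapping key -> list of records.
    partition_of: function mapping a key to its target partition id.
    """
    second_files = defaultdict(dict)
    for f in first_files:
        for key, records in f.items():
            second_files[partition_of(key)].setdefault(key, []).extend(records)
    return dict(second_files)


file1 = {"A": ["a-1"], "B": ["b-1"]}
file2 = {"A": ["a-2"]}
file3 = {"A": ["a-3"]}
# Hypothetical placement: key "A" hashes into partition 0, key "B" into 1.
result = merge_first_files([file1, file2, file3], {"A": 0, "B": 1}.get)
print(result)  # {0: {'A': ['a-1', 'a-2', 'a-3']}, 1: {'B': ['b-1']}}
```

The data of user A are combined from three files, while the data of user B, found in one file only, are moved to their target partition unchanged.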
In the method provided by this embodiment, the interval for which a node is responsible is partitioned, and when the first trigger condition is met, merging of the unmerged files on the node is triggered. Each file stores data corresponding to at least one user, and the key value of each user's data is different. The hash value corresponding to each key value is calculated from the key value of the data corresponding to each user, and the target partition to which that hash value belongs is determined. The data having the same key value in the files to be merged are then merged, and the merged data are stored in the target partition to which the hash value of that key value belongs. By partitioning the interval within the node and then performing one round of data merging, user data with the same key value are stored in the same partition and the granularity of data storage is reduced. During a query, the partition holding the data is first determined from the key value, and the required data are then looked up by scanning the data files of that partition. Because a partition contains fewer files, data need to be filtered from fewer files, which improves read performance.
Fig. 5 is a flow chart of embodiment two of the file merging method of the present invention. The method provided by this embodiment is executed by each node separately: each node merges the files on itself as described below. Compared with embodiment one, this embodiment additionally merges the files within a partition when the partition meets a merging condition. As shown in Fig. 5, the file merging method provided by this embodiment comprises the following steps:
Step 201: partition the interval for which each node in the cluster system is responsible according to the key value information of the user data.
Partitioning the interval for which a node is responsible means dividing that interval into smaller intervals according to a certain rule; each interval after division is a partition. In this embodiment, partitioning can be performed as follows: if the interval for which the cluster system is responsible is (min, min+2^127], that interval is split by 2^N to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval for which the cluster system is responsible, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1. The partition factor N can be set according to actual needs.
For each node, if the interval for which the node is responsible is (r1, r2], the node is partitioned according to the partition step S to obtain |r1 - r2|/S intervals. Each pair of adjacent boundaries forms a left-open, right-closed interval (rn, rn+1], where 0<n<|r1 - r2|/S - 1 and n is a positive integer. Each interval is one partition of the node. r1 is the minimum hash value of the interval for which the node is responsible and r2 is the maximum hash value; r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
Only one partitioning method is listed here. When partitioning, the partitions may be of the same size or of different sizes; the present invention does not limit this.
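The step computation and the splitting of a node interval can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the function names are hypothetical, and the handling of a remainder shorter than S (it becomes the last partition) is an assumption that the text does not address.

```python
def partition_step(n, space_bits=127):
    """Partition step S = 2^127 / 2^N for partition factor N."""
    return 2 ** space_bits // 2 ** n


def split_interval(r1, r2, step):
    """Split (r1, r2] into consecutive left-open, right-closed partitions."""
    partitions = []
    lo = r1
    while lo < r2:
        # Assumption: a remainder shorter than the step becomes the last partition.
        hi = min(lo + step, r2)
        partitions.append((lo, hi))
        lo = hi
    return partitions


s = partition_step(2)                          # N = 2
print(s == 2 ** 125)                           # True
print(len(split_interval(0, 2 ** 127, s)))     # 4
print(split_interval(10, 30, 5))               # [(10, 15), (15, 20), (20, 25), (25, 30)]
```

Each tuple (lo, hi) stands for the interval (lo, hi]; with N = 2 the whole hash space of width 2^127 yields 2^N = 4 partitions.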
A specific example follows. Fig. 6 is a schematic diagram of node partitioning applicable to this embodiment. As shown in Fig. 6, the circle in the upper part of the figure represents the whole cluster system, which contains four nodes: node A, node B, node C and node D. The range of hash values of the interval for which node A is responsible is (70, 10], where 70 and 10 are hash values; node B is responsible for (10, 30], node C for (30, 50], and node D for (50, 70]. Taking node B as an example, the interval (10, 30] for which node B is responsible is first divided into three partitions (slices), B1, B2 and B3, with intervals (10, 15], (15, 25] and (25, 30] respectively. Node B holds 5 first files to be merged. These 5 files are merged into 3 second files, each belonging to a different partition, and each second file is stored in its corresponding target partition; in Fig. 6 the arrows indicate the target partition of each second file. Concretely, suppose the 5 first files to be merged on node B are file 1, file 2, file 3, file 4 and file 5. File 1 stores the data of three users: the key value of user A1's data is a1, that of user A2's data is a2, and that of user A3's data is a3. File 2 stores the data of users A1 and A2. File 3 stores the data of user A1 and also of users A4 and A5, whose key values are a4 and a5 respectively. File 4 stores the data of users A3 and A4. File 5 stores the data of users A2, A3, A4 and A5. Take merging the data with key value a1 as an example. First, the hash value of key value a1 is calculated with a hash algorithm; then it is determined which partition this hash value falls into. Suppose it falls into partition B1; partition B1 is then the target partition of key value a1. Finally, the data with key value a1 in file 1, file 2 and file 3 are merged, and the merged data are stored in file 6 on partition B1. The data with key values a2, a3, a4 and a5 are merged in the same way. Suppose the target partition of key values a2 and a5 is B2, that of key values a1 and a3 is B1, and that of key value a4 is B3. The merged data of key values a1 and a3 are then stored in file 6, which belongs to partition B1; the merged data of a2 and a5 are stored in file 7, which belongs to B2; and the merged data of a4 are stored in file 8, which belongs to B3. Files 6, 7 and 8 are all second files. Through this merging, the data in the five first files are merged into three second files, each belonging to a different partition: after merging, the data of users A1 and A3 are all stored on partition B1, the data of users A2 and A5 on partition B2, and the data of user A4 on partition B3.
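The Fig. 6 example can be traced with a short Python sketch. The contents of the five first files and the key-to-partition placement are taken from the example; the placement is supposed there rather than computed, so it is hard-coded here instead of being derived from a hash function. The variable names are assumptions for illustration.

```python
# Which users' data each first file on node B holds (from the example).
first_files = {
    "file 1": {"a1", "a2", "a3"},
    "file 2": {"a1", "a2"},
    "file 3": {"a1", "a4", "a5"},
    "file 4": {"a3", "a4"},
    "file 5": {"a2", "a3", "a4", "a5"},
}
# Target partitions as supposed in the example, not computed from a real hash.
target = {"a1": "B1", "a3": "B1", "a2": "B2", "a5": "B2", "a4": "B3"}
second_file = {"B1": "file 6", "B2": "file 7", "B3": "file 8"}

sources = {}    # key -> first files that contribute data for it
contents = {}   # second file -> keys stored in it
for name, keys in first_files.items():
    for key in keys:
        sources.setdefault(key, []).append(name)
        contents.setdefault(second_file[target[key]], set()).add(key)

print(sources["a1"])                 # ['file 1', 'file 2', 'file 3']
print(sorted(contents["file 6"]))    # ['a1', 'a3']
```

The trace confirms that file 6 receives a1 and a3, file 7 receives a2 and a5, and file 8 receives a4, so five first files collapse into three second files.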
Step 202: for each node, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In this step, a first file is a file that has not yet been merged. Each first file stores data corresponding to at least one user, and the key value of each user's data is different. For the specific implementation, refer to the description of step 102 in embodiment one.
Step 203: determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file.
The data of different users can be distinguished by key value. For each user, all of that user's data have the same key value, and user data and key values correspond one to one; during a query, the data of a user are found by key value. In this embodiment, determining the target partition of each key value according to the key value of each user's data is specifically: first, the hash value corresponding to each key value is calculated from the key value of each user's data; once the hash values are obtained, it is judged which partition each hash value belongs to. Each partition is responsible for one interval, and the partition to which the hash value of a key value belongs is the target partition of that key value.
Step 204: merge the data having the same key value in the first files, and store the merged data corresponding to each key value in the corresponding target partition.
In this step, the data having the same key value in the first files are merged, and the merged data are stored in a second file of the target partition. After merging is complete, the relationship between the second file and its target partition is stored in the static statistics file corresponding to the second file; each file corresponds to one static statistics file. When the node later restarts, the second file can be loaded directly into the partition file list of the target partition according to the static statistics file. During a data query, the partition file list of the target partition is found from the relationship between the second file and its target partition recorded in the static statistics file, and data are then read from disk according to that partition file list.
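How a node might rebuild its partition file lists from the static statistics files on restart can be sketched as below. The patent does not specify the format of a static statistics file; the JSON encoding and the field names `path`, `partition` and `size_mb` are hypothetical choices made only for this illustration.

```python
import json
from collections import defaultdict


def rebuild_partition_lists(stat_records):
    """Rebuild partition -> [file path] lists from static statistics records."""
    lists = defaultdict(list)
    for raw in stat_records:
        stat = json.loads(raw)
        lists[stat["partition"]].append(stat["path"])
    return dict(lists)


# Hypothetical serialized statistics, one per second file.
stats = [
    json.dumps({"path": "/data/file6", "partition": "B1", "size_mb": 120}),
    json.dumps({"path": "/data/file7", "partition": "B2", "size_mb": 90}),
    json.dumps({"path": "/data/file9", "partition": "B1", "size_mb": 40}),
]
print(rebuild_partition_lists(stats))
# {'B1': ['/data/file6', '/data/file9'], 'B2': ['/data/file7']}
```

Because each statistics record names its own partition, the lists can be rebuilt without opening the data files themselves.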
Step 205: determine that the target partition meets a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In steps 201-204 above, the interval within the node is partitioned so that the granularity of data storage is reduced, and the files that have not been merged are merged once; this first merge distributes the data into different partitions. In this step, when the files in a partition meet the merging condition, the files in that partition are merged again. Each partition is independent, and only files within the same partition are merged together. Specifically, when the target partition meets the second trigger condition, at least two second files to be merged in the target partition are read from the disk of the node where the target partition is located into the cache. Each second file stores data corresponding to at least one user, and the key value of each user's data is different.
In this embodiment, determining whether the target partition meets the second trigger condition is specifically: judging whether the number of second files stored on the target partition has reached a second preset file merging number M2, where M2 is a positive integer greater than or equal to 2. Of course, the second trigger condition may also include other preset conditions. For example, once the number of second files on the target partition reaches the second preset file merging number, whether each second file meets a second merging condition is further determined according to the size of each second file. This specifically means judging whether the size difference among the second files is within a preset maximum threshold: if it is, the file merging task is triggered; if the second files differ greatly in size, no merging is performed. In this embodiment, the method used in embodiment one to judge whether the size difference among the first files meets the preset threshold can be used to judge whether the size difference among the second files meets the preset maximum threshold, and is not repeated here, although the specific parameter settings may differ. For instance, if one second file is 500M and another second file is 30M, they are not merged. The preset file merging number M2 can be set according to actual needs. If a high query rate is required, M2 can be set to a smaller value to increase the number of merges, thereby reducing the number of files and improving query efficiency.
Step 206: merge the data having the same key value in the second files according to the key value of the data corresponding to each user, and store the merged data corresponding to each key value in a third file of the target partition.
In this step, the data having the same key value in the second files to be merged in the target partition are merged into the same third file, so that the data within a partition are highly aggregated. Fig. 3 is a schematic diagram of file merging in the prior art. As shown in Fig. 3, a merging task is initiated once the number of files exceeds 4, and the 4 files are merged into one new file; when new data files are generated, data files of similar size are selected for merging by comparing sizes. Specifically, all sstables under the same table are grouped by size, with sstables of similar size placed in one group, forming n groups (n>=1); the group with the smallest mean size is then taken from these n groups to form a task that performs the compaction operation. The size of each group must be in the range (4, 32]; groups with too many members are truncated. The data having the same key value are then merged together in a map-reduce manner to form a new data file. The method also supports multiple threads executing several tasks concurrently, with each task selecting different data files. However, the prior art has the following problems. The trigger condition for merging data files is low: 4 files are enough to trigger a merge, so merges occur frequently every day, and a large number of file merges occupy considerable memory, CPU, I/O and other resources. Moreover, after the data files have gone through many rounds of merging, the key values are already highly aggregated; a further merge then only relocates key value data and performs very little actual merging, which wastes system resources. The method provided by this embodiment splits large files into small partition files and merges the files in a partition only when those files meet the merging condition. Because the number of files in a partition is much smaller than the number of files on a node in the prior art, the number of merges can be reduced and merging efficiency improved. When the data are evenly distributed and the files stored on each partition are of the same size, the merging efficiency for the same amount of data can be improved by more than 30%.
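Steps 205 and 206 can be sketched for a single partition as follows, again modelling each file as a dictionary from key value to records. Only the count-based form of the second trigger condition is shown; the optional size check is omitted. The function name and the default M2 = 4 are assumptions for illustration.

```python
def merge_partition(second_files, m2=4):
    """Merge a partition's second files into one third file once M2 is reached.

    second_files: list of dicts mapping key -> list of records.
    Returns the partition's file list after the (possible) merge.
    """
    if len(second_files) < m2:
        return second_files          # second trigger condition not met
    third_file = {}
    for f in second_files:
        for key, records in f.items():
            third_file.setdefault(key, []).extend(records)
    return [third_file]


partition_b1 = [{"a1": [1]}, {"a1": [2], "a3": [7]}, {"a3": [8]}]
print(merge_partition(partition_b1, m2=4))        # unchanged: only 3 files
print(merge_partition(partition_b1, m2=3))        # [{'a1': [1, 2], 'a3': [7, 8]}]
```

Only the files of one partition are passed in, which reflects the point that partitions are merged independently of each other.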
Step 207: when a data query request is received, obtain the key value of the data to be queried from the data query request, determine the partition to be queried in which the data are located according to that key value, and scan all files in the partition to be queried according to the key value to obtain the data to be queried corresponding to that key value.
During a data query, the partition in which the data to be queried are located is first determined from the key value of the data to be queried; the data files in that partition are then filtered and scanned, and the qualifying results are returned. Specifically, when a data query request is received, the key value information of the data to be queried is parsed out, the hash value corresponding to that key value is calculated, and the partition to be queried is determined from the hash value. All files in the partition to be queried are then scanned according to the key value, and the data corresponding to the key value are obtained and returned.
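The query path can be sketched in a few lines. The names `query`, `partition_of` and `partition_files` are hypothetical, files are modelled as dictionaries, and the key-to-partition placement is hard-coded in place of the hash calculation.

```python
def query(key, partition_of, partition_files):
    """Return all records for key, scanning only the files of its partition."""
    pid = partition_of(key)
    hits = []
    for f in partition_files.get(pid, []):
        hits.extend(f.get(key, []))
    return hits


# Hypothetical layout: two partitions, files modelled as dicts.
partition_files = {
    "B1": [{"a1": ["x"], "a3": ["y"]}, {"a1": ["z"]}],
    "B2": [{"a2": ["w"]}],
}
placement = {"a1": "B1", "a3": "B1", "a2": "B2"}
print(query("a1", placement.get, partition_files))   # ['x', 'z']
```

The files of partition B2 are never touched when a1 is queried, which is where the reduction in scanning comes from.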
The method provided by this embodiment can also improve query efficiency. By contrast, the earlier query scanning approach has to check every file on disk one by one to determine whether the data to be queried are in that data file. This makes queries inefficient, requires frequent reads from disk, increases the I/O overhead of the system, and occupies excessive CPU and memory, wasting resources. Assuming that the data are evenly distributed, the data volume is 100G and the query depth is 1000, if the node is divided into four partitions, the query efficiency of partitioned queries can be improved by 25%.
In the method provided by this embodiment, the interval within the node is partitioned so that the granularity of data storage is reduced; one round of merging distributes the data into different partitions, and when a partition meets the merging condition, the files in that partition are merged so that the data having the same key value in the partition are merged into one file, which raises the degree of aggregation of the files in the partition. During a query, the partition holding the data to be queried is first determined from the key value, the data files in that partition are filtered and scanned, and the qualifying results are returned. Because a partition contains few files and the data in each file are highly aggregated, the number of lookups is reduced and query efficiency is improved.
Fig. 7 is a structural schematic diagram of embodiment one of the file merging device of the present invention. The file merging device provided by this embodiment can be integrated on each node of the cluster system, or can of course be provided independently. As shown in Fig. 7, the file merging device provided by this embodiment comprises: a partitioning module 31, a file reading module 32, a partition determining module 33 and a file merging module 34.
The partitioning module 31 is configured to partition the interval for which each node in the cluster system is responsible according to the key value information of the user data, where each partition of each node corresponds one to one with the key values of the user data;
The file reading module 32 is configured to, for each node, determine that the node meets the first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not been merged, each first file stores data corresponding to at least one user, and the key value of each user's data is different;
The partition determining module 33 is configured to determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file;
The file merging module 34 is configured to merge the data having the same key value in the first files, and to store the merged data corresponding to each key value in the corresponding target partition.
In this embodiment, the partition determining module 33 is specifically configured to: calculate the hash value corresponding to each key value according to the key value of the data corresponding to each user; and determine the target partition to which each key value belongs according to the hash value corresponding to that key value.
The file merging device provided by this embodiment can be used to execute the technical solution provided by method embodiment one; the specific implementation and technical effects are similar and are not repeated here.
Fig. 8 is a structural schematic diagram of embodiment two of the file merging device of the present invention. The file merging device provided by this embodiment can be integrated on each node of the cluster system, or can of course be provided independently. As shown in Fig. 8, the file merging device provided by this embodiment comprises: a partitioning module 41, a file reading module 42, a partition determining module 43 and a file merging module 44.
The partitioning module 41 is configured to partition the interval for which each node in the cluster system is responsible according to the key value information of the user data, where each partition of each node corresponds one to one with the key values of the user data;
The file reading module 42 is configured to, for each node, determine that the node meets the first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not been merged, each first file stores data corresponding to at least one user, and the key value of each user's data is different;
The partition determining module 43 is configured to determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file;
The file merging module 44 is configured to merge the data having the same key value in the first files, and to store the merged data corresponding to each key value in the corresponding target partition.
In this embodiment, the partitioning module 41 specifically partitions each node of the cluster system as follows: if the interval for which the cluster system is responsible is (min, min+2^127], that interval is split by 2^N to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval for which the cluster system is responsible, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1. For each node, if the interval for which the node is responsible is (r1, r2], the node is partitioned according to the partition step S to obtain |r1 - r2|/S intervals. Each pair of adjacent boundaries forms a left-open, right-closed interval (rn, rn+1], where 0<n<|r1 - r2|/S - 1 and n is a positive integer. Each interval is one partition of the node. r1 is the minimum hash value of the interval for which the node is responsible and r2 is the maximum hash value; r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
The file reading module 42 is specifically configured to: judge whether the number of first files stored on the node has reached the first preset file merging number M1, where M1 is a positive integer greater than or equal to 2; and if so, determine that the node meets the first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node. Alternatively, the file reading module 42 is specifically configured to: judge whether the number of first files stored on the node has reached the first preset file merging number M1, where M1 is a positive integer greater than or equal to 2; if so, determine according to the size of each first file whether each first file meets the first merging condition; and if so, determine that the node meets the first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node.
The partition determining module 43 is specifically configured to: calculate the hash value corresponding to each key value according to the key value of the data corresponding to each user; and determine the target partition to which each key value belongs according to the hash value corresponding to that key value.
For each node, after one round of merging the files are distributed to the partitions of the node. In this embodiment, when a partition meets the merging condition, the files in that partition are further merged so that the files in each partition are highly aggregated. Therefore, in this embodiment, the file reading module 42 is further configured to: determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the key value of each user's data is different. Correspondingly, the file merging module 44 is further configured to merge the data having the same key value in the second files according to the key value of the data corresponding to each user, and to store the merged data corresponding to each key value in a third file of the target partition.
In this embodiment, the file reading module 42 specifically determines whether the target partition meets the second trigger condition in one of the following two ways. In the first, it judges whether the number of second files stored on the target partition has reached the second preset file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, it determines that the target partition meets the second trigger condition and reads at least two second files in the target partition from the disk of the node where the target partition is located into the cache. In the second, it judges whether the number of second files stored on the target partition has reached the second preset file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, it determines according to the size of each second file whether each second file meets the second merging condition; if the condition is met, it determines that the target partition meets the second trigger condition and reads at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
Further, the Piece file mergence device of the present embodiment also comprises: receiver module 45, key assignments acquisition module 46 and enquiry module 47, and receiver module 45, for receiving data query request; Key assignments acquisition module 46, when receiving data query request when receiver module, according to the key assignments of data query acquisition request data to be checked; Subregion determination module 43 is also for the subregion to be checked of determining data to be checked place according to the key assignments of data to be checked; Enquiry module 47, for scanning the All Files in subregion to be checked according to the key assignments of data to be checked, obtains the data to be checked that the key-value pair of data to be checked is answered.
In the present embodiment, subregion determination module 43 is determined the subregion to be checked at data to be checked place in the following manner: first, calculate the cryptographic hash that the key-value pair of data to be checked is answered, then, the cryptographic hash of answering according to the key-value pair of data to be checked is determined the subregion to be checked at data to be checked place.
The file merging device provided by this embodiment can be used to carry out the technical solution of method Embodiment 2; its implementation and technical effects are similar and are not repeated here.
Fig. 9 is a schematic structural diagram of Embodiment 3 of the file merging device of the present invention. The file merging device 500 provided by this embodiment includes a processor 51, a memory 52, and a receiver 53. The memory 52 and the receiver 53 are connected to the processor 51 through a bus. The memory 52 stores execution instructions; when the file merging device 500 runs, the processor 51 communicates with the memory 52, and the processor 51 executes the instructions so that the file merging device 500 performs the following operations:
Partitioning, according to key information of user data, the interval for which each node in the cluster system is responsible, where each partition of each node is in one-to-one correspondence with keys of the user data;
For each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not previously been merged, each first file stores data corresponding to at least one user, and the keys of the data corresponding to different users are different;
Determining, according to the key of the data corresponding to each user in each first file, the target partition to which each key belongs;
Merging, according to each key, the data having the same key in the first files, and storing the merged data corresponding to each key in the corresponding target partition.
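The merge step in the operations above can be sketched in a few lines. The sketch assumes, purely for illustration, that each first file is already loaded into the cache as a dictionary mapping a key to a list of records, and that `partition_of` is a caller-supplied function returning the target partition of a key; both the in-memory layout and the name `merge_first_files` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def merge_first_files(
    first_files: List[Dict[str, list]],
    partition_of: Callable[[str], Hashable],
) -> Dict[Hashable, Dict[str, list]]:
    """Group records with the same key and bucket them by target partition."""
    merged: Dict[Hashable, Dict[str, list]] = defaultdict(lambda: defaultdict(list))
    for first_file in first_files:
        for key, records in first_file.items():
            # Records that share a key end up together in that key's partition.
            merged[partition_of(key)][key].extend(records)
    # Convert nested defaultdicts to plain dicts for the caller.
    return {partition: dict(keys) for partition, keys in merged.items()}
```

After the merge, all data for a given key sits in exactly one partition, so a later lookup only needs to scan the files of that partition.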
Determining the target partition to which each key belongs according to the key of the data corresponding to each user in each first file specifically includes: calculating, from the key of the data corresponding to each user, the hash value corresponding to each key; and determining, according to the hash value corresponding to each key, the target partition to which that key belongs.
The processor 51 is further configured to perform the following operations:
Determining that the target partition meets a second trigger condition, and reading at least two second files of the target partition from the disk of the node where the target partition resides into the cache, where each second file stores data corresponding to at least one user, and the keys of the data corresponding to different users are different;
Merging, according to the key of the data corresponding to each user, the data having the same key in the second files, and storing the merged data having the same key in a third file of the target partition.
Determining, according to the key of the data to be queried, the partition to be queried in which that data resides includes: calculating the hash value corresponding to the key of the data to be queried; and determining, according to that hash value, the partition to be queried in which the data resides.
In this embodiment, determining whether the node meets the first trigger condition includes: determining whether the number of first files stored on the node reaches a preset first file-merge count M1, where M1 is a positive integer greater than or equal to 2; and if so, determining that the node meets the first trigger condition.
Alternatively, it includes: determining whether the number of first files stored on the node reaches the preset first file-merge count M1, where M1 is a positive integer greater than or equal to 2; if so, determining, according to the size of each first file, whether each first file meets a first merge condition; and if so, determining that the node meets the first trigger condition.
In this embodiment, determining that the target partition meets the second trigger condition includes: determining whether the number of second files stored in the target partition reaches a preset second file-merge count M2, where M2 is a positive integer greater than or equal to 2; and if so, determining that the target partition meets the second trigger condition. Alternatively, it includes: determining whether the number of second files stored in the target partition reaches the preset second file-merge count M2, where M2 is a positive integer greater than or equal to 2; if so, determining, according to the size of each second file, whether each second file meets a second merge condition; and if the condition is met, determining that the target partition meets the second trigger condition.
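Both trigger conditions follow the same pattern: a file-count threshold, optionally followed by a size-based merge condition. A minimal sketch is given below. The passage does not spell out the size-based condition, so the check used here (every file is no larger than a `max_size` threshold) is an assumption made only for illustration, as are the function and parameter names.

```python
from typing import Optional, Sequence


def meets_trigger(
    file_sizes: Sequence[int],
    merge_count: int,
    max_size: Optional[int] = None,
) -> bool:
    """Return True if a node (or partition) should start a merge.

    merge_count plays the role of M1 or M2 and must be at least 2.
    max_size is a hypothetical stand-in for the size-based merge condition.
    """
    if merge_count < 2:
        raise ValueError("merge_count must be an integer >= 2")
    # First check: enough files have accumulated.
    if len(file_sizes) < merge_count:
        return False
    # Variant 1: the count alone is sufficient.
    if max_size is None:
        return True
    # Variant 2: every file must also satisfy the (assumed) size condition.
    return all(size <= max_size for size in file_sizes)
```

Passing `max_size=None` corresponds to the count-only variant; passing a threshold corresponds to the variant that also inspects file sizes.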
In this embodiment, partitioning the interval for which each node in the cluster system is responsible according to the key information of the user data includes:
If the interval for which the cluster system is responsible is (min, min+2^127], dividing that interval into 2^N parts to obtain a partition step S, where S = 2^127/2^N, min is the minimum hash value of the interval for which the cluster system is responsible, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
For each node, if the interval for which the node is responsible is (r1, r2], partitioning the node according to the partition step S to obtain |r1 - r2|/S intervals, where every two sequentially adjacent values form a left-open, right-closed interval (r_n, r_{n+1}], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval for which the node is responsible, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
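The arithmetic of the partitioning scheme can be illustrated with a short sketch. It assumes that the length of a node's interval is an exact multiple of the step S, which the text implies but does not state; the function names are hypothetical.

```python
from typing import List, Tuple


def partition_step(n: int) -> int:
    """S = 2^127 / 2^N for a partition factor N >= 1."""
    if n < 1:
        raise ValueError("partition factor N must be >= 1")
    return 2**127 // 2**n


def node_partitions(r1: int, r2: int, step: int) -> List[Tuple[int, int]]:
    """Split a node's interval (r1, r2] into |r1 - r2| / S partitions.

    Each returned pair (lo, hi) stands for the left-open, right-closed
    interval (lo, hi].
    """
    if r2 <= r1:
        raise ValueError("r2 must be greater than r1")
    count = abs(r1 - r2) // step
    return [(r1 + i * step, r1 + (i + 1) * step) for i in range(count)]
```

For example, with N = 2 the step is 2^125, so a node responsible for (0, 2^126] is split into the two partitions (0, 2^125] and (2^125, 2^126].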
In this embodiment, the receiver 53 is configured to receive a data query request, and the processor 51 is further configured to: when a data query request is received, obtain the key of the data to be queried according to the data query request; determine, according to that key, the partition to be queried in which the data resides; and scan all files in the partition to be queried according to the key to obtain the data corresponding to the key.
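The query path described above can be sketched as follows. As an illustrative assumption, each partition is modelled as a list of files and each file as a dictionary from key to records; `partition_of` is a caller-supplied mapping from a key to its partition (for example, one based on the key's hash value). None of these names come from the text.

```python
from typing import Callable, Dict, Hashable, List


def query(
    key: str,
    partitions: Dict[Hashable, List[Dict[str, list]]],
    partition_of: Callable[[str], Hashable],
) -> list:
    """Collect all records for a key by scanning only its partition."""
    results: list = []
    # Only the files of the partition that owns the key are scanned.
    for stored_file in partitions.get(partition_of(key), []):
        results.extend(stored_file.get(key, []))
    return results
```

Because merged data for a key is always stored in the partition that owns the key, the scan is limited to that partition's files rather than every file on the node.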
The file merging device provided by this embodiment can be used to carry out the methods of method Embodiment 1 and Embodiment 2; its implementation and technical effects are similar and are not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (20)

1. A file merging method, characterized by comprising:
Partitioning, according to key information of user data, the interval for which each node in a cluster system is responsible, where each partition of each node is in one-to-one correspondence with keys of the user data;
For each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not previously been merged, each first file stores data corresponding to at least one user, and the keys of the data corresponding to different users are different;
Determining, according to the key of the data corresponding to each user in each first file, the target partition to which each key belongs;
Merging, according to each key, the data having the same key in the first files, and storing the merged data corresponding to each key in the corresponding target partition.
2. The method according to claim 1, characterized in that determining, according to the key of the data corresponding to each user in each first file, the target partition to which each key belongs comprises:
Calculating, according to the key of the data corresponding to each user, the hash value corresponding to each key;
Determining, according to the hash value corresponding to each key, the target partition to which each key belongs.
3. The method according to claim 1, characterized in that, after merging, according to each key, the data having the same key in the target files and storing the merged data corresponding to each key in the corresponding target partition, the method further comprises:
Determining that the target partition meets a second trigger condition, and reading at least two second files of the target partition from the disk of the node where the target partition resides into the cache, where each second file stores data corresponding to at least one user, and the keys of the data corresponding to different users are different;
Merging, according to the key of the data corresponding to each user, the data having the same key in the second files, and storing the merged data corresponding to each key in a third file of the target partition.
4. The method according to any one of claims 1 to 3, characterized in that, after merging, according to each key, the data having the same key in the target files and storing the merged data corresponding to each key in the corresponding target partition, the method further comprises:
When a data query request is received, obtaining the key of the data to be queried according to the data query request;
Determining, according to the key of the data to be queried, the partition to be queried in which the data to be queried resides, and scanning all files in the partition to be queried according to the key of the data to be queried to obtain the data corresponding to that key.
5. The method according to claim 4, characterized in that determining, according to the key of the data to be queried, the partition to be queried in which the data to be queried resides comprises:
Calculating the hash value corresponding to the key of the data to be queried;
Determining, according to the hash value corresponding to the key of the data to be queried, the partition to be queried in which the data to be queried resides.
6. The method according to claim 1, characterized in that determining that the node meets the first trigger condition comprises:
Determining whether the number of first files stored on the node reaches a preset first file-merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determining that the node meets the first trigger condition.
7. The method according to claim 1, characterized in that determining that the node meets the first trigger condition comprises:
Determining whether the number of first files stored on the node reaches a preset first file-merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determining, according to the size of each first file, whether each first file meets a first merge condition;
If so, determining that the node meets the first trigger condition.
8. The method according to claim 3, characterized in that determining that the target partition meets the second trigger condition comprises:
Determining whether the number of second files stored in the target partition reaches a preset second file-merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determining that the target partition meets the second trigger condition.
9. The method according to claim 3, characterized in that determining that the target partition meets the second trigger condition comprises:
Determining whether the number of second files stored in the target partition reaches a preset second file-merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determining, according to the size of each second file, whether each second file meets a second merge condition;
If the condition is met, determining that the target partition meets the second trigger condition.
10. The method according to claim 1, characterized in that partitioning, according to the key information of the user data, the interval for which each node in the cluster system is responsible comprises:
If the interval for which the cluster system is responsible is (min, min+2^127], dividing that interval into 2^N parts to obtain a partition step S, where S = 2^127/2^N, min is the minimum hash value of the interval for which the cluster system is responsible, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
For each node, if the interval for which the node is responsible is (r1, r2], partitioning the node according to the partition step S to obtain |r1 - r2|/S intervals, where every two sequentially adjacent values form a left-open, right-closed interval (r_n, r_{n+1}], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval for which the node is responsible, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
11. A file merging device, characterized by comprising:
A partitioning module, configured to partition, according to key information of user data, the interval for which each node in a cluster system is responsible, where each partition of each node is in one-to-one correspondence with keys of the user data;
A file reading module, configured to, for each node, determine that the node meets a first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not previously been merged, each first file stores data corresponding to at least one user, and the keys of the data corresponding to different users are different;
A partition determination module, configured to determine, according to the key of the data corresponding to each user in each first file, the target partition to which each key belongs;
A file merging module, configured to merge, according to each key, the data having the same key in the first files, and to store the merged data corresponding to each key in the corresponding target partition.
12. The device according to claim 11, characterized in that the partition determination module is specifically configured to:
Calculate, according to the key of the data corresponding to each user, the hash value corresponding to each key;
Determine, according to the hash value corresponding to each key, the target partition to which each key belongs.
13. The device according to claim 11, characterized in that the reading module is further configured to:
Determine that the target partition meets a second trigger condition, and read at least two second files of the target partition from the disk of the node where the target partition resides into the cache, where each second file stores data corresponding to at least one user, and the keys of the data corresponding to different users are different;
The file merging module is further configured to merge, according to the key of the data corresponding to each user, the data having the same key in the second files, and to store the merged data corresponding to each key in a third file of the target partition.
14. The device according to any one of claims 11 to 13, characterized by further comprising a receiving module, a key acquisition module, and a query module;
The receiving module is configured to receive a data query request;
The key acquisition module is configured to obtain, when the receiving module receives a data query request, the key of the data to be queried according to the data query request;
The partition determination module is further configured to determine, according to the key of the data to be queried, the partition to be queried in which the data to be queried resides;
The query module is configured to scan all files in the partition to be queried according to the key of the data to be queried, and to obtain the data corresponding to that key.
15. The device according to claim 14, characterized in that the partition determination module is specifically configured to:
Calculate the hash value corresponding to the key of the data to be queried;
Determine, according to the hash value corresponding to the key of the data to be queried, the partition to be queried in which the data to be queried resides.
16. The device according to claim 11, characterized in that the file reading module is specifically configured to:
Determine whether the number of first files stored on the node reaches a preset first file-merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
17. The device according to claim 11, characterized in that the file reading module is specifically configured to:
Determine whether the number of first files stored on the node reaches a preset first file-merge count M1, where M1 is a positive integer greater than or equal to 2;
If so, determine, according to the size of each first file, whether each first file meets a first merge condition;
If so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
18. The device according to claim 13, characterized in that the file reading module is specifically configured to:
Determine whether the number of second files stored in the target partition reaches a preset second file-merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determine that the target partition meets the second trigger condition, and read at least two second files of the target partition from the disk of the node where the target partition resides into the cache.
19. The device according to claim 13, characterized in that the file reading module is specifically configured to:
Determine whether the number of second files stored in the target partition reaches a preset second file-merge count M2, where M2 is a positive integer greater than or equal to 2;
If so, determine, according to the size of each second file, whether each second file meets a second merge condition;
If the condition is met, determine that the target partition meets the second trigger condition, and read at least two second files of the target partition from the disk of the node where the target partition resides into the cache.
20. The device according to claim 11, characterized in that the partitioning module is specifically configured to:
If the interval for which the cluster system is responsible is (min, min+2^127], divide that interval into 2^N parts to obtain a partition step S, where S = 2^127/2^N, min is the minimum hash value of the interval for which the cluster system is responsible, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
For each node, if the interval for which the node is responsible is (r1, r2], partition the node according to the partition step S to obtain |r1 - r2|/S intervals, where every two sequentially adjacent values form a left-open, right-closed interval (r_n, r_{n+1}], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval for which the node is responsible, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
CN201310561317.8A 2013-11-12 2013-11-12 file merging method and device Active CN103593436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310561317.8A CN103593436B (en) 2013-11-12 2013-11-12 file merging method and device

Publications (2)

Publication Number Publication Date
CN103593436A true CN103593436A (en) 2014-02-19
CN103593436B CN103593436B (en) 2017-02-08

Family

ID=50083577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310561317.8A Active CN103593436B (en) 2013-11-12 2013-11-12 file merging method and device

Country Status (1)

Country Link
CN (1) CN103593436B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104360824A (en) * 2014-11-10 2015-02-18 北京奇虎科技有限公司 Data merging method and device
CN105159915A (en) * 2015-07-16 2015-12-16 中国科学院计算技术研究所 Dynamically adaptive LSM (Log-structured merge) tree combination method and system
WO2017005094A1 (en) * 2015-07-03 2017-01-12 阿里巴巴集团控股有限公司 Data query method and device
CN106446039A (en) * 2016-08-30 2017-02-22 北京航空航天大学 Aggregation type big data search method and device
CN106599247A (en) * 2016-12-19 2017-04-26 北京奇虎科技有限公司 Method and device for merging data file in LSM-tree structure
CN106708968A (en) * 2016-12-01 2017-05-24 成都华为技术有限公司 Distributed database system and data processing method in distributed database system
CN106777230A (en) * 2016-12-26 2017-05-31 东软集团股份有限公司 A kind of partition system, partition method and device
CN106776811A (en) * 2016-11-23 2017-05-31 李天� data index method and device
CN107357921A (en) * 2017-07-21 2017-11-17 北京奇艺世纪科技有限公司 A kind of small documents storage localization method and system
CN107391541A (en) * 2017-05-16 2017-11-24 阿里巴巴集团控股有限公司 A kind of real time data merging method and device
CN107861959A (en) * 2016-09-22 2018-03-30 阿里巴巴集团控股有限公司 Data processing method, apparatus and system
WO2018077092A1 (en) * 2016-10-31 2018-05-03 中兴通讯股份有限公司 Saving method applied to distributed file system, apparatus and distributed file system
CN108628542A (en) * 2017-03-22 2018-10-09 华为技术有限公司 A kind of Piece file mergence method and controller
CN110019092A (en) * 2017-12-27 2019-07-16 杭州华为数字技术有限公司 Method, controller and the system of data storage
WO2019179449A1 (en) * 2018-03-22 2019-09-26 中国银联股份有限公司 Method and apparatus for combining regions of hbase table, and computer device
CN110321349A (en) * 2019-06-13 2019-10-11 暨南大学 A kind of self-adapting data of data-oriented origin system merges storage method
CN110399545A (en) * 2018-04-20 2019-11-01 伊姆西Ip控股有限责任公司 The method and apparatus of management document index
WO2020034818A1 (en) * 2018-08-14 2020-02-20 华为技术有限公司 Partition merging method and database server
CN110825794A (en) * 2018-08-14 2020-02-21 华为技术有限公司 Partition merging method and database server
CN110888837A (en) * 2019-11-15 2020-03-17 星辰天合(北京)数据科技有限公司 Object storage small file merging method and device
CN113342813A (en) * 2021-06-09 2021-09-03 南京冰鉴信息科技有限公司 Key value data processing method and device, computer equipment and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046638B (en) * 2018-12-29 2023-06-23 创新先进技术有限公司 Method, device and equipment for fusing data among multiple platforms

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101605028A (en) * 2009-02-17 2009-12-16 北京安天电子设备有限公司 A kind of combining log records method and system
CN102905311B (en) * 2012-09-29 2015-07-15 北京傲天动联技术股份有限公司 Data-message aggregating device and method
CN102968503B (en) * 2012-12-10 2015-10-07 曙光信息产业(北京)有限公司 The data processing method of Database Systems and Database Systems

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104360824B (en) * 2014-11-10 2017-12-12 北京奇虎科技有限公司 The method and apparatus that a kind of data merge
CN104360824A (en) * 2014-11-10 2015-02-18 北京奇虎科技有限公司 Data merging method and device
WO2017005094A1 (en) * 2015-07-03 2017-01-12 阿里巴巴集团控股有限公司 Data query method and device
CN105159915A (en) * 2015-07-16 2015-12-16 中国科学院计算技术研究所 Dynamically adaptive LSM (Log-structured merge) tree combination method and system
CN106446039A (en) * 2016-08-30 2017-02-22 北京航空航天大学 Aggregation type big data search method and device
CN106446039B (en) * 2016-08-30 2020-07-21 北京航空航天大学 Aggregation type big data query method and device
CN107861959A (en) * 2016-09-22 2018-03-30 阿里巴巴集团控股有限公司 Data processing method, apparatus and system
WO2018077092A1 (en) * 2016-10-31 2018-05-03 中兴通讯股份有限公司 Saving method applied to distributed file system, apparatus and distributed file system
CN106776811A (en) * 2016-11-23 2017-05-31 李天� data index method and device
CN106708968A (en) * 2016-12-01 2017-05-24 成都华为技术有限公司 Distributed database system and data processing method in distributed database system
CN106708968B (en) * 2016-12-01 2019-11-26 成都华为技术有限公司 Data processing method in distributed data base system and distributed data base system
CN106599247B (en) * 2016-12-19 2020-04-17 北京奇虎科技有限公司 Method and device for merging data files in LSM-tree structure
CN106599247A (en) * 2016-12-19 2017-04-26 北京奇虎科技有限公司 Method and device for merging data file in LSM-tree structure
CN106777230A (en) * 2016-12-26 2017-05-31 东软集团股份有限公司 A kind of partition system, partition method and device
CN106777230B (en) * 2016-12-26 2020-01-07 东软集团股份有限公司 Partition system, partition method and device
CN108628542A (en) * 2017-03-22 2018-10-09 华为技术有限公司 A kind of Piece file mergence method and controller
US11403021B2 (en) 2017-03-22 2022-08-02 Huawei Technologies Co., Ltd. File merging method and controller
CN108628542B (en) * 2017-03-22 2021-08-03 华为技术有限公司 File merging method and controller
CN107391541A (en) * 2017-05-16 2017-11-24 阿里巴巴集团控股有限公司 A kind of real time data merging method and device
CN107391541B (en) * 2017-05-16 2020-10-20 创新先进技术有限公司 Real-time data merging method and device
CN107357921A (en) * 2017-07-21 2017-11-17 北京奇艺世纪科技有限公司 A kind of small documents storage localization method and system
CN110019092B (en) * 2017-12-27 2021-07-09 华为技术有限公司 Data storage method, controller and system
CN110019092A (en) * 2017-12-27 2019-07-16 杭州华为数字技术有限公司 Method, controller and the system of data storage
US11372822B2 (en) 2018-03-22 2022-06-28 China Unionpay Co., Ltd. Method, device, and computer apparatus for merging regions of HBase table
WO2019179449A1 (en) * 2018-03-22 2019-09-26 中国银联股份有限公司 Method and apparatus for combining regions of hbase table, and computer device
CN110399545A (en) * 2018-04-20 2019-11-01 伊姆西Ip控股有限责任公司 The method and apparatus of management document index
CN110399545B (en) * 2018-04-20 2023-06-02 伊姆西Ip控股有限责任公司 Method and apparatus for managing document index
WO2020034818A1 (en) * 2018-08-14 2020-02-20 华为技术有限公司 Partition merging method and database server
CN110825794B (en) * 2018-08-14 2022-03-29 华为云计算技术有限公司 Partition merging method and database server
CN110825794A (en) * 2018-08-14 2020-02-21 华为技术有限公司 Partition merging method and database server
US11762881B2 (en) 2018-08-14 2023-09-19 Huawei Cloud Computing Technologies Co., Ltd. Partition merging method and database server
CN110321349A (en) * 2019-06-13 2019-10-11 暨南大学 A kind of self-adapting data of data-oriented origin system merges storage method
CN110888837A (en) * 2019-11-15 2020-03-17 星辰天合(北京)数据科技有限公司 Object storage small file merging method and device
CN113342813A (en) * 2021-06-09 2021-09-03 南京冰鉴信息科技有限公司 Key value data processing method and device, computer equipment and readable storage medium
CN113342813B (en) * 2021-06-09 2024-01-26 南京冰鉴信息科技有限公司 Key value data processing method, device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN103593436B (en) 2017-02-08

Similar Documents

Publication Publication Date Title
CN103593436A (en) File merging method and device
CN109196459B (en) Decentralized distributed heterogeneous storage system data distribution method
CN102332029B (en) Hadoop-based mass classifiable small file association storage method
US10048872B2 (en) Control of storage of data in a hybrid storage system
CN101944124B (en) Distributed file system management method, device and corresponding file system
CN104115133B (en) Method, system and device for data migration for a composite non-volatile storage device
CN104899297B (en) Method for creating a storage-aware hybrid index
CN109522428B (en) External memory access method of graph computing system based on index positioning
CN109240607B (en) File reading method and device
CN106911743B (en) Small file write aggregation and read aggregation method, system and client
CN103500089A (en) Small file storage system suitable for Mapreduce calculation model
CN103473314A (en) Key value pair storing method and device based on shared memory
CN104123237A (en) Hierarchical storage method and system for massive small files
CN110297810B (en) Stream data processing method and device and electronic equipment
CN104199899A (en) Method and device for storing massive pictures based on Hbase
CN105138282A (en) Storage space recycling method and storage system
CN107704633A (en) Method and system for file migration
CN106599091A (en) Storage and indexing method of RDF graph structures stored based on key values
CN102253985B (en) File system data management method and system
CN102970349B (en) Storage load balancing method for DHT network
CN104158902A (en) Method and device of distributing Hbase data blocks based on number of requests
CN104883394A (en) Method and system for server load balancing
CN106775450B (en) Data distribution method in a hybrid storage system
CN112817982B (en) Dynamic power law graph storage method based on LSM tree
CN104537023A (en) Storage method and device for reverse index records

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant