CN103593436B - file merging method and device - Google Patents

File merging method and device

Info

Publication number
CN103593436B
Authority
CN
China
Prior art keywords
file
data
key value
node
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310561317.8A
Other languages
Chinese (zh)
Other versions
CN103593436A (en)
Inventor
包海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201310561317.8A
Publication of CN103593436A
Application granted
Publication of CN103593436B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS

Abstract

Embodiments of the invention provide a file merging method and device. The file merging method includes: partitioning the ranges that the nodes in a cluster system are responsible for according to key value information of user data; for each node, determining that the node meets a first trigger condition and reading at least two first files from the disk of the node into the cache corresponding to the node; determining, according to the key values of the data corresponding to the users stored in the first files, the target partition to which each key value belongs; merging the data that has the same key value; and storing the merged data corresponding to each key value in the corresponding target partition. Because the range of each node is partitioned, a single merge is enough to place data with the same key value in the same partition. During a query, the partition where the data is located is determined from the key value, and only the files in that partition are scanned. Since each partition holds few files, data screening is carried out on fewer files, which improves read performance.

Description

File merging method and apparatus
Technical field
Embodiments of the present invention relate to data communication technology, and in particular to a file merging method and apparatus.
Background
As the Internet continues to develop, Internet applications keep growing in scale, and the data storage these applications rely on faces ever greater challenges. Traditional relational databases can no longer meet the demands of mass data storage, which gave rise to non-relational (NoSQL) databases such as BigTable, developed by Google, and Cassandra, from Facebook. A non-relational database is generally a distributed system whose data is spread across its nodes. At present most non-relational databases are implemented with consistent hashing: all hash values of the hash function form a ring joined end to end (the maximum value is adjacent to the minimum value), and each node in the non-relational database cluster is responsible for one part of the ring. The data to be stored is hashed in the same way, so the node responsible for storing a piece of data can be found from its hash value, which establishes the correspondence between stored data and nodes.
As for the physical storage structure on each node, a traditional relational database uses fixed blocks in which data can be read and written repeatedly. To guarantee concurrent write performance, a non-relational database uses a random-write mode on disk in which one logical data file is the minimum storage unit; old data is not deleted, and the latest data is identified by comparing timestamps. This persistence approach, which differs from that of relational databases, is used by many current non-relational databases. Fig. 1 is a schematic diagram of data storage in a non-relational database. As shown in Fig. 1, when data is written, it is first written to a memory table. When the memory table is full, its pending data is flushed to disk as a file group, whose output format may be a sorted string table (SSTable, Sorted String Table). Each file group includes a set of files that respectively store the user data, the index information of the file, the hash algorithm of the key values, and static statistics. As shown in Fig. 1, the data in the memory table (to be written) is currently being written to file group n.
The data of one user may be scattered across several files. As the number of files grows, every read must screen data from multiple files and compare the timestamps of matching records before it can decide which record to return to the client. In this scenario, the longer the system runs, the more data files there are, and the read performance of the whole database declines exponentially. Many non-relational database products have therefore proposed and implemented data merging: data with the same key value that is scattered across data files is combined by compare-and-merge (MAP-REDUCE) into one large data file, and repeated merging reduces the number of files and thereby improves the read performance of the database. In the prior art, however, even with file merging the number of files stored on a node remains large, every read must still screen data from multiple files, and the read performance of the whole database is still not high.
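The read cost described in the background can be illustrated with a minimal Python sketch. It assumes each data file is modelled as an in-memory dictionary from key value to a (timestamp, value) pair; the names `DataFile` and `read_latest` are hypothetical and are not taken from the patent or from any database product.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for an on-disk data file: key -> (timestamp, value).
DataFile = Dict[str, Tuple[int, str]]


def read_latest(files: List[DataFile], key: str) -> Optional[str]:
    """Scan every file and return the value with the newest timestamp."""
    best: Optional[Tuple[int, str]] = None
    for data_file in files:
        record = data_file.get(key)
        # Old records are never deleted, so the newest timestamp wins.
        if record is not None and (best is None or record[0] > best[0]):
            best = record
    return None if best is None else best[1]


files = [
    {"userA": (1, "v1"), "userB": (1, "b1")},
    {"userA": (3, "v3")},
    {"userA": (2, "v2")},
]
latest = read_latest(files, "userA")  # every file has to be inspected
```

Every read touches all files on the node, which is why the number of files per lookup matters for read performance.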
Summary of the invention
Embodiments of the present invention provide a file merging method and apparatus that can improve file read performance.
A first aspect of the present invention provides a file merging method, including:
partitioning, according to key value information of user data, the range that each node in a cluster system is responsible for, where each partition of each node corresponds to key values of the user data;
for each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, where the first files have not been merged, each first file stores data corresponding to at least one user, and the data corresponding to different users has different key values;
determining, according to the key value of the data corresponding to each user in the first files, the target partition to which each key value belongs;
merging, by key value, the data in the first files that has the same key value, and storing the merged data corresponding to each key value in the corresponding target partition.
In a first possible implementation of the first aspect of the present invention, the determining, according to the key value of the data corresponding to each user in the first files, the target partition to which each key value belongs includes:
calculating the hash value corresponding to each key value according to the key value of the data corresponding to each user;
determining the target partition to which each key value belongs according to the hash value corresponding to that key value.
In a second possible implementation of the first aspect of the present invention, after the data with the same key value in the first files is merged and the merged data corresponding to each key value is stored in the corresponding target partition, the method further includes:
determining that the target partition meets a second trigger condition, and reading at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the data corresponding to different users has different key values;
merging, according to the key values of the data corresponding to the users, the data in the second files that has the same key value, and storing the merged data corresponding to each key value in a third file of the target partition.
With reference to the first aspect of the present invention and the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, after the data with the same key value in the first files is merged and the merged data corresponding to each key value is stored in the corresponding target partition, the method further includes:
when a data query request is received, obtaining the key value of the data to be queried according to the data query request;
determining, according to the key value of the data to be queried, the partition to be queried in which the data is located, scanning all files in that partition according to the key value of the data to be queried, and obtaining the data corresponding to that key value.
In a fourth possible implementation of the first aspect of the present invention, the determining, according to the key value of the data to be queried, the partition to be queried in which the data is located includes:
calculating the hash value corresponding to the key value of the data to be queried;
determining the partition to be queried in which the data is located according to the hash value corresponding to the key value of the data to be queried.
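The query path of the third and fourth implementations can be sketched as follows. This is an illustrative Python sketch only: files are modelled as dictionaries, the newest timestamp is assumed to win, and `toy_partition_of` is a hypothetical placeholder for the hash-based partition lookup described in the text.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[int, str]      # (timestamp, value)
DataFile = Dict[str, Record]  # hypothetical stand-in for an on-disk file


def query(partition_files: Dict[int, List[DataFile]],
          partition_of: Callable[[str], int],
          key: str) -> Optional[str]:
    """Scan only the files of the partition that the key belongs to."""
    files = partition_files.get(partition_of(key), [])
    newest: Optional[Record] = None
    for data_file in files:
        record = data_file.get(key)
        if record is not None and (newest is None or record[0] > newest[0]):
            newest = record
    return None if newest is None else newest[1]


def toy_partition_of(key: str) -> int:
    """Toy partition function (an assumption): sum of key bytes modulo 2."""
    return sum(key.encode()) % 2


partition_files = {
    0: [{"userA": (1, "a1")}, {"userA": (2, "a2")}],
    1: [{"userB": (5, "b5")}],
}
answer = query(partition_files, toy_partition_of, "userA")
```

Only the files listed under one partition are opened for a lookup, instead of every file on the node.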
In a fifth possible implementation of the first aspect of the present invention, the determining that the node meets a first trigger condition includes:
determining whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2;
if so, determining that the node meets the first trigger condition.
In a sixth possible implementation of the first aspect of the present invention, the determining that the node meets a first trigger condition includes:
determining whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2;
if so, determining, according to the sizes of the first files, whether the first files meet a first merge condition;
if they do, determining that the node meets the first trigger condition.
In a seventh possible implementation of the first aspect of the present invention, the determining that the target partition meets a second trigger condition includes:
determining whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2;
if so, determining that the target partition meets the second trigger condition.
In an eighth possible implementation of the first aspect of the present invention, the determining that the target partition meets a second trigger condition includes:
determining whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2;
if so, determining, according to the sizes of the second files, whether the second files meet a second merge condition;
if they do, determining that the target partition meets the second trigger condition.
In a ninth possible implementation of the first aspect of the present invention, the partitioning, according to key value information of user data, the range that each node in a cluster system is responsible for includes:
if the range that the cluster system is responsible for is (min, min+2^127], splitting that range into 2^N parts to obtain a partition step S, S = 2^127/2^N, where min is the minimum hash value of the range that the cluster system is responsible for, min+2^127 is the maximum hash value of that range, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
for each node, if the range that the node is responsible for is (r1, r2], partitioning the node according to the partition step S to obtain |r1 - r2|/S intervals, where each pair of sequentially adjacent boundary values forms a left-open, right-closed interval (rn, rn+1], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the range that the node is responsible for, r2 is the maximum hash value of that range, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
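The partition step and the per-node partitions can be worked through in a short Python sketch. It assumes that the node range is an exact multiple of S, and the `ring_bits` parameter (127 in the text) is made adjustable only so that the example stays readable; the function names are hypothetical.

```python
from typing import List, Tuple


def partition_step(n: int, ring_bits: int = 127) -> int:
    """Partition step S = 2^ring_bits / 2^N for partition factor N >= 1."""
    if not 1 <= n <= ring_bits:
        raise ValueError("N must be between 1 and ring_bits")
    return 2 ** ring_bits // 2 ** n


def node_partitions(r1: int, r2: int, step: int) -> List[Tuple[int, int]]:
    """Split the node range (r1, r2] into left-open, right-closed partitions."""
    if r2 <= r1:
        raise ValueError("r2 must be greater than r1")
    count = (r2 - r1) // step  # |r1 - r2| / S, assumed to divide exactly
    return [(r1 + i * step, r1 + (i + 1) * step) for i in range(count)]


# Small worked example: a 2^5 ring with N = 3 gives S = 4.
step = partition_step(3, ring_bits=5)
parts = node_partitions(8, 24, step)  # [(8, 12), (12, 16), (16, 20), (20, 24)]
```

With the values from the text, `partition_step(10)` would give S = 2^117, so a larger partition factor N yields narrower partitions and fewer files per partition.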
A second aspect of the present invention provides a file merging apparatus, including:
a partitioning module, configured to partition, according to key value information of user data, the range that each node in a cluster system is responsible for, where each partition of each node corresponds to key values of the user data;
a file reading module, configured to, for each node, determine that the node meets a first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node, where the first files have not been merged, each first file stores data corresponding to at least one user, and the data corresponding to different users has different key values;
a partition determining module, configured to determine, according to the key value of the data corresponding to each user in the first files, the target partition to which each key value belongs;
a file merging module, configured to merge, by key value, the data in the first files that has the same key value, and store the merged data corresponding to each key value in the corresponding target partition.
In a first possible implementation of the second aspect of the present invention, the partition determining module is specifically configured to:
calculate the hash value corresponding to each key value according to the key value of the data corresponding to each user;
determine the target partition to which each key value belongs according to the hash value corresponding to that key value.
In a second possible implementation of the second aspect of the present invention, the file reading module is further configured to:
determine that the target partition meets a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the data corresponding to different users has different key values;
the file merging module is further configured to merge, according to the key values of the data corresponding to the users, the data in the second files that has the same key value, and store the merged data corresponding to each key value in a third file of the target partition.
With reference to the second aspect of the present invention and the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, the apparatus further includes a receiving module, a key value obtaining module, and a query module;
the receiving module is configured to receive a data query request;
the key value obtaining module is configured to, when the receiving module receives a data query request, obtain the key value of the data to be queried according to the data query request;
the partition determining module is further configured to determine, according to the key value of the data to be queried, the partition to be queried in which the data is located;
the query module is configured to scan all files in the partition to be queried according to the key value of the data to be queried, and obtain the data corresponding to that key value.
In a fourth possible implementation of the second aspect of the present invention, the partition determining module is specifically configured to:
calculate the hash value corresponding to the key value of the data to be queried;
determine the partition to be queried in which the data is located according to the hash value corresponding to the key value of the data to be queried.
In a fifth possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
determine whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2;
if so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In a sixth possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
determine whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2;
if so, determine, according to the sizes of the first files, whether the first files meet a first merge condition;
if they do, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In a seventh possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
determine whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2;
if so, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In an eighth possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
determine whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2;
if so, determine, according to the sizes of the second files, whether the second files meet a second merge condition;
if they do, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In a ninth possible implementation of the second aspect of the present invention, the partitioning module is specifically configured to:
if the range that the cluster system is responsible for is (min, min+2^127], split that range into 2^N parts to obtain a partition step S, S = 2^127/2^N, where min is the minimum hash value of the range that the cluster system is responsible for, min+2^127 is the maximum hash value of that range, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
for each node, if the range that the node is responsible for is (r1, r2], partition the node according to the partition step S to obtain |r1 - r2|/S intervals, where each pair of sequentially adjacent boundary values forms a left-open, right-closed interval (rn, rn+1], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the range that the node is responsible for, r2 is the maximum hash value of that range, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
In the file merging method and apparatus of the embodiments of the present invention, the range that each node is responsible for is partitioned. When the first trigger condition is met, the unmerged files on the node are merged once; after this single merge, user data with the same key value is stored in the same partition, which reduces the granularity at which data is stored. During a query, the partition where the data is located is first determined from the key value, and the required data is then looked up by scanning the data files of that partition. Because a partition contains few files, data only needs to be screened from a small number of files, which improves read performance.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of data storage in a non-relational database;
Fig. 2 is a schematic diagram of a cluster system;
Fig. 3 is a schematic diagram of file merging in the prior art;
Fig. 4 is a flowchart of Embodiment 1 of the file merging method of the present invention;
Fig. 5 is a flowchart of Embodiment 2 of the file merging method of the present invention;
Fig. 6 is a schematic diagram of node partitioning applicable to this embodiment;
Fig. 7 is a schematic structural diagram of Embodiment 1 of the file merging apparatus of the present invention;
Fig. 8 is a schematic structural diagram of Embodiment 2 of the file merging apparatus of the present invention;
Fig. 9 is a schematic structural diagram of Embodiment 3 of the file merging apparatus of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Before the embodiments of the present invention are introduced, the scenario to which they apply is briefly described. The embodiments are mainly applicable to non-relational databases, which can be implemented with consistent hashing. In consistent hashing, all hash values of the hash function form a hash ring joined end to end: the first position of the ring is the maximum value and the last is the minimum value, that is, the maximum is joined to the minimum. All hash values on the ring make up the range of the non-relational database cluster system, and each node in the cluster is responsible for part of the ring, so the hash value of any data stored on a node must fall within the range that the node is responsible for. The data to be stored is hashed in the same way, so the node responsible for storing a piece of data can be found from its hash value, which establishes the correspondence between stored data and nodes.
Fig. 2 is a schematic diagram of a cluster system. As shown in Fig. 2, the cluster system has four nodes. The four large circles distributed on the hash ring in the figure, which are the circles pointed to by the dashed arrows, represent the four nodes: node 1, node 2, node 3, and node 4. Each node corresponds to one cache and one disk, and the small circles between the large circles represent user data. The range of hash values that the cluster system is responsible for is (0, 2^32]. Going clockwise, the range between node 4 and node 1 is the range that node 1 is responsible for, the range between node 1 and node 2 is the range that node 2 is responsible for, the range between node 2 and node 3 is the range that node 3 is responsible for, and the range between node 3 and node 4 is the range that node 4 is responsible for. The hash values in the ranges of different nodes do not overlap. When a node receives data, it first obtains the key value information of the user to which the data corresponds, calculates the hash value corresponding to the key value, and determines which node's range that hash value falls in; the user data is then stored on that node. Fig. 2 is only an example, and a cluster system may have more nodes.
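The node lookup on the ring of Fig. 2 can be sketched in Python as follows. The node positions (tokens) and the use of MD5 as the hash function are assumptions made for illustration; the text only specifies the range (0, 2^32] and that each node owns the range between its predecessor and itself.

```python
import bisect
import hashlib
from typing import List, Tuple

RING_SIZE = 2 ** 32  # the cluster range in the example is (0, 2^32]


def ring_hash(key: str) -> int:
    """Map a key value into (0, 2^32]; MD5 is an arbitrary choice here."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h if h != 0 else RING_SIZE


def owner(tokens: List[Tuple[int, str]], h: int) -> str:
    """tokens: sorted (position, node) pairs; a node owns (previous, position]."""
    positions = [position for position, _ in tokens]
    i = bisect.bisect_left(positions, h)
    # Hash values beyond the last position wrap around to the first node.
    return tokens[i % len(tokens)][1]


# Hypothetical node positions on the ring.
tokens = [
    (2 ** 30, "node1"),
    (2 ** 31, "node2"),
    (3 * 2 ** 30, "node3"),
    (7 * 2 ** 29, "node4"),
]
node_for_user = owner(tokens, ring_hash("userA"))
```

A hash value equal to a node position belongs to that node, which matches the left-open, right-closed ranges used throughout the text.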
Fig. 4 is a flowchart of Embodiment 1 of the file merging method of the present invention. The method provided in this embodiment is executed by each node separately: each node merges the files stored on itself according to the method. As shown in Fig. 4, the file merging method provided in this embodiment includes the following steps:
Step 101: Partition the range that each node in the cluster system is responsible for according to the key value information of user data.
Partitioning the range that a node is responsible for means dividing that range, according to a certain rule, into smaller ranges. Each range after the division is one partition, that is, a subrange of the node's range; in the embodiments of the present invention, a partition is always a subrange of some range. In the method of this embodiment, the range that each node is responsible for is further divided into smaller partitions, each partition is responsible for a smaller range, and each partition corresponds to key values of the data.
Step 102: For each node, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, where the first files have not been merged, each first file stores data corresponding to at least one user, and the data corresponding to different users has different key values.
Each node has a dedicated module responsible for merging files. This module determines whether the node meets the first trigger condition and triggers the file merging task when it does. Specifically, determining whether the node meets the first trigger condition means determining whether the number of first files stored on the node reaches the preset first file merge number M1, where M1 is a positive integer greater than or equal to 2; if it does, the node meets the first trigger condition. For example, if the preset first file merge number is 4, the file merging task is triggered when the number of first files on the node reaches 4. Here a first file is a file that has not been merged: when the node receives user data, it stores the user data in files on the node. The node also stores some files that have already been merged. Once the files on the node meet the first trigger condition, the merging task is triggered.
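The count-based trigger can be sketched in a few lines of Python. The `NodeFiles` structure and the function name are hypothetical; the only rule taken from the text is that unmerged (first) files are counted against M1, with M1 = 4 as in the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeFiles:
    """Hypothetical bookkeeping of the files held by one node."""
    unmerged: List[str] = field(default_factory=list)  # the "first files"
    merged: List[str] = field(default_factory=list)    # already merged files


def first_trigger_met(node: NodeFiles, m1: int = 4) -> bool:
    """True once the number of unmerged files reaches M1 (M1 >= 2)."""
    if m1 < 2:
        raise ValueError("M1 must be at least 2")
    return len(node.unmerged) >= m1


node = NodeFiles(unmerged=["f1", "f2", "f3"], merged=["m1"])
before = first_trigger_met(node)  # three unmerged files, below M1 = 4
node.unmerged.append("f4")
after = first_trigger_met(node)   # four unmerged files, merge is triggered
```

Files that were already merged do not count towards M1 in this sketch, following the definition of a first file given above.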
Of course, the first trigger condition may also include other preset conditions. For example, when the number of first files on the node reaches the preset file merging number, whether the first files satisfy a first merging condition is further determined according to the size of each first file; if so, the node is determined to satisfy the first trigger condition. Determining whether the first files satisfy the first merging condition according to their sizes specifically means judging whether the size difference between the first files satisfies a preset threshold. If it does, the file merging task is triggered; if the first files differ greatly in size, no merging is performed.
An example illustrates this. Suppose there are four first files, denoted 1, 2, 3 and 4, where file 1 is 100M, file 2 is 200M, file 3 is 300M and file 4 is 50M. After file 1 and file 2 are read, the average of their sizes, 150M, is taken and multiplied by a minimum weighting factor and a maximum weighting factor. Typically the maximum weighting factor is 1.5 and the minimum weighting factor is 0.5, so in this embodiment multiplying the average by the minimum and maximum weighting factors gives 75M and 225M respectively. If the size of file 3 falls within the interval [75M, 225M], file 3 is of a suitable size, satisfies the merging condition, and can be merged with files 1 and 2. In this embodiment file 3 is 300M, which is outside [75M, 225M], so it does not satisfy the merging condition. The size of file 4 can then be checked in the same way. In this embodiment file 4 is 50M; although this does not fall within [75M, 225M], file 4 is itself very small, so merging it consumes few resources. Small files therefore need not fall within the above interval: a threshold can be set, for example 50M, and a file smaller than this threshold is merged directly. This example only illustrates one way of judging whether the size difference between files satisfies the preset threshold; other methods can of course be used and are not enumerated here.
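The size check in the example above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name `select_mergeable` and its parameter names are hypothetical, the average is assumed to stay fixed at the mean of the first two files, and a file exactly at the small-file threshold is assumed to count as small.

```python
def select_mergeable(sizes_mb, min_factor=0.5, max_factor=1.5, small_threshold_mb=50):
    """Return the indices of the files chosen for merging.

    Assumption: the first two files seed the selection, and their average
    size defines the acceptance interval for the remaining files.
    """
    if len(sizes_mb) < 2:
        return []
    average = (sizes_mb[0] + sizes_mb[1]) / 2
    low, high = average * min_factor, average * max_factor
    selected = [0, 1]
    for index in range(2, len(sizes_mb)):
        size = sizes_mb[index]
        # Small files are accepted regardless of the interval
        # (assumed here: size <= threshold).
        if size <= small_threshold_mb or low <= size <= high:
            selected.append(index)
    return selected


# Files 1 to 4 from the example: 100M, 200M, 300M and 50M.
chosen = select_mergeable([100, 200, 300, 50])
```

With the sizes from the example, `chosen` holds the indices of files 1, 2 and 4: file 3 (300M) lies outside [75M, 225M], while file 4 is admitted by the small-file rule.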
Step 103: Determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file.
After the first files to be merged are read, the target partition to which each key value belongs is determined according to the key value of each user's data in the first files; that is, it is judged which partition the key value of each user's data falls into. Specifically, a hash value is first calculated for each key value from the key value of each user's data; the target partition to which each key value belongs is then determined from its hash value. The hash values of different key values may fall into different partitions.
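The two-stage lookup (key value to hash value, hash value to partition) might look like the sketch below. The text does not name a hash algorithm, so the use of MD5 here is purely an assumption for illustration; `hash_key`, `find_partition` and the `(name, low, high)` tuple layout are likewise hypothetical. Partitions are assumed to be contiguous, left-open and right-closed, and sorted by upper bound.

```python
import hashlib
from bisect import bisect_left


def hash_key(key, low, high):
    """Map a key value to an integer hash value in (low, high].

    Assumption: MD5 stands in for the unspecified hash algorithm.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return low + 1 + int.from_bytes(digest, "big") % (high - low)


def find_partition(hash_value, partitions):
    """Return the name of the partition whose interval (low, high] holds hash_value.

    partitions: list of (name, low, high), contiguous and sorted by high.
    """
    uppers = [high for _, _, high in partitions]
    index = bisect_left(uppers, hash_value)
    if index == len(partitions) or hash_value <= partitions[0][1]:
        raise ValueError("hash value is outside the node's interval")
    return partitions[index][0]


# Hypothetical partitions of a node responsible for (10, 30].
node_partitions = [("B1", 10, 15), ("B2", 15, 25), ("B3", 25, 30)]
target = find_partition(hash_key("a1", 10, 30), node_partitions)
```

`bisect_left` over the upper bounds keeps a boundary value such as 15 in the lower partition, which matches the right-closed intervals.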
Step 104: Merge the data having the same key value in the first files according to each key value, and store the merged data corresponding to each key value in the corresponding target partition.
In this step, data having the same key value in the first files is merged. For example, if the data of user A is stored in file 1, file 2 and file 3, the data of user A is read from each of these three files; since the key value of user A's data is the same in all three files, the data is merged and then stored in the target partition, which is determined by the hash value of the key value of user A's data. After merging is complete, the merged data can be stored in a second file in the target partition, for example a second file in target partition A. Each file can have a corresponding static statistics file, which stores information about that file, such as the time the data was written and the size of the file. In this embodiment, the correspondence between the second file and its target partition is also saved in the static statistics file. When the node is later restarted, the second file can be loaded directly into the partition file list of its target partition according to the static statistics file. The partition file list records information about all files in the target partition, such as the storage address of each file on disk. On restart, all files in the target partition are loaded into the partition file list of that partition according to the static statistics files. When a data query is performed, the partition file list is located through the correspondence between the second file and its target partition recorded in the static statistics file, and data is then read from disk according to the partition file list.
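Rebuilding the partition file lists on restart can be pictured with the short sketch below. The record layout is an assumption made only for illustration: the text says a static statistics file holds the write time, the file size and the file-to-partition correspondence, but it does not define field names or a storage format, so `path`, `partition` and `size_mb` as well as the example paths are hypothetical.

```python
def rebuild_partition_file_lists(stat_records):
    """Group file paths by the partition recorded in each static statistics record."""
    partition_file_lists = {}
    for record in stat_records:
        partition_file_lists.setdefault(record["partition"], []).append(record["path"])
    return partition_file_lists


# Hypothetical static statistics records, one per second file.
stat_records = [
    {"path": "/data/file6", "partition": "B1", "size_mb": 120},
    {"path": "/data/file7", "partition": "B2", "size_mb": 90},
    {"path": "/data/file8", "partition": "B3", "size_mb": 40},
]
partition_file_lists = rebuild_partition_file_lists(stat_records)
```

A query for a key whose target partition is B1 would then consult only `partition_file_lists["B1"]`.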
It should be noted that if the data of user A is stored only in file 1, and files 2 and 3 store no data of user A, then no merging is actually performed on user A's data during the merge; the data of user A is simply stored in the second file of the target partition corresponding to its key value. Data is merged only when at least two first files store data with the same key value.
In the method provided by this embodiment, the interval each node is responsible for is partitioned, and when the first trigger condition is satisfied, merging of the not-yet-merged files on the node is triggered. Each file stores data corresponding to at least one user, and the key values of different users' data differ. A hash value is calculated for each key value from the key value of each user's data, and the target partition to which each hash value belongs is determined. Data having the same key value in the files to be merged is then merged and stored in the target partition to which the hash value of that key value belongs. By partitioning the interval within a node and then performing one round of data merging, user data with the same key value is stored in the same partition, which reduces the granularity at which data is stored. At query time, the partition containing the data is first determined from the key value, and the data files of that partition are scanned for the required data. Because a partition contains fewer files, data only needs to be filtered from a smaller set of files, which improves read performance.
Fig. 5 is a flowchart of Embodiment 2 of the file merging method of the present invention. The method provided by this embodiment is executed by each node separately: each node merges the files on itself according to the method of this embodiment. Compared with Embodiment 1, this embodiment builds on Embodiment 1 by further merging the files within a partition when that partition satisfies a merging condition. As shown in Fig. 5, the file merging method provided by this embodiment includes the following steps:
Step 201: Partition the interval that each node in the cluster system is responsible for according to the key value information of user data.
Partitioning the interval a node is responsible for means dividing that interval into smaller intervals according to some rule; each interval obtained is a partition. In this embodiment, partitioning can be performed as follows: if the interval the cluster system is responsible for is (min, min+2^127], this interval is split into 2^N parts to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval the cluster system is responsible for, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, and N is the partition factor, a positive integer greater than or equal to 1 that can be configured as needed.
For each node, if the interval the node is responsible for is (r1, r2], the node is partitioned according to the partition step S, giving |r1 - r2|/S intervals. Each pair of adjacent boundary values forms a left-open, right-closed interval (rn, rn+1], where 0 < n < |r1 - r2|/S - 1 and n is a positive integer. Each interval is one partition of the node. Here r1 is the minimum hash value of the interval the node is responsible for, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
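A small sketch of the step computation and the resulting node partitions is given below. The function names are hypothetical, the sketch assumes r2 > r1 as stated in the text (so a wrap-around interval is not handled), and it assumes that a remainder, if the interval length is not a multiple of S, goes into a shorter final partition.

```python
def partition_step(partition_factor):
    """Partition step S = 2^127 / 2^N for partition factor N >= 1."""
    if partition_factor < 1:
        raise ValueError("partition factor N must be at least 1")
    return 2**127 // 2**partition_factor


def split_node_range(r1, r2, step):
    """Split (r1, r2] into consecutive left-open, right-closed intervals of width step.

    Returns a list of (low, high) pairs, each standing for the interval (low, high].
    """
    if r2 <= r1 or step <= 0:
        raise ValueError("expected r2 > r1 and a positive step")
    bounds = list(range(r1, r2, step)) + [r2]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy numbers for readability: a node responsible for (10, 30] with step 5.
toy_partitions = split_node_range(10, 30, 5)
```

For the toy interval this gives |r1 - r2|/S = 20/5 = 4 partitions: (10, 15], (15, 20], (20, 25] and (25, 30].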
Only one partitioning method is listed here. The partitions may be of equal or different sizes; the present invention does not limit this.
A specific example follows. Fig. 6 is a schematic diagram of node partitioning applicable to this embodiment. As shown in Fig. 6, the circle in the upper part of the figure represents the whole cluster system, which has four nodes: node A, node B, node C and node D. The range of hash values for the interval node A is responsible for is (70, 10], where 70 and 10 are hash values; node B is responsible for (10, 30], node C for (30, 50], and node D for (50, 70]. Taking node B as an example, its interval (10, 30] is first divided into three partitions (slices): partition B1 with interval (10, 15], partition B2 with interval (15, 25], and partition B3 with interval (25, 30]. Node B has 5 first files to be merged. These 5 files are merged into 3 second files, each belonging to a different partition, and each second file is stored in its target partition. In Fig. 6, the arrows indicate the target partition to which each second file belongs.
Suppose the 5 first files to be merged on node B are file 1, file 2, file 3, file 4 and file 5. File 1 stores the data of three users: the key value of user A1's data is a1, that of user A2's data is a2, and that of user A3's data is a3. File 2 stores the data of users A1 and A2. File 3 stores the data of user A1 and also of users A4 and A5, whose key values are a4 and a5 respectively. File 4 stores the data of users A3 and A4, and file 5 stores the data of users A2, A3, A4 and A5. Take merging the data with key value a1 as an example. First, the hash value of key value a1 is calculated with a hash algorithm. Next, the partition into which this hash value falls is determined; assuming it falls into partition B1, B1 is the target partition of key value a1. Finally, the data with key value a1 in files 1, 2 and 3 is merged and stored in file 6 on partition B1. The data with key values a2, a3, a4 and a5 is merged in the same way. Assume the target partition of key values a2 and a5 is B2, that of a1 and a3 is B1, and that of a4 is B3. Then the merged data for a1 and a3 is stored in file 6, which belongs to B1; the merged data for a2 and a5 is stored in file 7, which belongs to B2; and the merged data for a4 is stored in file 8, which belongs to B3. Files 6, 7 and 8 are all second files. Through this merge, the data in five first files is merged into three second files, each belonging to a different partition: the data of users A1 and A3 is stored on partition B1, that of A2 and A5 on B2, and that of A4 on B3.
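The five-file example above can be replayed with the sketch below. It is illustrative only: files are modelled as dictionaries from key value to a list of records, the record strings merely name the source file, and the key-to-partition mapping is taken directly from the assumptions in the example rather than computed from a hash. The function name `merge_first_files` is hypothetical.

```python
from collections import defaultdict


def merge_first_files(first_files, key_to_partition):
    """Merge records that share a key value and group them by target partition.

    Returns {partition: {key: merged_records}}; each inner dictionary stands
    for the second file stored in that partition.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for first_file in first_files:
        for key, records in first_file.items():
            merged[key_to_partition[key]][key].extend(records)
    return {partition: dict(by_key) for partition, by_key in merged.items()}


# Files 1 to 5 on node B; each record just names the file it came from.
first_files = [
    {"a1": ["file1"], "a2": ["file1"], "a3": ["file1"]},
    {"a1": ["file2"], "a2": ["file2"]},
    {"a1": ["file3"], "a4": ["file3"], "a5": ["file3"]},
    {"a3": ["file4"], "a4": ["file4"]},
    {"a2": ["file5"], "a3": ["file5"], "a4": ["file5"], "a5": ["file5"]},
]
# Target partitions as assumed in the example.
key_to_partition = {"a1": "B1", "a3": "B1", "a2": "B2", "a5": "B2", "a4": "B3"}
second_files = merge_first_files(first_files, key_to_partition)
```

Here `second_files["B1"]` plays the role of file 6 and holds the merged data for a1 and a3, `second_files["B2"]` corresponds to file 7 (a2 and a5), and `second_files["B3"]` to file 8 (a4).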
Step 202: For each node, determine that the node satisfies the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In this step, a first file is a file that has not yet been merged. Each first file stores data corresponding to at least one user, and the key values of different users' data differ. For the specific implementation, refer to the description of step 102 in Embodiment 1.
Step 203: Determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file.
The data of different users can be distinguished by key value. For each user, all of that user's data has the same key value; user data and key values correspond one to one, and at query time a user's data is found by its key value. In this embodiment, determining the target partition of each key value from the key value of each user's data is specifically as follows: first, the hash value of each key value is calculated; then it is judged which partition each hash value belongs to. Each partition is responsible for one interval, and the partition to which a key value's hash value belongs is the target partition of that key value.
Step 204: Merge the data having the same key value in the first files according to each key value, and store the merged data corresponding to each key value in the corresponding target partition.
In this step, the data having the same key value in the first files is merged, and the merged data is stored in a second file of the target partition. After merging is complete, the relationship between the second file and its target partition is stored in the static statistics file corresponding to the second file; each file corresponds to one static statistics file. When the node is later restarted, the second file can be loaded directly into the partition file list of its target partition according to the static statistics file. When a data query is performed, the partition file list of the target partition is located through the relationship between the second file and its target partition recorded in the static statistics file, and data is then read from disk according to that partition file list.
Step 205: Determine that the target partition satisfies a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In steps 201 to 204 above, the interval within a node is partitioned so that the granularity at which data is stored is reduced, and the files that have not been merged are merged once; this first merge distributes the data into different partitions. In this step, when the files in a partition satisfy the merging condition, the files in that partition are merged again. Each partition is independent, and a merge only involves the files within that partition. Specifically, when the target partition satisfies the second trigger condition, at least two second files to be merged in the target partition are read into the cache from the disk of the node where the target partition is located. Each second file stores data corresponding to at least one user, and the key values of different users' data differ.
In this embodiment, determining whether the target partition satisfies the second trigger condition specifically means judging whether the number of second files stored on the target partition reaches a preset second file merging number M2, where M2 is a positive integer greater than or equal to 2. Of course, the second trigger condition may also include other preset conditions. For example, when the number of second files on the target partition reaches the preset second file merging number, whether the second files satisfy a second merging condition is further determined according to the size of each second file. This specifically means judging whether the size difference between the second files satisfies a preset maximum threshold: if it does, the file merging task is triggered; if the second files differ greatly in size, no merging is performed. The judgment can use the same method used in Embodiment 1 for the first files, which is not repeated here, although the specific parameter settings may differ. For example, if one second file is 500M and another is 30M, they are not merged. The preset file merging number M2 can be set as needed: if fast queries are required, M2 can be set to a smaller value so that merges happen more often, which reduces the number of files and improves query efficiency.
Step 206: Merge the data having the same key value in the second files according to the key value of each user's data, and store the merged data corresponding to each key value in a third file of the target partition.
In this step, the data having the same key value in the second files to be merged in the target partition is merged into the same third file, so that data within a partition is highly aggregated. Fig. 3 is a schematic diagram of file merging in the prior art. As shown in Fig. 3, once the number of files exceeds 4, a merging task is initiated and 4 files are merged into one new file; if new data files are generated afterwards, data files of similar size are chosen for merging by comparing sizes. Specifically, all sstables under the same table are grouped by size, with sstables of similar size placed in one group, forming n groups (n >= 1). The group with the smallest average size is then taken from these n groups to form a task that performs the compaction operation. The number of files in each group must be within the range (4, 32]; groups with too many files are truncated. The data with the same key value is then merged together in a map-reduce manner to form a new data file. The method supports concurrent execution of multiple tasks by multiple threads, with each task choosing different data files.
The prior art, however, has the following problems. The trigger condition for merging data files is low: 4 files are enough to trigger a merge, so merges occur many times a day, and a large number of file merges consumes considerable memory, CPU, I/O and other resources. Moreover, after data files have been merged over many rounds, the key values are already highly consolidated; further merges then mostly just move key-value data around with very little actual merging, which wastes system resources. The method provided by this embodiment splits large files into small partition files and merges the files in a partition only when they satisfy the merging condition. Because the number of files in a partition is small compared with the number of files on a node in the prior art, the number of merges is reduced and merging efficiency is improved. When data is evenly distributed and the files stored on each partition are the same size, merging efficiency for the same amount of data can improve by more than 30%.
Step 207: When a data query request is received, obtain the key value of the data to be queried from the request, determine the partition to be queried in which that data is located according to the key value, scan all files in the partition to be queried according to the key value, and obtain the data to be queried corresponding to the key value.
During a data query, the partition in which the data to be queried is located is first determined from its key value; the data files in that partition are then filtered and scanned, and the matching results are returned. Specifically, when a data query request is received, the key value information of the data to be queried is parsed from it, the hash value of that key value is calculated, and the partition to be queried is determined from the hash value. All files in the partition to be queried are then scanned according to the key value, and the data to be queried corresponding to the key value is obtained and returned.
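The query path can be sketched as below. This is an illustrative sketch under stated assumptions: the name `query_by_key`, the `(name, low, high)` partition tuples, and the modelling of each data file as a dictionary from key value to records are all hypothetical, and the hash function is passed in as a parameter because the text does not specify one.

```python
def query_by_key(key, partitions, partition_files, hash_fn):
    """Return all records for key, scanning only the files of its partition.

    partitions: list of (name, low, high), each covering the interval (low, high].
    partition_files: {partition name: list of files}, each file a dict key -> records.
    hash_fn: callable mapping a key value to an integer hash value (assumed).
    """
    hash_value = hash_fn(key)
    for name, low, high in partitions:
        if low < hash_value <= high:
            results = []
            # Only files in the matching partition are scanned.
            for data_file in partition_files.get(name, []):
                results.extend(data_file.get(key, []))
            return results
    return []


# Toy setup: fixed hash values stand in for a real hash function.
toy_hashes = {"a1": 12, "a4": 27}
partitions = [("B1", 10, 15), ("B2", 15, 25), ("B3", 25, 30)]
partition_files = {
    "B1": [{"a1": ["x", "y"], "a3": ["z"]}],
    "B3": [{"a4": ["w"]}, {"a4": ["v"]}],
}
records = query_by_key("a4", partitions, partition_files, toy_hashes.__getitem__)
```

In this toy setup the lookup for a4 touches only the two files of partition B3 and never opens the file in B1.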
The method provided by this embodiment can also improve query efficiency. By contrast, the original query scanning approach has to check every file on disk once to determine whether the data to be queried is in that data file, which makes queries inefficient. It also has to read data from disk frequently, which increases the I/O overhead of the system and occupies excessive CPU and memory, wasting resources. Assuming data is evenly distributed, the data volume is 100G and the query depth is 1000, if a node is divided into four partitions, query efficiency with partitions can improve by 25%.
In the method provided by this embodiment, the interval within a node is partitioned so that the granularity at which data is stored is reduced, and one merge distributes the data into different partitions. When a partition satisfies the merging condition, the files in that partition are merged, so that data with the same key value in the partition is merged into one file, which increases the degree of aggregation of the files in the partition. During a query, the partition containing the data to be queried is first determined from the key value, the data files in that partition are filtered and scanned, and the matching results are returned. Because a partition contains few files and the data in each file is highly aggregated, the number of lookups is reduced and query efficiency is improved.
Fig. 7 is a schematic structural diagram of Embodiment 1 of the file merging apparatus of the present invention. The file merging apparatus provided by this embodiment can be integrated on each node in the cluster system, and can of course also be provided independently. As shown in Fig. 7, the file merging apparatus provided by this embodiment includes: a partitioning module 31, a file reading module 32, a partition determining module 33 and a file merging module 34.
The partitioning module 31 is configured to partition the interval that each node in the cluster system is responsible for according to the key value information of user data, where each partition of each node corresponds to key values of user data;
The file reading module 32 is configured to, for each node, determine that the node satisfies the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not been merged, each first file stores data corresponding to at least one user, and the key values of different users' data differ;
The partition determining module 33 is configured to determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file;
The file merging module 34 is configured to merge the data having the same key value in the first files according to each key value, and store the merged data corresponding to each key value in the corresponding target partition.
In this embodiment, the partition determining module 33 is specifically configured to: calculate the hash value of each key value from the key value of each user's data; and determine the target partition to which each key value belongs from the hash value of that key value.
The file merging apparatus provided by this embodiment can be used to execute the technical solution provided by method Embodiment 1; the specific implementation and technical effects are similar and are not repeated here.
Fig. 8 is a schematic structural diagram of Embodiment 2 of the file merging apparatus of the present invention. The file merging apparatus provided by this embodiment can be integrated on each node in the cluster system, and can of course also be provided independently. As shown in Fig. 8, the file merging apparatus provided by this embodiment includes: a partitioning module 41, a file reading module 42, a partition determining module 43 and a file merging module 44.
The partitioning module 41 is configured to partition the interval that each node in the cluster system is responsible for according to the key value information of user data, where each partition of each node corresponds to key values of user data;
The file reading module 42 is configured to, for each node, determine that the node satisfies the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, where each first file has not been merged, each first file stores data corresponding to at least one user, and the key values of different users' data differ;
The partition determining module 43 is configured to determine the target partition to which each key value belongs according to the key value of the data corresponding to each user in each first file;
The file merging module 44 is configured to merge the data having the same key value in the first files according to each key value, and store the merged data corresponding to each key value in the corresponding target partition.
In this embodiment, the partitioning module 41 partitions each node of the cluster system as follows. If the interval the cluster system is responsible for is (min, min+2^127], this interval is split into 2^N parts to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval the cluster system is responsible for, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, and N is the partition factor, a positive integer greater than or equal to 1. For each node, if the interval the node is responsible for is (r1, r2], the node is partitioned according to the partition step S, giving |r1 - r2|/S intervals. Each pair of adjacent boundary values forms a left-open, right-closed interval (rn, rn+1], where 0 < n < |r1 - r2|/S - 1 and n is a positive integer. Each interval is one partition of the node. Here r1 is the minimum hash value of the interval the node is responsible for, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
The file reading module 42 is specifically configured to: judge whether the number of first files stored on the node reaches a preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; if so, determine that the node satisfies the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node. Alternatively, the file reading module 42 is specifically configured to: judge whether the number of first files stored on the node reaches the preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; if so, determine according to the size of each first file whether the first files satisfy the first merging condition; if they do, determine that the node satisfies the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
The partition determining module 43 is specifically configured to: calculate the hash value of each key value from the key value of each user's data; and determine the target partition to which each key value belongs from the hash value of that key value.
For each node, the first merge distributes the files into the partitions of that node. In this embodiment, when a partition satisfies the merging condition, the files in that partition are merged further so that the files in each partition are highly aggregated. Therefore, in this embodiment the file reading module 42 is further configured to: determine that the target partition satisfies the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the key values of different users' data differ. Correspondingly, the file merging module 44 is further configured to merge the data having the same key value in the second files according to the key value of each user's data, and store the merged data corresponding to each key value in a third file of the target partition.
In this embodiment, the file reading module 42 determines whether the target partition satisfies the second trigger condition in one of the following two ways. In the first way, it judges whether the number of second files stored on the target partition reaches a preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, it determines that the target partition satisfies the second trigger condition and reads at least two second files in the target partition from the disk of the node where the target partition is located into the cache. In the second way, it judges whether the number of second files stored on the target partition reaches the preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, it determines according to the size of each second file whether the second files satisfy the second merging condition; if they do, it determines that the target partition satisfies the second trigger condition and reads at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
Further, the file merging apparatus of this embodiment also includes a receiving module 45, a key value obtaining module 46 and a query module 47. The receiving module 45 is configured to receive a data query request. The key value obtaining module 46 is configured to obtain the key value of the data to be queried from the data query request when the receiving module receives the request. The partition determining module 43 is further configured to determine the partition to be queried, in which the data to be queried is located, according to the key value of the data to be queried. The query module 47 is configured to scan all files in the partition to be queried according to the key value of the data to be queried, and obtain the data to be queried corresponding to that key value.
In the present embodiment, subregion determining module 43 determines the subregion to be checked that data to be checked is located in the following manner: First, calculate the corresponding cryptographic Hash of key assignments of data to be checked, then, true according to the corresponding cryptographic Hash of key assignments of data to be checked The subregion to be checked that fixed data to be checked is located.
The Piece file mergence device that the present embodiment provides can be used for executing the technical scheme that embodiment of the method two provides, specifically real Existing mode is similar with technique effect, repeats no more here.
Fig. 9 is a schematic structural diagram of embodiment three of the file merging device of the present invention. The file merging device 500 provided in this embodiment includes a processor 51, a memory 52, and a receiver 53. The memory 52 and the receiver 53 are connected to the processor 51 through a bus. The memory 52 stores execution instructions. When the file merging device 500 runs, the processor 51 communicates with the memory 52, and the processor 51 executes the execution instructions so that the file merging device 500 performs the following operations:
partitioning, according to key value information of user data, the interval that each node in the cluster system is responsible for, where the partitions of each node correspond one-to-one to key values of user data;
for each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, where none of the first files has been merged before, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
determining, according to the key value of the data corresponding to each user in the first files, the target partition to which each key value belongs;
merging, according to each key value, the data having the same key value in the first files, and storing the merged data corresponding to each key value in the corresponding target partition.
Determining the target partition to which each key value belongs according to the key value of the data corresponding to each user in the first files specifically includes: calculating the hash value corresponding to each key value according to the key value of the data corresponding to each user; and determining the target partition to which each key value belongs according to the hash value corresponding to that key value.
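The first-level merge described above can be sketched in memory as follows. This is a minimal illustration under stated assumptions: MD5 as the hash function, equal-width hash ranges as the key-to-partition mapping, and the names `Record`, `partition_of`, and `merge_first_files` are all hypothetical and do not come from the patent text.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, bytes]   # (user key value, payload); hypothetical layout


def partition_of(key: str, num_partitions: int) -> int:
    """Map a key value to a partition index through its hash value.

    MD5 and the equal-width split of its 128-bit range are assumptions.
    """
    h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return h * num_partitions // 2**128


def merge_first_files(files: Iterable[List[Record]],
                      num_partitions: int) -> Dict[int, Dict[str, List[bytes]]]:
    """Merge data with the same key value and route it to its target partition."""
    merged: Dict[str, List[bytes]] = defaultdict(list)
    for records in files:                  # first files already read into the cache
        for key, payload in records:
            merged[key].append(payload)    # same key value -> same group
    partitions: Dict[int, Dict[str, List[bytes]]] = defaultdict(dict)
    for key, payloads in merged.items():
        partitions[partition_of(key, num_partitions)][key] = payloads
    return dict(partitions)
```

After the call, every key value appears in exactly one partition, so a later lookup only has to visit that partition.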
The processor 51 is further configured to:
determine that the target partition meets a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
merge, according to the key values of the data corresponding to the users, the data having the same key value in the second files, and store the merged data having the same key value in a third file of the target partition.
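The second-level merge inside one target partition can be pictured as folding several second files into a single third file. The sketch below models each file as an in-memory mapping from key value to data items; that representation and the name `compact_partition` are assumptions made for illustration only.

```python
from typing import Dict, List

PartitionFile = Dict[str, List[str]]   # key value -> data items (hypothetical layout)


def compact_partition(second_files: List[PartitionFile]) -> PartitionFile:
    """Merge the second files of one partition into the content of a third file."""
    third: PartitionFile = {}
    for second in second_files:              # keep the order in which files were read
        for key, items in second.items():
            third.setdefault(key, []).extend(items)
    return third
```

Because all files in a partition only hold key values that hash into that partition, this step never has to move data between partitions.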
Determining, according to the key value of the data to be queried, the partition to be queried in which the data is located includes: calculating the hash value corresponding to the key value of the data to be queried; and determining, according to that hash value, the partition to be queried in which the data is located.
In this embodiment, determining whether the node meets the first trigger condition includes: judging whether the number of first files stored on the node reaches a preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; and if so, determining that the node meets the first trigger condition.
Alternatively, it includes: judging whether the number of first files stored on the node reaches the preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; if so, determining, according to the size of each first file, whether the first files meet a first merging condition; and if they do, determining that the node meets the first trigger condition.
In this embodiment, determining that the target partition meets the second trigger condition includes: judging whether the number of second files stored in the target partition reaches a preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; and if so, determining that the target partition meets the second trigger condition. Alternatively, it includes: judging whether the number of second files stored in the target partition reaches the preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, determining, according to the size of each second file, whether the second files meet a second merging condition; and if they do, determining that the target partition meets the second trigger condition.
In this embodiment, partitioning, according to the key value information of user data, the interval that each node in the cluster system is responsible for includes:
if the interval that the cluster system is responsible for is (min, min+2^127], splitting that interval into 2^N parts to obtain a partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval that the cluster system is responsible for, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
for each node, if the interval that the node is responsible for is (r1, r2], partitioning the node according to the partition step S to obtain |r1 − r2|/S intervals, where each pair of sequentially adjacent boundary points forms a left-open, right-closed interval (r_n, r_{n+1}], 0 < n < |r1 − r2|/S − 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval that the node is responsible for, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 − r2| is the absolute value of the difference between r1 and r2.
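The partitioning arithmetic can be sketched as follows. The function names are hypothetical, the upper bound of 127 on N is added here only so that the step stays at least 1, and the sketch assumes the width of a node's interval is an exact multiple of S; the text does not say how a remainder would be handled.

```python
from typing import List, Tuple

HASH_SPAN = 2**127   # width of the cluster interval (min, min + 2^127]


def partition_step(n: int) -> int:
    """Partition step S = 2^127 / 2^N for partition factor N."""
    if not 1 <= n <= 127:
        raise ValueError("partition factor N must be in 1..127")
    return HASH_SPAN // 2**n


def node_partitions(r1: int, r2: int, n: int) -> List[Tuple[int, int]]:
    """Split the node interval (r1, r2] into left-open, right-closed partitions."""
    if r2 <= r1:
        raise ValueError("r2 must be greater than r1")
    s = partition_step(n)
    count = (r2 - r1) // s   # |r1 - r2| / S, assuming an exact multiple of S
    return [(r1 + i * s, r1 + (i + 1) * s) for i in range(count)]


def find_partition(hash_value: int, partitions: List[Tuple[int, int]]) -> int:
    """Return the index of the partition (lo, hi] that contains hash_value."""
    for index, (lo, hi) in enumerate(partitions):
        if lo < hash_value <= hi:
            return index
    raise KeyError("hash value is outside this node's interval")
```

For example, with N = 2 the step is 2^125, so a node responsible for (0, 2^126] gets two partitions, (0, 2^125] and (2^125, 2^126].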
In this embodiment, the receiver 53 is configured to receive a data query request. The processor 51 is further configured to: when a data query request is received, obtain the key value of the data to be queried from the request; determine, according to that key value, the partition to be queried in which the data is located; scan all files in the partition to be queried according to the key value; and obtain the data corresponding to that key value.
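The query path can be sketched as: hash the key value, pick the partition, then scan every file in that partition only. As before, MD5, the equal-width mapping in `partition_of`, and the in-memory file layout are assumptions for illustration, not details given in the text.

```python
import hashlib
from typing import Dict, List

PartitionFiles = List[Dict[str, List[str]]]   # files of one partition: key value -> data


def partition_of(key: str, num_partitions: int) -> int:
    """Map a key value to a partition index (MD5 and equal-width ranges assumed)."""
    h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return h * num_partitions // 2**128


def query(key: str, partitions: List[PartitionFiles]) -> List[str]:
    """Return all data stored under `key`, scanning only its own partition."""
    target = partitions[partition_of(key, len(partitions))]
    result: List[str] = []
    for partition_file in target:             # scan every file in the partition
        result.extend(partition_file.get(key, []))
    return result
```

The benefit of the earlier merges shows up here: the fewer files a partition holds, the fewer files each query has to scan.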
The file merging device provided in this embodiment can be used to execute the methods shown in method embodiment one and embodiment two. Its implementation and technical effect are similar and are not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to describe the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (20)

1. A file merging method, characterized by comprising:
partitioning, according to key value information of user data, the interval that each node in a cluster system is responsible for;
for each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into a cache corresponding to the node, wherein none of the first files has been merged before, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
determining, according to the key value of the data corresponding to each user in the first files, the target partition to which each key value belongs;
merging, according to each key value, the data having the same key value in the first files, and storing the merged data corresponding to each key value in the corresponding target partition;
wherein the key values of data corresponding to the same user are the same, and the key values of data corresponding to different users are different.
2. The method according to claim 1, characterized in that determining, according to the key value of the data corresponding to each user in the first files, the target partition to which each key value belongs comprises:
calculating the hash value corresponding to each key value according to the key value of the data corresponding to each user;
determining the target partition to which each key value belongs according to the hash value corresponding to that key value.
3. The method according to claim 1, characterized in that, after merging, according to each key value, the data having the same key value in the first files and storing the merged data corresponding to each key value in the corresponding target partition, the method further comprises:
determining that the target partition meets a second trigger condition, and reading at least two second files in the target partition from the disk of the node where the target partition is located into a cache, wherein each second file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
merging, according to the key values of the data corresponding to the users, the data having the same key value in the second files, and storing the merged data corresponding to each key value in a third file of the target partition.
4. The method according to any one of claims 1-3, characterized in that, after merging, according to each key value, the data having the same key value in the first files and storing the merged data corresponding to each key value in the corresponding target partition, the method further comprises:
when a data query request is received, obtaining the key value of the data to be queried according to the data query request;
determining, according to the key value of the data to be queried, the partition to be queried in which the data is located, scanning all files in the partition to be queried according to the key value of the data to be queried, and obtaining the data corresponding to that key value.
5. The method according to claim 4, characterized in that determining, according to the key value of the data to be queried, the partition to be queried in which the data is located comprises:
calculating the hash value corresponding to the key value of the data to be queried;
determining, according to the hash value corresponding to the key value of the data to be queried, the partition to be queried in which the data is located.
6. The method according to claim 1, characterized in that determining that the node meets the first trigger condition comprises:
judging whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determining that the node meets the first trigger condition.
7. The method according to claim 1, characterized in that determining that the node meets the first trigger condition comprises:
judging whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determining, according to the size of each first file, whether the first files meet a first merging condition;
if they do, determining that the node meets the first trigger condition.
8. The method according to claim 3, characterized in that determining that the target partition meets the second trigger condition comprises:
judging whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determining that the target partition meets the second trigger condition.
9. The method according to claim 3, characterized in that determining that the target partition meets the second trigger condition comprises:
judging whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determining, according to the size of each second file, whether the second files meet a second merging condition;
if they do, determining that the target partition meets the second trigger condition.
10. The method according to claim 1, characterized in that partitioning, according to the key value information of user data, the interval that each node in the cluster system is responsible for comprises:
calculating, according to the key value information of the user data, the hash value corresponding to the key value information of the user data;
partitioning, according to the hash value corresponding to the key value information of the user data, the interval that each node in the cluster system is responsible for;
wherein partitioning the interval that each node in the cluster system is responsible for comprises:
if the interval that the cluster system is responsible for is (min, min+2^127], splitting that interval into 2^N parts to obtain a partition step S, S = 2^127/2^N, wherein min is the minimum hash value of the interval that the cluster system is responsible for, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
for each node, if the interval that the node is responsible for is (r1, r2], partitioning the node according to the partition step S to obtain |r1 − r2|/S intervals, wherein each pair of sequentially adjacent boundary points forms a left-open, right-closed interval (r_n, r_{n+1}], 0 < n < |r1 − r2|/S − 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval that the node is responsible for, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 − r2| is the absolute value of the difference between r1 and r2.
11. A file merging device, characterized by comprising:
a partitioning module, configured to partition, according to key value information of user data, the interval that each node in a cluster system is responsible for;
a file reading module, configured to, for each node, determine that the node meets a first trigger condition and read at least two first files from the disk of the node into a cache corresponding to the node, wherein none of the first files has been merged before, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
a partition determining module, configured to determine, according to the key value of the data corresponding to each user in the first files, the target partition to which each key value belongs;
a file merging module, configured to merge, according to each key value, the data having the same key value in the first files, and to store the merged data corresponding to each key value in the corresponding target partition;
wherein the key values of data corresponding to the same user are the same, and the key values of data corresponding to different users are different.
12. The device according to claim 11, characterized in that the partition determining module is specifically configured to:
calculate the hash value corresponding to each key value according to the key value of the data corresponding to each user;
determine the target partition to which each key value belongs according to the hash value corresponding to that key value.
13. The device according to claim 11, characterized in that the file reading module is further configured to:
determine that the target partition meets a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into a cache, wherein each second file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
and the file merging module is further configured to merge, according to the key values of the data corresponding to the users, the data having the same key value in the second files, and to store the merged data corresponding to each key value in a third file of the target partition.
14. The device according to any one of claims 11-13, characterized by further comprising a receiving module, a key value obtaining module, and a query module;
the receiving module is configured to receive a data query request;
the key value obtaining module is configured to obtain the key value of the data to be queried according to the data query request when the receiving module receives the data query request;
the partition determining module is further configured to determine, according to the key value of the data to be queried, the partition to be queried in which the data is located;
the query module is configured to scan all files in the partition to be queried according to the key value of the data to be queried, and to obtain the data corresponding to that key value.
15. The device according to claim 14, characterized in that the partition determining module is specifically configured to:
calculate the hash value corresponding to the key value of the data to be queried;
determine, according to the hash value corresponding to the key value of the data to be queried, the partition to be queried in which the data is located.
16. The device according to claim 11, characterized in that the file reading module is specifically configured to:
judge whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
17. The device according to claim 11, characterized in that the file reading module is specifically configured to:
judge whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determine, according to the size of each first file, whether the first files meet a first merging condition;
if they do, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
18. The device according to claim 13, characterized in that the file reading module is specifically configured to:
judge whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
19. The device according to claim 13, characterized in that the file reading module is specifically configured to:
judge whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determine, according to the size of each second file, whether the second files meet a second merging condition;
if they do, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
20. The device according to claim 11, characterized in that the partitioning module is specifically configured to:
calculate, according to the key value information of the user data, the hash value corresponding to the key value information of the user data;
partition, according to the hash value corresponding to the key value information of the user data, the interval that each node in the cluster system is responsible for;
if the interval that the cluster system is responsible for is (min, min+2^127], split that interval into 2^N parts to obtain a partition step S, S = 2^127/2^N, wherein min is the minimum hash value of the interval that the cluster system is responsible for, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
for each node, if the interval that the node is responsible for is (r1, r2], partition the node according to the partition step S to obtain |r1 − r2|/S intervals, wherein each pair of sequentially adjacent boundary points forms a left-open, right-closed interval (r_n, r_{n+1}], 0 < n < |r1 − r2|/S − 1, n is a positive integer, each interval is one partition of the node, r1 is the minimum hash value of the interval that the node is responsible for, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 − r2| is the absolute value of the difference between r1 and r2.
CN201310561317.8A 2013-11-12 2013-11-12 file merging method and device Active CN103593436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310561317.8A CN103593436B (en) 2013-11-12 2013-11-12 file merging method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310561317.8A CN103593436B (en) 2013-11-12 2013-11-12 file merging method and device

Publications (2)

Publication Number Publication Date
CN103593436A CN103593436A (en) 2014-02-19
CN103593436B true CN103593436B (en) 2017-02-08

Family

ID=50083577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310561317.8A Active CN103593436B (en) 2013-11-12 2013-11-12 file merging method and device

Country Status (1)

Country Link
CN (1) CN103593436B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046638A (en) * 2018-12-29 2019-07-23 阿里巴巴集团控股有限公司 Fusion method, device and the equipment of multi-platform data

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104360824B (en) * 2014-11-10 2017-12-12 北京奇虎科技有限公司 The method and apparatus that a kind of data merge
CN106326309B (en) * 2015-07-03 2020-02-21 阿里巴巴集团控股有限公司 Data query method and device
CN105159915B (en) * 2015-07-16 2018-07-10 中国科学院计算技术研究所 The LSM trees merging method and system of dynamic adaptable
CN106446039B (en) * 2016-08-30 2020-07-21 北京航空航天大学 Aggregation type big data query method and device
CN107861959A (en) * 2016-09-22 2018-03-30 阿里巴巴集团控股有限公司 Data processing method, apparatus and system
CN108021562B (en) * 2016-10-31 2022-11-18 中兴通讯股份有限公司 Disk storage method and device applied to distributed file system and distributed file system
CN106776811A (en) * 2016-11-23 2017-05-31 李天� data index method and device
CN106708968B (en) * 2016-12-01 2019-11-26 成都华为技术有限公司 Data processing method in distributed data base system and distributed data base system
CN106599247B (en) * 2016-12-19 2020-04-17 北京奇虎科技有限公司 Method and device for merging data files in LSM-tree structure
CN106777230B (en) * 2016-12-26 2020-01-07 东软集团股份有限公司 Partition system, partition method and device
CN108628542B (en) * 2017-03-22 2021-08-03 华为技术有限公司 File merging method and controller
CN107391541B (en) * 2017-05-16 2020-10-20 创新先进技术有限公司 Real-time data merging method and device
CN107357921A (en) * 2017-07-21 2017-11-17 北京奇艺世纪科技有限公司 A kind of small documents storage localization method and system
CN110019092B (en) * 2017-12-27 2021-07-09 华为技术有限公司 Data storage method, controller and system
CN108563698B (en) 2018-03-22 2021-11-23 中国银联股份有限公司 Region merging method and device for HBase table
CN110399545B (en) * 2018-04-20 2023-06-02 伊姆西Ip控股有限责任公司 Method and apparatus for managing document index
CN110825794B (en) * 2018-08-14 2022-03-29 华为云计算技术有限公司 Partition merging method and database server
WO2020034818A1 (en) * 2018-08-14 2020-02-20 华为技术有限公司 Partition merging method and database server
CN110321349B (en) * 2019-06-13 2021-11-12 暨南大学 Self-adaptive data merging and storing method for data origin system
CN110888837B (en) * 2019-11-15 2021-01-22 星辰天合(北京)数据科技有限公司 Object storage small file merging method and device
CN113342813B (en) * 2021-06-09 2024-01-26 南京冰鉴信息科技有限公司 Key value data processing method, device, computer equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101605028A (en) * 2009-02-17 2009-12-16 北京安天电子设备有限公司 A kind of combining log records method and system
CN102905311A (en) * 2012-09-29 2013-01-30 北京傲天动联技术有限公司 Data-message aggregating device and method
CN102968503A (en) * 2012-12-10 2013-03-13 曙光信息产业(北京)有限公司 Data processing method for database system, and database system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《云计算实战》;张德丰;《清华大学出版社》;20120731;全文 *
《分布式存储系统中一致性哈希算法的研究》;杨彧剑等;《Computer Knowledge and Technology 电脑知识与技术》;20110831;第7卷(第22期);全文 *

Also Published As

Publication number Publication date
CN103593436A (en) 2014-02-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant