CN103593436B - file merging method and device - Google Patents
- Publication number
- CN103593436B CN201310561317.8A
- Authority
- CN
- China
- Prior art keywords
- file
- data
- key value
- node
- partition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
Abstract
Embodiments of the invention provide a file merging method and device. The method comprises: partitioning the interval that each node in a cluster system is responsible for according to key value information of user data; for each node, determining that the node meets a first trigger condition and reading at least two first files from the disk of the node into the cache corresponding to the node; determining, according to the key values of the user data stored in the first files, the target partition to which each key value belongs; merging the data that have the same key value; and storing the merged data for each key value in the corresponding target partition. Because the interval of each node is partitioned, after one round of merging the data with the same key value are stored in the same partition. During a query, the partition in which the data are located is determined from the key value, and only the files in that partition are scanned. Because each partition contains few files, data screening is performed on fewer files, which improves read performance.
Description
Technical field
Embodiments of the present invention relate to data communication technology, and in particular to a file merging method and device.
Background
With the continuous development of the Internet, Internet applications keep growing in scale, and the data storage that these applications rely on faces increasing challenges. Traditional relational databases can hardly meet the storage demands of massive data, and non-relational (NoSQL) databases have emerged in response, for example BigTable, developed by Google, and Cassandra, from Facebook. A non-relational database is usually a distributed system whose stored data is distributed across the nodes. At present most non-relational databases are implemented with consistent hashing. In consistent hashing, all the hash values of the hash function form a ring whose ends are joined (the maximum value is connected to the minimum value), and each node in the non-relational database cluster is responsible for a part of this ring. The data to be stored is hashed as well, and the node responsible for storing the data can be found from the hash value, which establishes the correspondence between stored data and nodes.
As for the physical storage structure of the data on each node, a traditional relational database has fixed blocks that can be read and written repeatedly. A non-relational database, in order to guarantee concurrent write performance, uses a disk random-write mode in which a logical data file on disk is the minimum data storage unit. Old data is not deleted; instead, the latest data is determined by comparing timestamps. This data persistence mode, which differs from that of relational databases, is adopted by many current non-relational databases.
Fig. 1 is a schematic diagram of data storage in a non-relational database. As shown in Fig. 1, when data is written, it is first written into a memory table. When the memory table is full, the data to be written in the memory table is written to disk by a flush operation and becomes a file group, whose data output format may be a sorted string table (SSTable, Sorted String Table). Each file group includes a group of files, which respectively store the user data, the index information of the file, the hash algorithm of the key values, and static statistics. As shown in Fig. 1, the data in the memory table (to be written) is currently being written into file group n. The data of the same user may be scattered across multiple different files. As the number of files keeps growing, every data read must screen data from multiple files, and only by comparing the timestamps of identical records can it be determined which record to return to the client. In this scenario, the longer the system runs, the more data files there are, and the read performance of the whole database declines exponentially. Many non-relational database products have therefore proposed and implemented data merging methods: using identical key values, the data scattered across data files is combined into one large data file through merge comparison (MAP-REDUCE), and after multiple merge comparisons the number of files is reduced, which improves the read performance of the database.
In the prior art, even with file merging, the number of files stored on a node is still large, every data read still has to screen data from multiple files, and the read performance of the whole database remains low.
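The timestamp-based merge described in this background can be illustrated with a minimal Python sketch. It assumes that each data file is modelled as a list of (key, timestamp, value) records; the function name `merge_files` and the record layout are hypothetical and are not taken from any particular database product.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, str]  # (key, timestamp, value); layout is an assumption


def merge_files(files: List[List[Record]]) -> List[Record]:
    """Combine several files into one, keeping the newest record per key."""
    latest: Dict[str, Record] = {}
    for records in files:
        for key, ts, value in records:
            # Old data is never deleted in place; the newer timestamp wins.
            if key not in latest or ts > latest[key][1]:
                latest[key] = (key, ts, value)
    # Emit records sorted by key, as in a sorted string table.
    return [latest[k] for k in sorted(latest)]


file_1 = [("alice", 1, "a1"), ("bob", 2, "b1")]
file_2 = [("alice", 5, "a2"), ("carol", 3, "c1")]
merged = merge_files([file_1, file_2])
```

After such a merge a read for "alice" touches one file instead of two, but all keys of the node still share the same files, which is the limitation the paragraph above points out.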
Summary of the invention
Embodiments of the present invention provide a file merging method and device that can improve the read performance of files.
A first aspect of the present invention provides a file merging method, including:
partitioning, according to key value information of user data, the interval that each node in a cluster system is responsible for, where each partition of each node corresponds to key values of the user data;
for each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, where the first files have not been merged, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
determining, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs; and
merging, according to each key value, the data that have the same key value in the first files, and storing the merged data corresponding to each key value in the corresponding target partition.
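The steps of the first aspect can be sketched in Python as follows. This is only an illustration under stated assumptions: records are (key, timestamp, value) tuples, "merging" is taken to mean keeping the record with the newest timestamp, MD5 stands in for the unspecified hash function, and the modulo mapping passed as `partition_of` is a simplification of the interval-based partitioning. All names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, int, str]  # (key, timestamp, value); assumed layout


def key_hash(key: str) -> int:
    # MD5 is only a stand-in; the text does not name a hash function.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def merge_into_partitions(
    first_files: List[List[Record]],
    partition_of: Callable[[str], int],
) -> Dict[int, Dict[str, Record]]:
    """Merge unmerged files; data with the same key lands in one partition."""
    partitions: Dict[int, Dict[str, Record]] = defaultdict(dict)
    for records in first_files:
        for key, ts, value in records:
            target = partitions[partition_of(key)]
            # Assumption: the newest timestamp wins when keys collide.
            if key not in target or ts > target[key][1]:
                target[key] = (key, ts, value)
    return dict(partitions)


NUM_PARTITIONS = 4  # illustrative value, not from the text
by_partition = merge_into_partitions(
    [[("u1", 1, "old"), ("u2", 1, "x")], [("u1", 2, "new")]],
    lambda k: key_hash(k) % NUM_PARTITIONS,
)
```

Passing the key-to-partition mapping as a function keeps the merge logic independent of how the partitions are actually derived.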
In a first possible implementation of the first aspect of the present invention, determining, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs includes:
calculating the hash value corresponding to each key value according to the key values of the data corresponding to the users; and
determining the target partition to which each key value belongs according to the hash value corresponding to each key value.
In a second possible implementation of the first aspect of the present invention, after merging, according to each key value, the data that have the same key value in the target files and storing the merged data corresponding to each key value in the corresponding target partition, the method further includes:
determining that the target partition meets a second trigger condition, and reading at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the key values of the data corresponding to different users are different; and
merging, according to the key values of the data corresponding to the users, the data that have the same key value in the second files, and storing the merged data corresponding to each key value in a third file of the target partition.
With reference to the first aspect of the present invention and the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect of the present invention, after merging, according to each key value, the data that have the same key value in the target files and storing the merged data corresponding to each key value in the corresponding target partition, the method further includes:
when a data query request is received, obtaining the key value of the data to be queried according to the data query request; and
determining, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located, scanning all the files in the partition to be queried according to the key value of the data to be queried, and obtaining the data to be queried that corresponds to that key value.
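The query path in this implementation can be sketched as follows. The sketch assumes the files of each partition are available as lists of (key, timestamp, value) records and that a `partition_of` function maps a key to its partition; both the layout and the toy mapping below are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[str, int, str]  # (key, timestamp, value); assumed layout


def query(
    key: str,
    partition_files: Dict[int, List[List[Record]]],
    partition_of: Callable[[str], int],
) -> Optional[Record]:
    """Scan only the files of the partition that the key maps to."""
    newest: Optional[Record] = None
    for records in partition_files.get(partition_of(key), []):
        for rec in records:
            if rec[0] == key and (newest is None or rec[1] > newest[1]):
                newest = rec
    return newest


def toy_partition_of(key: str) -> int:
    # Toy mapping for illustration only: first character code modulo 2.
    return ord(key[0]) % 2


files = {
    0: [[("b", 1, "b-old")], [("b", 4, "b-new")]],
    1: [[("a", 2, "a-only")]],
}
result = query("b", files, toy_partition_of)
```

Only the two files of partition 0 are read for key "b"; the file in partition 1 is never opened, which is where the read-performance gain comes from.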
In a fourth possible implementation of the first aspect of the present invention, determining, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located includes:
calculating the hash value corresponding to the key value of the data to be queried; and
determining the partition to be queried in which the data to be queried is located according to the hash value corresponding to the key value of the data to be queried.
In a fifth possible implementation of the first aspect of the present invention, determining that the node meets the first trigger condition includes:
judging whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2; and
if so, determining that the node meets the first trigger condition.
In a sixth possible implementation of the first aspect of the present invention, determining that the node meets the first trigger condition includes:
judging whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2;
if so, determining, according to the sizes of the first files, whether the first files meet a first merging condition; and
if so, determining that the node meets the first trigger condition.
In a seventh possible implementation of the first aspect of the present invention, determining that the target partition meets the second trigger condition includes:
judging whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2; and
if so, determining that the target partition meets the second trigger condition.
In an eighth possible implementation of the first aspect of the present invention, determining that the target partition meets the second trigger condition includes:
judging whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2;
if so, determining, according to the sizes of the second files, whether the second files meet a second merging condition; and
if so, determining that the target partition meets the second trigger condition.
In a ninth possible implementation of the first aspect of the present invention, partitioning, according to the key value information of the user data, the interval that each node in the cluster system is responsible for includes:
if the interval that the cluster system is responsible for is (min, min+2^127], dividing that interval by 2^N to obtain a partition step S, S = 2^127/2^N, where min denotes the minimum hash value of the interval that the cluster system is responsible for, min+2^127 denotes the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1; and
for each node, if the interval that the node is responsible for is (r1, r2], partitioning the node according to the partition step S to obtain |r1 - r2|/S intervals, where each pair of sequentially adjacent boundaries forms a left-open, right-closed interval (rn, rn+1], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each such interval is one partition of the node, r1 denotes the minimum hash value of the interval that the node is responsible for, r2 denotes the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
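The partition step and the resulting node partitions can be worked through with a short Python sketch. It assumes that the width of the node interval is an exact multiple of S, and it uses a deliberately large partition factor so the numbers stay small; the function names are hypothetical.

```python
from typing import List, Tuple


def partition_step(n_factor: int, ring_bits: int = 127) -> int:
    """Partition step S = 2^127 / 2^N for partition factor N."""
    return 2 ** ring_bits // 2 ** n_factor


def node_partitions(r1: int, r2: int, step: int) -> List[Tuple[int, int]]:
    """Split the node interval (r1, r2] into left-open, right-closed pieces."""
    # Assumes (r2 - r1) is a multiple of step, so |r1 - r2| / S pieces result.
    return [(lo, lo + step) for lo in range(r1, r2, step)]


S = partition_step(n_factor=125)              # 2^127 / 2^125 = 4
parts = node_partitions(r1=0, r2=16, step=S)  # 16 / 4 = 4 partitions
```

Each tuple (lo, hi) stands for the interval (lo, hi], so a hash value equal to a boundary belongs to the partition that ends at that boundary.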
A second aspect of the present invention provides a file merging device, including:
a partitioning module, configured to partition, according to key value information of user data, the interval that each node in a cluster system is responsible for, where each partition of each node corresponds to key values of the user data;
a file reading module, configured to, for each node, determine that the node meets a first trigger condition and read at least two first files from the disk of the node into the cache corresponding to the node, where the first files have not been merged, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
a partition determining module, configured to determine, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs; and
a file merging module, configured to merge, according to each key value, the data that have the same key value in the first files, and to store the merged data corresponding to each key value in the corresponding target partition.
In a first possible implementation of the second aspect of the present invention, the partition determining module is specifically configured to:
calculate the hash value corresponding to each key value according to the key values of the data corresponding to the users; and
determine the target partition to which each key value belongs according to the hash value corresponding to each key value.
In a second possible implementation of the second aspect of the present invention, the file reading module is further configured to:
determine that the target partition meets a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the key values of the data corresponding to different users are different; and
the file merging module is further configured to merge, according to the key values of the data corresponding to the users, the data that have the same key value in the second files, and to store the merged data corresponding to each key value in a third file of the target partition.
With reference to the second aspect of the present invention and the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect of the present invention, the device further includes a receiving module, a key value obtaining module, and a query module, where:
the receiving module is configured to receive a data query request;
the key value obtaining module is configured to obtain the key value of the data to be queried according to the data query request when the receiving module receives the data query request;
the partition determining module is further configured to determine, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located; and
the query module is configured to scan all the files in the partition to be queried according to the key value of the data to be queried, and to obtain the data to be queried that corresponds to that key value.
In a fourth possible implementation of the second aspect of the present invention, the partition determining module is specifically configured to:
calculate the hash value corresponding to the key value of the data to be queried; and
determine the partition to be queried in which the data to be queried is located according to the hash value corresponding to the key value of the data to be queried.
In a fifth possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
judge whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2; and
if so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In a sixth possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
judge whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2;
if so, determine, according to the sizes of the first files, whether the first files meet a first merging condition; and
if so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In a seventh possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
judge whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2; and
if so, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In an eighth possible implementation of the second aspect of the present invention, the file reading module is specifically configured to:
judge whether the number of second files stored in the target partition reaches a preset second file merge number M2, where M2 is a positive integer greater than or equal to 2;
if so, determine, according to the sizes of the second files, whether the second files meet a second merging condition; and
if so, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In a ninth possible implementation of the second aspect of the present invention, the partitioning module is specifically configured to:
if the interval that the cluster system is responsible for is (min, min+2^127], divide that interval by 2^N to obtain a partition step S, S = 2^127/2^N, where min denotes the minimum hash value of the interval that the cluster system is responsible for, min+2^127 denotes the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1; and
for each node, if the interval that the node is responsible for is (r1, r2], partition the node according to the partition step S to obtain |r1 - r2|/S intervals, where each pair of sequentially adjacent boundaries forms a left-open, right-closed interval (rn, rn+1], 0 < n < |r1 - r2|/S - 1, n is a positive integer, each such interval is one partition of the node, r1 denotes the minimum hash value of the interval that the node is responsible for, r2 denotes the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
In the file merging method and device of the embodiments of the present invention, the interval that each node is responsible for is partitioned. When the first trigger condition is met, one round of merging is triggered on the files on the node that have not been merged. After this round of merging, user data with the same key value is stored in the same partition, which reduces the granularity at which data is stored. During a query, the partition in which the data is located is first determined according to the key value, and the data files of that partition are then scanned for the required data. Because a partition contains few files, data screening needs to be performed on only a few files, which improves read performance.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below evidently show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of data storage in a non-relational database;
Fig. 2 is a diagram of a cluster system;
Fig. 3 is a schematic diagram of file merging in the prior art;
Fig. 4 is a flowchart of Embodiment 1 of the file merging method of the present invention;
Fig. 5 is a flowchart of Embodiment 2 of the file merging method of the present invention;
Fig. 6 is a schematic diagram of the node partitioning applicable to this embodiment;
Fig. 7 is a schematic structural diagram of Embodiment 1 of the file merging device of the present invention;
Fig. 8 is a schematic structural diagram of Embodiment 2 of the file merging device of the present invention;
Fig. 9 is a schematic structural diagram of Embodiment 3 of the file merging device of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings in the embodiments. The described embodiments are evidently some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Before the embodiments of the present invention are introduced, the scenario to which they apply is briefly described. The embodiments are mainly applicable to non-relational databases, which can be implemented with consistent hashing. In consistent hashing, all the hash values of the hash function form a hash ring whose ends are joined: the first position of the hash ring is the maximum value and the last is the minimum value, that is, the maximum value is connected to the minimum value. All the hash values on the hash ring constitute the interval of the non-relational database cluster system, and each node in the cluster is responsible for part of the interval of the hash ring; that is, the hash values corresponding to the data stored on each node must fall within the interval that the node is responsible for. The data to be stored is hashed as well, and the node responsible for storing the data can be found from the hash value, which establishes the correspondence between stored data and nodes.
Fig. 2 is a diagram of a cluster system. As shown in Fig. 2, the cluster system has four nodes. The four large circles on the hash ring in the figure, which are the circles indicated by the dashed arrows, represent the four nodes: node 1, node 2, node 3, and node 4. Each node corresponds to one cache and one disk, and the small circles between the large circles represent user data. The interval of hash values that the cluster system is responsible for is (0, 2^32]. In the clockwise direction, the interval between node 4 and node 1 is the interval that node 1 is responsible for, the interval between node 1 and node 2 is the interval that node 2 is responsible for, the interval between node 2 and node 3 is the interval that node 3 is responsible for, and the interval between node 3 and node 4 is the interval that node 4 is responsible for. The hash values in the intervals that the nodes are responsible for are different. When a node receives data, it first obtains the key value information of the user to which the data corresponds, calculates the hash value corresponding to the key value, determines which node's interval the hash value falls in, and stores the user data on that node. Fig. 2 is only an example; a cluster system may have more nodes.
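The node lookup described for Fig. 2 can be sketched in Python as follows. The node positions on the ring are hypothetical, since the figure description gives no values, and CRC32 is used only as a stand-in 32-bit hash because the text does not name a hash function.

```python
import bisect
import zlib
from typing import List, Tuple

# Hypothetical node positions on the ring (0, 2^32]; Fig. 2 gives no values.
RING: List[Tuple[int, str]] = [
    (2 ** 30, "node1"),
    (2 ** 31, "node2"),
    (3 * 2 ** 30, "node3"),
    (2 ** 32, "node4"),
]
POSITIONS = [pos for pos, _ in RING]


def node_for_hash(h: int) -> str:
    """Return the node whose interval (previous position, position] holds h."""
    i = bisect.bisect_left(POSITIONS, h)
    return RING[i % len(RING)][1]  # wrap around past the largest position


def node_for_key(key: str) -> str:
    # CRC32 is only a stand-in 32-bit hash; the text does not name one.
    return node_for_hash(zlib.crc32(key.encode("utf-8")))
```

With these positions, node 1 is responsible for the interval between node 4 and node 1 in the clockwise direction, which matches the description above.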
Fig. 4 is a flowchart of Embodiment 1 of the file merging method of the present invention. The method provided in this embodiment is executed by each node separately, and each node merges the files on itself according to the method. As shown in Fig. 4, the file merging method provided in this embodiment includes the following steps:
Step 101: Partition, according to key value information of user data, the interval that each node in the cluster system is responsible for.
Partitioning the interval that a node is responsible for means dividing that interval into smaller intervals according to a certain rule. Each interval obtained by the division is a partition, and the partition is a subinterval of that interval; in the embodiments of the present invention, a partition is a subinterval of some interval. In the method provided in this embodiment, the interval that each node is responsible for is further divided into smaller partitions, each partition is responsible for a smaller interval, and each partition corresponds to key values of the data.
Step 102: For each node, determine that the node meets a first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, where the first files have not been merged, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different.
Each node has a dedicated module responsible for file merging. This module determines whether the node meets the first trigger condition and triggers a file merging task when it does. Specifically, determining whether the node meets the first trigger condition means judging whether the number of first files stored on the node reaches a preset first file merge number M1, where M1 is a positive integer greater than or equal to 2; if so, the node is determined to meet the first trigger condition. For example, if the preset first file merge number is 4, the file merging task is triggered when the number of first files on the node reaches 4. Here a first file is a file that has not been merged. When the node receives user data, it stores the user data in files on the node; the node also stores some files that have already been merged. If the files on the node ultimately meet the first trigger condition, the merging task is triggered.
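The count-based trigger can be expressed as a small Python check. This is a minimal sketch: the function name is hypothetical, the files are represented only by their names, and "reaches" is interpreted as "greater than or equal to".

```python
from typing import List

M1 = 4  # preset first file merge number; 4 is the value used in the example


def meets_first_trigger(unmerged_files: List[str], m1: int = M1) -> bool:
    """True once the node holds at least m1 files that were never merged."""
    if m1 < 2:
        raise ValueError("M1 must be an integer greater than or equal to 2")
    return len(unmerged_files) >= m1
```

Files that have already been merged are deliberately not passed to the check, since only first files count toward M1.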
Of course, the first trigger condition may also include other preset conditions. For example, when the number of first files on the node reaches the preset file merging number, it is further determined, according to the size of each first file, whether the first files meet a first merging condition; if so, the node is determined to meet the first trigger condition. Determining whether the first files meet the first merging condition according to their sizes specifically means judging whether the size difference between the first files is within a preset threshold: if it is, the file merging task is triggered; if the sizes of the first files differ greatly, no merging is performed. As an example, suppose there are four first files, denoted 1, 2, 3 and 4, where file 1 is 100M, file 2 is 200M, file 3 is 300M and file 4 is 50M. After file 1 and file 2 are read, the average of their sizes, 150M, is taken and multiplied by a maximum weighting factor and a minimum weighting factor. Typically the maximum weighting factor is 1.5 and the minimum weighting factor is 0.5, so in this embodiment multiplying the average by the maximum and minimum weighting factors gives 225M and 75M respectively. If the size of file 3 falls within the interval [75M, 225M], its size is suitable and it meets the merging condition, so it can be merged with files 1 and 2. In this embodiment file 3 is 300M, which is outside [75M, 225M], so it does not meet the merging condition. The size of file 4 is then compared in the same way. File 4 is 50M, which does not fall within [75M, 225M]; however, because file 4 is itself very small, merging it takes few resources. Therefore, for small files that do not fall within the above interval, a threshold can be set, for example 50M, and a file smaller than this threshold is merged directly. This is only one example of determining whether the file size difference meets the preset threshold; other methods can of course be used and are not enumerated here.
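The size check in the example above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name and parameters are hypothetical, the baseline is assumed to stay fixed at the average of the first two files, and a file whose size equals the small-file threshold is assumed to qualify (so that the 50M file in the example is merged).

```python
def select_files_to_merge(sizes_mb, max_factor=1.5, min_factor=0.5, small_file_mb=50):
    """Return the indexes of the files accepted for merging under the size rule."""
    if len(sizes_mb) < 2:
        return []
    # Baseline: average size of the first two files that were read.
    average = (sizes_mb[0] + sizes_mb[1]) / 2
    low, high = average * min_factor, average * max_factor
    selected = [0, 1]
    for index in range(2, len(sizes_mb)):
        size = sizes_mb[index]
        # Accept files inside [low, high]; assumption: "small" means size <= threshold.
        if low <= size <= high or size <= small_file_mb:
            selected.append(index)
    return selected


# Files 1..4 from the example: 100M, 200M, 300M, 50M.
print(select_files_to_merge([100, 200, 300, 50]))  # [0, 1, 3]: file 3 (300M) is skipped
```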
Step 103, determine the target partition to which each key value belongs according to the key value of each user's data in each first file.
After the first files to be merged are read, the target partition to which each key value belongs is determined according to the key value of each user's data in the first files; that is, it is judged which partition the key value of each user's data falls into. Specifically: first, the hash value corresponding to each key value is calculated from the key value of each user's data; then, the target partition to which each key value belongs is determined according to that hash value. The hash values of different key values may fall into different partitions.
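A minimal sketch of this two-stage lookup is given here. The text does not name a hash algorithm, so MD5 reduced to a 2^127 space is an illustrative assumption, as are the function names and the representation of a node's partitions as a sorted list of upper bounds.

```python
import bisect
import hashlib
from typing import List, Optional

RING_BITS = 127  # assumption: hash values live in a space of size 2^127


def key_hash(key: str) -> int:
    """Map a key value to an integer hash (MD5 is an illustrative choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def partition_for_hash(hash_value: int, lower: int, upper_bounds: List[int]) -> Optional[int]:
    """Return the index of the left-open, right-closed partition holding hash_value.

    upper_bounds holds the sorted upper bound of each partition; lower is the
    lower bound of the node's whole interval.
    """
    if not (lower < hash_value <= upper_bounds[-1]):
        return None  # outside the interval this node is responsible for
    # The first partition whose upper bound is >= hash_value contains it.
    return bisect.bisect_left(upper_bounds, hash_value)
```

For instance, with a node interval (10, 30] split at upper bounds 15, 25 and 30, `partition_for_hash(15, 10, [15, 25, 30])` selects the first partition and `partition_for_hash(16, 10, [15, 25, 30])` the second.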
Step 104, merge the data having the same key value in the first files according to the key values, and store the merged data for each key value in the corresponding target partition.
In this step, data having the same key value in the first files is merged. For example, if user A's data is stored in file 1, file 2 and file 3, user A's data is read from each of these three files; since user A's data has the same key value in all three files, it is merged and then stored in the target partition, which is determined by the hash value of the key value of user A's data. After merging is complete, the merged data can be stored in a second file of the target partition, for example a second file of target partition A. Each file can correspond to a static statistics file, which stores information about that file, such as the time the data was written and the size of the file. In this embodiment, the correspondence between the second file and its target partition is saved in the static statistics file. Later, when the node restarts, the second file can be loaded directly into the partition file list of the target partition according to the static statistics file. The partition file list records information about all files in the target partition, such as the storage address of each file on disk. When the node restarts, all files in the target partition are loaded into the partition file list of that target partition according to the static statistics files. When a data query is performed, the partition file list is found according to the correspondence between the second file and its target partition in the static statistics file, and data is then read from disk according to the partition file list.
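The restart behaviour described above can be sketched as rebuilding one file list per partition from the static statistics records. The record layout and all names here are hypothetical; the text only says that a static statistics file holds information such as write time, file size and the file-to-partition correspondence.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StaticStats:
    """Hypothetical contents of one static statistics file."""
    file_name: str
    partition: str      # target partition the file belongs to
    disk_address: str   # where the file is stored on disk
    size_bytes: int
    write_time: float


def rebuild_partition_file_lists(stats_records):
    """Group files by partition, as a node might do when it restarts."""
    file_lists = defaultdict(list)
    for record in stats_records:
        file_lists[record.partition].append(record)
    return dict(file_lists)
```

A query for a key whose target partition is known then only needs the list stored under that partition to find the disk addresses to read.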
It should be noted that if user A's data is stored only in file 1, and files 2 and 3 hold no data of user A, then no actual merging takes place for user A's data during the merge; user A's data is simply stored in a second file of the target partition corresponding to its key value. Data is merged only when at least two first files store data with the same key value.
In the method provided by this embodiment, the interval that each node is responsible for is partitioned. When the first trigger condition is met, merging of the unmerged files on the node is triggered. Each file stores data corresponding to at least one user, and the key values of different users' data are different. The hash value of each key value is calculated from the key value of each user's data, and the target partition to which that hash value belongs is determined; the data having the same key value in the files to be merged is then merged and stored in the target partition to which the hash value of that key value belongs. By partitioning the interval within a node and then performing one round of data merging, user data with the same key value is stored in the same partition, which reduces the granularity at which data is stored. At query time, the partition holding the data is first determined from the key value, and the data files of that partition are scanned for the required data. Because each partition contains few files, data only needs to be filtered from a small number of files, which improves read performance.
Fig. 5 is a flowchart of Embodiment 2 of the file merging method of the present invention. The method provided by this embodiment is executed by each node separately, and each node merges the files on itself according to the method as follows. Compared with Embodiment 1, this embodiment builds on Embodiment 1 and further merges the files in a partition when the partition meets a merging condition. As shown in Fig. 5, the file merging method provided by this embodiment includes the following steps:
Step 201, partition the interval that each node in the cluster system is responsible for, according to key value information of user data.
Partitioning the interval that a node is responsible for means dividing that interval into smaller intervals according to a certain rule; each interval after division is a partition. In this embodiment, partitioning can be performed as follows: if the interval that the cluster system is responsible for is (min, min+2^127], that interval is split according to 2^N to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval that the cluster system is responsible for, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, and N is the partition factor, a positive integer greater than or equal to 1 that can be configured according to actual needs.
For each node, if the interval that the node is responsible for is (r1, r2], the node is partitioned according to the partition step S, giving |r1 - r2|/S intervals. Every two sequentially adjacent boundary points form a left-open, right-closed interval (r_n, r_{n+1}], where 0 < n < |r1 - r2|/S - 1 and n is a positive integer. Each interval is one partition of the node. r1 is the minimum hash value of the interval that the node is responsible for, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
Only one partitioning method is listed here. The partitions may be of equal or different sizes, and the present invention is not limited in this respect.
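The equal-step partitioning rule can be sketched as below. The function names are hypothetical, and the handling of an interval whose width is not a multiple of S (a shorter final partition) is an assumption that the text does not address.

```python
def partition_step(n: int, ring_bits: int = 127) -> int:
    """Partition step S = 2^ring_bits / 2^N for partition factor N >= 1."""
    return 2 ** ring_bits // 2 ** n


def split_node_interval(r1: int, r2: int, step: int):
    """Split (r1, r2] into consecutive left-open, right-closed partitions of width step.

    Each tuple (low, high) stands for the interval (low, high].
    """
    partitions = []
    low = r1
    while low < r2:
        # Assumption: a shorter final partition absorbs any remainder.
        high = min(low + step, r2)
        partitions.append((low, high))
        low = high
    return partitions


# Small illustrative numbers: the interval (10, 30] with step 5 gives |10 - 30| / 5 = 4 partitions.
print(split_node_interval(10, 30, 5))
```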
A specific example follows. Fig. 6 is a schematic diagram of node partitioning to which this embodiment applies. As shown in Fig. 6, the circle in the upper part of the figure represents the whole cluster system, which has four nodes: node A, node B, node C and node D. The range of hash values of the interval that node A is responsible for is (70, 10], where 70 and 10 denote hash values; the interval that node B is responsible for is (10, 30], that of node C is (30, 50], and that of node D is (50, 70]. Taking node B as an example, the interval (10, 30] that node B is responsible for is first divided into three partitions (slices): partition B1, partition B2 and partition B3, where the interval of partition B1 is (10, 15], that of partition B2 is (15, 25], and that of partition B3 is (25, 30]. There are 5 first files to be merged on node B, and these 5 files are merged into 3 second files. The 3 second files belong to different partitions, and each is stored in its corresponding target partition. In Fig. 6, the direction of each arrow indicates the target partition to which each second file belongs.
Suppose the 5 first files to be merged on node B are file 1, file 2, file 3, file 4 and file 5. File 1 stores the data of three users: the key value of user A1's data is a1, the key value of user A2's data is a2, and the key value of user A3's data is a3. File 2 stores the data of users A1 and A2. File 3 stores the data of user A1 and also the data of users A4 and A5, where the key value of user A4's data is a4 and the key value of user A5's data is a5. File 4 stores the data of users A3 and A4, and file 5 stores the data of users A2, A3, A4 and A5. Take the merging of the data whose key value is a1 as an example. First, the hash value of key value a1 is calculated with a hash algorithm, and it is then determined which partition that hash value falls into. Assuming the hash value of a1 falls into partition B1, partition B1 is determined to be the target partition of key value a1. Finally, the data with key value a1 in file 1, file 2 and file 3 is merged, and the merged data is stored in file 6 on partition B1. The data with key values a2, a3, a4 and a5 is merged in the same way. Assume that the target partition of key values a2 and a5 is partition B2, the target partition of key values a1 and a3 is partition B1, and the target partition of key value a4 is partition B3. Then the merged data of key values a1 and a3 is stored in file 6, which belongs to partition B1; the merged data of key values a2 and a5 is stored in file 7, which belongs to B2; and the merged data of key value a4 is stored in file 8, which belongs to B3. Files 6, 7 and 8 are all second files. Through this merging, the data in the five first files is merged into three second files, each belonging to a different partition, so that after merging the data of users A1 and A3 is stored on partition B1, the data of users A2 and A5 on partition B2, and the data of user A4 on partition B3.
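The Fig. 6 example can be replayed with a short sketch. The key-to-partition mapping is taken directly from the assumptions stated in the example rather than computed from a hash; the record strings are placeholders, and merging is assumed to mean collecting all records of a key value into one list, since the text does not define the merge operation itself.

```python
from collections import defaultdict

# First files on node B: file name -> {key value: placeholder record}.
first_files = {
    "file1": {"a1": "A1@file1", "a2": "A2@file1", "a3": "A3@file1"},
    "file2": {"a1": "A1@file2", "a2": "A2@file2"},
    "file3": {"a1": "A1@file3", "a4": "A4@file3", "a5": "A5@file3"},
    "file4": {"a3": "A3@file4", "a4": "A4@file4"},
    "file5": {"a2": "A2@file5", "a3": "A3@file5", "a4": "A4@file5", "a5": "A5@file5"},
}

# Target partitions as assumed in the example (stands in for the hash lookup).
TARGET_PARTITION = {"a1": "B1", "a3": "B1", "a2": "B2", "a5": "B2", "a4": "B3"}


def merge_first_files(files, target_partition):
    """Merge records with the same key value and route them to their target partition."""
    merged = defaultdict(lambda: defaultdict(list))
    for file_name in sorted(files):
        for key, record in files[file_name].items():
            merged[target_partition[key]][key].append(record)
    # One second file per partition: partition -> {key value: merged records}.
    return {partition: dict(keys) for partition, keys in merged.items()}


second_files = merge_first_files(first_files, TARGET_PARTITION)
for partition in sorted(second_files):
    print(partition, sorted(second_files[partition]))
```

Running it yields three second files: B1 holds a1 and a3, B2 holds a2 and a5, and B3 holds a4, matching files 6, 7 and 8 in the example.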
Step 202, for each node, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
In this step, a first file is a file that has not been merged; each first file stores data corresponding to at least one user, and the key values of different users' data are different. For the specific implementation, refer to the description of step 102 in Embodiment 1.
Step 203, determine the target partition to which each key value belongs according to the key value of each user's data in each first file.
The data of different users can be distinguished by key value. For each user, all of that user's data has the same key value; user data and key values are in one-to-one correspondence, and a user's data is found by key value at query time. In this embodiment, determining the target partition to which each key value belongs according to the key value of each user's data is specifically as follows: first, the hash value of each key value is calculated from the key value of each user's data; after the hash values are obtained, it is judged which partition each hash value belongs to. Each partition is responsible for one interval, and the partition to which the hash value of a key value belongs is the target partition of that key value.
Step 204, merge the data having the same key value in the first files according to the key values, and store the merged data for each key value in the corresponding target partition.
In this step, the data having the same key value in the first files is merged, and the merged data is stored in a second file of the target partition. After the merge is complete, the relationship between the second file and its target partition is saved in the static statistics file corresponding to the second file; each file corresponds to one static statistics file. Later, when the node restarts, the second file can be loaded directly into the partition file list of the target partition according to the static statistics file. When a data query is performed, the partition file list of the target partition is found according to the relationship between the second file and its target partition recorded in the static statistics file, and data is then read from disk according to that partition file list.
Step 205, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
In steps 201 to 204 above, the interval within each node is partitioned, which reduces the granularity at which data is stored, and the files that have not been merged are merged once; this first merge distributes the data across different partitions. In this step, when the files in a partition meet the merging condition, the files in that partition are merged again. Each partition is independent, and a merge involves only the files within that partition. Specifically, when the target partition meets the second trigger condition, at least two second files to be merged in the target partition are read from the disk of the node where the target partition is located into the cache. Each second file stores data corresponding to at least one user, and the key values of different users' data are different.
In this embodiment, determining whether the target partition meets the second trigger condition specifically means judging whether the number of second files stored on the target partition reaches a preset second file merging number M2, where M2 is a positive integer greater than or equal to 2. Of course, the second trigger condition may also include other preset conditions. For example, when the number of second files on the target partition reaches the preset second file merging number, it is further determined, according to the size of each second file, whether the second files meet a second merging condition. This specifically means judging whether the size difference between the second files is within a preset maximum threshold: if it is, the file merging task is triggered; if the sizes of the second files differ greatly, no merging is performed. In this embodiment, whether the size difference between the second files is within the preset maximum threshold can be judged with the method used in Embodiment 1 for the first files, which is not repeated here, although the specific parameter settings may differ. For example, if one second file is 500M and another second file is 30M, they are not merged. The preset file merging number M2 can be set according to actual needs: if the requirement on query speed is high, M2 can be set to a smaller value to increase the number of merges, which reduces the number of files and improves query efficiency.
Step 206, merge the data having the same key value in the second files according to the key value of each user's data, and store the merged data for each key value in a third file of the target partition.
In this step, the data having the same key value in the second files to be merged in the target partition is merged into the same third file, so that the data within a partition is highly aggregated. Fig. 3 is a schematic diagram of file merging in the prior art. As shown in Fig. 3, once the number of files exceeds 4, a merging task is initiated and the 4 files are merged into a new file; if new data files are generated later, data files of similar size are selected for merging by comparing sizes. Specifically, all sstables under the same table are grouped by size, with sstables of similar size placed in one group, forming n groups (n >= 1); the group with the smallest average size is then taken from these n groups to form a task that performs the compaction operation. The size of each group must be within the range (4, 32], and groups with too many files are truncated. The data with the same key value is then merged together in a map-reduce manner to form a new data file. This method supports multi-threaded concurrent execution of multiple tasks, with each task selecting different data files.
However, the prior art has the following problems. The threshold for triggering data file merging is low: 4 files are enough to trigger it, so merges happen frequently in daily operation, and a large number of file merges occupies considerable memory, CPU, I/O and other resources. Moreover, after data files have gone through many rounds of merging, the key values are already highly consolidated; a further merge then mostly just relocates key-value data and performs only a small amount of actual merging, which wastes system resources. The method provided by this embodiment splits large files into small partition files and merges the files in a partition only when they meet the merging condition. Because the number of files in a partition is small compared with the number of files on a node in the prior art, the number of merges can be reduced and merging efficiency improved. When data is evenly distributed, the files stored on each partition are of the same size, and the merging efficiency for the same amount of data can be improved by more than 30%.
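The partition-local second merge can be sketched as follows. Only the count-based part of the second trigger condition is modelled; the size-based second merging condition is omitted, and the function names and the in-memory representation of files as dictionaries are hypothetical.

```python
def compact_partition(second_files, m2):
    """Merge a partition's second files into one third file once M2 files exist.

    second_files is a list of {key value: [records]} dictionaries.
    Returns the third file, or None when the trigger condition is not met.
    """
    if len(second_files) < m2:
        return None
    third_file = {}
    for file_data in second_files:
        for key, records in file_data.items():
            third_file.setdefault(key, []).extend(records)
    return third_file


def compact_node(partitions, m2):
    """Apply the merge to each partition independently of the others."""
    result = {}
    for name, files in partitions.items():
        third_file = compact_partition(files, m2)
        # Partitions below the threshold keep their second files untouched.
        result[name] = [third_file] if third_file is not None else files
    return result
```

Because each partition is checked on its own, a partition with few files is never rewritten just because another partition on the same node has accumulated enough files to merge.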
Step 207, when a data query request is received, obtain the key value of the data to be queried according to the data query request, determine the partition to be queried where the data to be queried is located according to that key value, and scan all files in the partition to be queried according to the key value to obtain the data to be queried corresponding to the key value.
During a data query, the partition where the data to be queried is located is first determined according to the key value of that data; the data files in that partition are then filtered and scanned, and the qualifying results are returned. Specifically, when a data query request is received, the key value information of the data to be queried is parsed out, the hash value corresponding to that key value is calculated, and the partition to be queried, where the data is located, is determined according to the hash value. All files in the partition to be queried are then scanned according to the key value of the data to be queried, and the data corresponding to the key value is obtained and returned.
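The query path can be sketched as follows. The lookup from key value to partition is passed in as a function so that the sketch stays independent of any particular hash algorithm; the fixed mapping, the sample records and all names are hypothetical.

```python
def query(key, partition_of, partition_files):
    """Return (partition, records) for key, scanning only that partition's files."""
    partition = partition_of(key)
    records = []
    for file_data in partition_files.get(partition, []):
        # Filter each file in the partition for the requested key value.
        records.extend(file_data.get(key, []))
    return partition, records


# Hypothetical stand-in for "hash the key value, then look up the partition".
ASSUMED_PARTITION = {"a1": "B1", "a3": "B1", "a2": "B2", "a5": "B2", "a4": "B3"}

partition_files = {
    "B1": [{"a1": ["r1", "r2"], "a3": ["r3"]}],
    "B2": [{"a2": ["r4"]}, {"a5": ["r5"], "a2": ["r6"]}],
    "B3": [{"a4": ["r7"]}],
}

partition, records = query("a2", ASSUMED_PARTITION.get, partition_files)
print(partition, records)
```

Only the two files of partition B2 are scanned for key value a2; the files of B1 and B3 are never opened.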
The method provided by this embodiment can also improve query efficiency. In the original query scan approach, every file on disk has to be checked once during a query to determine whether the data to be queried is in that data file, which makes queries inefficient; data must also be read from disk frequently, which increases the I/O overhead of the system and occupies excessive CPU and memory, wasting resources. Assuming that data is evenly distributed, the data volume is 100G and the query depth is 1000, if a node is divided into four partitions, query efficiency with partitioning can be improved by 25%.
In the method provided by this embodiment, partitioning the interval within each node reduces the granularity at which data is stored. A first merge distributes the data across different partitions, and when a partition meets the merging condition, the files in that partition are merged so that data with the same key value in the partition is merged into one file, which increases the degree of aggregation of the files in the partition. During a query, the partition where the data to be queried is located is first determined according to the key value, the data files in that partition are filtered and scanned, and the qualifying results are returned. Because each partition contains few files and the data in each file is highly aggregated, the number of lookups is reduced and query efficiency is improved.
Fig. 7 is a schematic structural diagram of Embodiment 1 of the file merging device of the present invention. The file merging device provided by this embodiment can be integrated in the cluster system, or can of course be deployed independently on each node. As shown in Fig. 7, the file merging device provided by this embodiment includes: a partitioning module 31, a file reading module 32, a partition determining module 33 and a file merging module 34.
The partitioning module 31 is configured to partition, according to key value information of user data, the interval that each node in the cluster system is responsible for, where the partitions of each node correspond to the key values of the user data;
The file reading module 32 is configured to, for each node, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, where none of the first files has been merged, each first file stores data corresponding to at least one user, and the key values of different users' data are different;
The partition determining module 33 is configured to determine the target partition to which each key value belongs according to the key value of each user's data in each first file;
The file merging module 34 is configured to merge the data having the same key value in the first files according to the key values, and store the merged data for each key value in the corresponding target partition.
In this embodiment, the partition determining module 33 is specifically configured to: calculate the hash value corresponding to each key value according to the key value of each user's data; and determine the target partition to which each key value belongs according to the hash value corresponding to that key value.
The file merging device provided by this embodiment can be used to execute the technical solution provided by method Embodiment 1; the specific implementation and technical effects are similar and are not repeated here.
Fig. 8 is a schematic structural diagram of Embodiment 2 of the file merging device of the present invention. The file merging device provided by this embodiment can be integrated in the cluster system, or can of course be deployed independently on each node. As shown in Fig. 8, the file merging device provided by this embodiment includes: a partitioning module 41, a file reading module 42, a partition determining module 43 and a file merging module 44.
The partitioning module 41 is configured to partition, according to key value information of user data, the interval that each node in the cluster system is responsible for, where the partitions of each node correspond to the key values of the user data;
The file reading module 42 is configured to, for each node, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, where none of the first files has been merged, each first file stores data corresponding to at least one user, and the key values of different users' data are different;
The partition determining module 43 is configured to determine the target partition to which each key value belongs according to the key value of each user's data in each first file;
The file merging module 44 is configured to merge the data having the same key value in the first files according to the key values, and store the merged data for each key value in the corresponding target partition.
In this embodiment, the partitioning module 41 partitions each node of the cluster system as follows: if the interval that the cluster system is responsible for is (min, min+2^127], that interval is split according to 2^N to obtain the partition step S, S = 2^127/2^N, where min is the minimum hash value of the interval that the cluster system is responsible for, min+2^127 is the maximum hash value of that interval, min is an integer greater than or equal to 0, and N is the partition factor, a positive integer greater than or equal to 1. For each node, if the interval that the node is responsible for is (r1, r2], the node is partitioned according to the partition step S, giving |r1 - r2|/S intervals; every two sequentially adjacent boundary points form a left-open, right-closed interval (r_n, r_{n+1}], where 0 < n < |r1 - r2|/S - 1 and n is a positive integer. Each interval is one partition of the node. r1 is the minimum hash value of the interval that the node is responsible for, r2 is the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
The file reading module 42 is specifically configured to: judge whether the number of first files stored on the node reaches a preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; if so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node. Alternatively, the file reading module 42 is specifically configured to: judge whether the number of first files stored on the node reaches the preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; if so, determine according to the size of each first file whether the first files meet the first merging condition; if they do, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
The partition determining module 43 is specifically configured to: calculate the hash value corresponding to each key value according to the key value of each user's data; and determine the target partition to which each key value belongs according to the hash value corresponding to that key value.
For each node, the first merge distributes the files across the partitions of the node. In this embodiment, when a partition meets the merging condition, the files in that partition are merged further, so that the files in each partition are highly aggregated. Therefore, in this embodiment, the file reading module 42 is further configured to: determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user, and the key values of different users' data are different. Correspondingly, the file merging module 44 is further configured to merge the data having the same key value in the second files according to the key value of each user's data, and store the merged data for each key value in a third file of the target partition.
In this embodiment, the file reading module 42 determines whether the target partition meets the second trigger condition in either of the following two ways. In the first way, it judges whether the number of second files stored in the target partition reaches a preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, it determines that the target partition meets the second trigger condition, and reads at least two second files in the target partition from the disk of the node where the target partition is located into the cache. In the second way, it judges whether the number of second files stored in the target partition reaches the preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, it determines, according to the size of each second file, whether the second files meet a second merging condition; if they do, it determines that the target partition meets the second trigger condition, and reads at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
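The two ways of checking the trigger condition can be sketched in one function. The count check follows the text directly. The size-based merging condition is not specified in this passage, so the rule used here (the M2 smallest files must be of similar size, bounded by a hypothetical `max_size_ratio`) is purely an assumption for illustration, as is the function name `meets_trigger`.

```python
from typing import List, Optional


def meets_trigger(file_sizes: List[int], merge_count: int,
                  max_size_ratio: Optional[float] = None) -> bool:
    """Return True if a partition (or node) should be merged.

    file_sizes     -- sizes of the files currently stored
    merge_count    -- the preset file merging number (M2 or M1), at least 2
    max_size_ratio -- None selects the first way (count only); a number
                      selects the second way with an ASSUMED size rule
    """
    if merge_count < 2:
        raise ValueError("the file merging number must be at least 2")
    if len(file_sizes) < merge_count:
        return False  # not enough files yet
    if max_size_ratio is None:
        return True  # first way: the file count alone decides
    # Second way (assumed rule): the merge_count smallest files are similar in size.
    candidates = sorted(file_sizes)[:merge_count]
    return candidates[-1] <= max_size_ratio * max(candidates[0], 1)
```

With `merge_count=3`, three files trigger a merge under the first way, while under the assumed size rule a single very large file among three does not.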
Further, the file merging device of this embodiment also includes a receiving module 45, a key value acquiring module 46 and a query module 47. The receiving module 45 is configured to receive a data query request. The key value acquiring module 46 is configured to obtain, when the receiving module receives the data query request, the key value of the data to be queried according to the data query request. The partition determining module 43 is further configured to determine, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located. The query module 47 is configured to scan all files in the partition to be queried according to the key value of the data to be queried, and to obtain the data to be queried that corresponds to that key value.
In this embodiment, the partition determining module 43 determines the partition to be queried as follows: it first calculates the hash value corresponding to the key value of the data to be queried, and then determines, according to that hash value, the partition to be queried in which the data to be queried is located.
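The query path described above (hash the key value, locate the partition, scan only the files of that partition) can be sketched as follows. The text does not name a hash function, so SHA-256 reduced into (0, 2^127] is an assumption, and `key_hash`, `find_partition` and `query` are hypothetical names. Partitions are modelled as left-open, right-closed hash intervals.

```python
import hashlib

HASH_SPACE = 2 ** 127  # size of the hash interval (0, 2^127], taking min = 0


def key_hash(key: str) -> int:
    """Map a key value to an integer in (0, 2^127] (assumed hash choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % HASH_SPACE + 1


def find_partition(hash_value, bounds):
    """Return the index of the (low, high] interval that holds hash_value."""
    for index, (low, high) in enumerate(bounds):
        if low < hash_value <= high:
            return index
    raise KeyError("hash value is outside every partition")


def query(key, bounds, files_by_partition):
    """Scan all files of the key's partition and collect the key's values."""
    index = find_partition(key_hash(key), bounds)
    return [value
            for records in files_by_partition.get(index, [])
            for record_key, value in records
            if record_key == key]


# Two partitions covering the whole hash interval.
bounds = [(0, 2 ** 126), (2 ** 126, 2 ** 127)]
home = find_partition(key_hash("alice"), bounds)
files_by_partition = {
    home: [[("alice", "a1"), ("bob", "b1")], [("alice", "a2")]],
    1 - home: [[("carol", "c1")]],
}
result = query("alice", bounds, files_by_partition)
```

Only the files of one partition are read, which is the source of the read-performance gain that the embodiment describes.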
The file merging device provided in this embodiment can be used to execute the technical solution provided in method embodiment two. Its implementation and technical effects are similar and are not repeated here.
Fig. 9 is a schematic structural diagram of embodiment three of the file merging device of the present invention. The file merging device 500 provided in this embodiment includes a processor 51, a memory 52 and a receiver 53. The memory 52 and the receiver 53 are connected to the processor 51 by a bus. The memory 52 stores execution instructions. When the file merging device 500 runs, the processor 51 communicates with the memory 52, and the processor 51 executes the execution instructions so that the file merging device 500 performs the following operations:
Partitioning, according to the key value information of the user data, the interval that each node in the cluster system is responsible for, where the partitions of each node correspond one-to-one with the key values of the user data;
For each node, determining that the node meets the first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, where none of the first files has been merged, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
Determining, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs;
Merging, according to the key values, the data having the same key value in the first files, and storing the merged data corresponding to each key value in the corresponding target partition.
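The operations above can be sketched as a single pass over the first files. This is an illustration under stated assumptions: files are in-memory lists of (key, value) records, merging means gathering the values of a key in read order, and the mapping from key value to target partition is passed in as a hypothetical callable `partition_of` so that the routing rule stays open.

```python
from collections import defaultdict


def first_merge(first_files, partition_of):
    """Merge unmerged first files and route each key to its target partition.

    first_files  -- list of files, each a list of (key, value) records
    partition_of -- callable mapping a key value to a partition identifier
    Returns {partition: {key: [values...]}}.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for records in first_files:
        for key, value in records:
            # Data with the same key value always lands in the same partition.
            merged[partition_of(key)][key].append(value)
    return {part: dict(keys) for part, keys in merged.items()}


first_files = [
    [("alice", "a1"), ("bob", "b1")],
    [("alice", "a2"), ("carol", "c1")],
]
# A fixed lookup table stands in for the hash-based rule in this example.
routing = {"alice": 0, "bob": 1, "carol": 0}
partitions = first_merge(first_files, routing.__getitem__)
```

Keeping the routing rule as a parameter makes it easy to substitute the hash-based determination of the target partition that the embodiment specifies.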
Determining, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs specifically includes: calculating the hash value corresponding to each key value according to the key values of the data corresponding to the users; and determining, according to the hash value corresponding to each key value, the target partition to which that key value belongs.
The processor 51 is further configured to:
Determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, where each second file stores data corresponding to at least one user and the key values of the data corresponding to different users are different;
Merge the data having the same key value in the second files according to the key values of the data corresponding to the users, and store the merged data having the same key value in a third file of the target partition.
Determining, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located includes: calculating the hash value corresponding to the key value of the data to be queried; and determining, according to that hash value, the partition to be queried in which the data to be queried is located.
In this embodiment, determining whether the node meets the first trigger condition includes: judging whether the number of first files stored on the node reaches a preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; and if so, determining that the node meets the first trigger condition.
Alternatively, judging whether the number of first files stored on the node reaches the preset first file merging number M1, where M1 is a positive integer greater than or equal to 2; if so, determining, according to the size of each first file, whether the first files meet a first merging condition; and if they do, determining that the node meets the first trigger condition.
In this embodiment, determining that the target partition meets the second trigger condition includes: judging whether the number of second files stored in the target partition reaches a preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; and if so, determining that the target partition meets the second trigger condition. Alternatively, judging whether the number of second files stored in the target partition reaches the preset second file merging number M2, where M2 is a positive integer greater than or equal to 2; if so, determining, according to the size of each second file, whether the second files meet a second merging condition; and if they do, determining that the target partition meets the second trigger condition.
In this embodiment, partitioning, according to the key value information of the user data, the interval that each node in the cluster system is responsible for includes:
If the interval that the cluster system is responsible for is (min, min+2^127], splitting that interval into 2^N parts to obtain the partition step S, S = 2^127/2^N, where min denotes the minimum hash value of the interval that the cluster system is responsible for, min+2^127 denotes the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
For each node, if the interval that the node is responsible for is (r1, r2], partitioning the node according to the partition step S to obtain |r1 - r2|/S intervals, where every two sequentially adjacent boundary values form a left-open, right-closed interval (r_n, r_(n+1)], 0 < n < |r1 - r2|/S - 1, n is a positive integer, and each interval is one partition of the node; r1 denotes the minimum hash value of the interval that the node is responsible for, r2 denotes the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
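The step computation and the splitting of a node's interval can be worked through in a few lines. The formula S = 2^127/2^N is taken from the text; the function names are hypothetical, and the sketch assumes that the node's interval length is an exact multiple of S, which the text does not state explicitly.

```python
def partition_step(n_factor: int, space_bits: int = 127) -> int:
    """Partition step S = 2^127 / 2^N for partition factor N >= 1."""
    if not 1 <= n_factor <= space_bits:
        raise ValueError("N must be between 1 and the number of hash bits")
    return 2 ** space_bits // 2 ** n_factor


def split_node_range(r1: int, r2: int, step: int):
    """Split (r1, r2] into |r1 - r2| / S left-open, right-closed partitions."""
    if r2 <= r1:
        raise ValueError("r2 must be greater than r1")
    if (r2 - r1) % step != 0:
        # Assumption of this sketch: node intervals align with the step.
        raise ValueError("interval length must be a multiple of the step")
    count = (r2 - r1) // step
    return [(r1 + i * step, r1 + (i + 1) * step) for i in range(count)]


step = partition_step(125)                  # 2^127 / 2^125 = 4
node_partitions = split_node_range(0, 16, step)
# node_partitions == [(0, 4), (4, 8), (8, 12), (12, 16)]
```

Each tuple (low, high) stands for the interval (low, high], so adjacent partitions share a boundary value without overlapping.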
In this embodiment, the receiver 53 is configured to receive a data query request, and the processor 51 is further configured to: when the data query request is received, obtain the key value of the data to be queried according to the data query request; determine, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located; and scan all files in the partition to be queried according to the key value of the data to be queried, to obtain the data to be queried that corresponds to that key value.
The file merging device provided in this embodiment can be used to execute the methods shown in method embodiment one and method embodiment two. Its implementation and technical effects are similar and are not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are intended only to describe the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (20)
1. A file merging method, characterized by comprising:
partitioning, according to key value information of user data, the interval that each node in a cluster system is responsible for;
for each node, determining that the node meets a first trigger condition, and reading at least two first files from the disk of the node into the cache corresponding to the node, wherein none of the first files has been merged, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
determining, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs;
merging, according to the key values, the data having the same key value in the first files, and storing the merged data corresponding to each key value in the corresponding target partition;
wherein the key values of the data corresponding to the same user are the same, and the key values of the data corresponding to different users are different.
2. The method according to claim 1, characterized in that determining, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs comprises:
calculating the hash value corresponding to each key value according to the key values of the data corresponding to the users;
determining, according to the hash value corresponding to each key value, the target partition to which that key value belongs.
3. The method according to claim 1, characterized in that, after merging, according to the key values, the data having the same key value in the first files and storing the merged data corresponding to each key value in the corresponding target partition, the method further comprises:
determining that the target partition meets a second trigger condition, and reading at least two second files in the target partition from the disk of the node where the target partition is located into the cache, wherein each second file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
merging the data having the same key value in the second files according to the key values of the data corresponding to the users, and storing the merged data corresponding to each key value in a third file of the target partition.
4. The method according to any one of claims 1-3, characterized in that, after merging, according to the key values, the data having the same key value in the first files and storing the merged data corresponding to each key value in the corresponding target partition, the method further comprises:
when a data query request is received, obtaining the key value of the data to be queried according to the data query request;
determining, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located, and scanning all files in the partition to be queried according to the key value of the data to be queried, to obtain the data to be queried that corresponds to that key value.
5. The method according to claim 4, characterized in that determining, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located comprises:
calculating the hash value corresponding to the key value of the data to be queried;
determining, according to the hash value corresponding to the key value of the data to be queried, the partition to be queried in which the data to be queried is located.
6. The method according to claim 1, characterized in that determining that the node meets the first trigger condition comprises:
judging whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determining that the node meets the first trigger condition.
7. The method according to claim 1, characterized in that determining that the node meets the first trigger condition comprises:
judging whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determining, according to the size of each first file, whether the first files meet a first merging condition;
if they do, determining that the node meets the first trigger condition.
8. The method according to claim 3, characterized in that determining that the target partition meets the second trigger condition comprises:
judging whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determining that the target partition meets the second trigger condition.
9. The method according to claim 3, characterized in that determining that the target partition meets the second trigger condition comprises:
judging whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determining, according to the size of each second file, whether the second files meet a second merging condition;
if they do, determining that the target partition meets the second trigger condition.
10. The method according to claim 1, characterized in that partitioning, according to the key value information of the user data, the interval that each node in the cluster system is responsible for comprises:
calculating, according to the key value information of the user data, the hash value corresponding to the key value information of the user data;
partitioning, according to the hash value corresponding to the key value information of the user data, the interval that each node in the cluster system is responsible for;
wherein partitioning the interval that each node in the cluster system is responsible for comprises:
if the interval that the cluster system is responsible for is (min, min+2^127], splitting that interval into 2^N parts to obtain the partition step S, S = 2^127/2^N, wherein min denotes the minimum hash value of the interval that the cluster system is responsible for, min+2^127 denotes the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
for each node, if the interval that the node is responsible for is (r1, r2], partitioning the node according to the partition step S to obtain |r1 - r2|/S intervals, wherein every two sequentially adjacent boundary values form a left-open, right-closed interval (r_n, r_(n+1)], 0 < n < |r1 - r2|/S - 1, n is a positive integer, and each interval is one partition of the node; r1 denotes the minimum hash value of the interval that the node is responsible for, r2 denotes the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
11. A file merging device, characterized by comprising:
a partitioning module, configured to partition, according to key value information of user data, the interval that each node in a cluster system is responsible for;
a file reading module, configured to, for each node, determine that the node meets a first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node, wherein none of the first files has been merged, each first file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
a partition determining module, configured to determine, according to the key values of the data corresponding to the users in the first files, the target partition to which each key value belongs;
a file merging module, configured to merge, according to the key values, the data having the same key value in the first files, and to store the merged data corresponding to each key value in the corresponding target partition;
wherein the key values of the data corresponding to the same user are the same, and the key values of the data corresponding to different users are different.
12. The device according to claim 11, characterized in that the partition determining module is specifically configured to:
calculate the hash value corresponding to each key value according to the key values of the data corresponding to the users;
determine, according to the hash value corresponding to each key value, the target partition to which that key value belongs.
13. The device according to claim 11, characterized in that the file reading module is further configured to:
determine that the target partition meets a second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache, wherein each second file stores data corresponding to at least one user, and the key values of the data corresponding to different users are different;
the file merging module is further configured to merge the data having the same key value in the second files according to the key values of the data corresponding to the users, and to store the merged data corresponding to each key value in a third file of the target partition.
14. The device according to any one of claims 11-13, characterized by further comprising: a receiving module, a key value acquiring module and a query module;
the receiving module is configured to receive a data query request;
the key value acquiring module is configured to obtain, when the receiving module receives the data query request, the key value of the data to be queried according to the data query request;
the partition determining module is further configured to determine, according to the key value of the data to be queried, the partition to be queried in which the data to be queried is located;
the query module is configured to scan all files in the partition to be queried according to the key value of the data to be queried, to obtain the data to be queried that corresponds to that key value.
15. The device according to claim 14, characterized in that the partition determining module is specifically configured to:
calculate the hash value corresponding to the key value of the data to be queried;
determine, according to the hash value corresponding to the key value of the data to be queried, the partition to be queried in which the data to be queried is located.
16. The device according to claim 11, characterized in that the file reading module is specifically configured to:
judge whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
17. The device according to claim 11, characterized in that the file reading module is specifically configured to:
judge whether the number of first files stored on the node reaches a preset first file merging number M1, wherein M1 is a positive integer greater than or equal to 2;
if so, determine, according to the size of each first file, whether the first files meet a first merging condition;
if they do, determine that the node meets the first trigger condition, and read at least two first files from the disk of the node into the cache corresponding to the node.
18. The device according to claim 13, characterized in that the file reading module is specifically configured to:
judge whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
19. The device according to claim 13, characterized in that the file reading module is specifically configured to:
judge whether the number of second files stored in the target partition reaches a preset second file merging number M2, wherein M2 is a positive integer greater than or equal to 2;
if so, determine, according to the size of each second file, whether the second files meet a second merging condition;
if they do, determine that the target partition meets the second trigger condition, and read at least two second files in the target partition from the disk of the node where the target partition is located into the cache.
20. The device according to claim 11, characterized in that the partitioning module is specifically configured to:
calculate, according to the key value information of the user data, the hash value corresponding to the key value information of the user data;
partition, according to the hash value corresponding to the key value information of the user data, the interval that each node in the cluster system is responsible for;
if the interval that the cluster system is responsible for is (min, min+2^127], split that interval into 2^N parts to obtain the partition step S, S = 2^127/2^N, wherein min denotes the minimum hash value of the interval that the cluster system is responsible for, min+2^127 denotes the maximum hash value of that interval, min is an integer greater than or equal to 0, N is the partition factor, and N is a positive integer greater than or equal to 1;
for each node, if the interval that the node is responsible for is (r1, r2], partition the node according to the partition step S to obtain |r1 - r2|/S intervals, wherein every two sequentially adjacent boundary values form a left-open, right-closed interval (r_n, r_(n+1)], 0 < n < |r1 - r2|/S - 1, n is a positive integer, and each interval is one partition of the node; r1 denotes the minimum hash value of the interval that the node is responsible for, r2 denotes the maximum hash value of that interval, r1 and r2 are both integers greater than or equal to 0, r2 is greater than r1, and |r1 - r2| denotes the absolute value of the difference between r1 and r2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310561317.8A CN103593436B (en) | 2013-11-12 | 2013-11-12 | file merging method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310561317.8A CN103593436B (en) | 2013-11-12 | 2013-11-12 | file merging method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103593436A CN103593436A (en) | 2014-02-19 |
CN103593436B true CN103593436B (en) | 2017-02-08 |
Family
ID=50083577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310561317.8A Active CN103593436B (en) | 2013-11-12 | 2013-11-12 | file merging method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103593436B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046638A (en) * | 2018-12-29 | 2019-07-23 | 阿里巴巴集团控股有限公司 | Fusion method, device and the equipment of multi-platform data |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104360824B (en) * | 2014-11-10 | 2017-12-12 | 北京奇虎科技有限公司 | The method and apparatus that a kind of data merge |
CN106326309B (en) * | 2015-07-03 | 2020-02-21 | 阿里巴巴集团控股有限公司 | Data query method and device |
CN105159915B (en) * | 2015-07-16 | 2018-07-10 | 中国科学院计算技术研究所 | The LSM trees merging method and system of dynamic adaptable |
CN106446039B (en) * | 2016-08-30 | 2020-07-21 | 北京航空航天大学 | Aggregation type big data query method and device |
CN107861959A (en) * | 2016-09-22 | 2018-03-30 | 阿里巴巴集团控股有限公司 | Data processing method, apparatus and system |
CN108021562B (en) * | 2016-10-31 | 2022-11-18 | 中兴通讯股份有限公司 | Disk storage method and device applied to distributed file system and distributed file system |
CN106776811A (en) * | 2016-11-23 | 2017-05-31 | 李天� | data index method and device |
CN106708968B (en) * | 2016-12-01 | 2019-11-26 | 成都华为技术有限公司 | Data processing method in distributed data base system and distributed data base system |
CN106599247B (en) * | 2016-12-19 | 2020-04-17 | 北京奇虎科技有限公司 | Method and device for merging data files in LSM-tree structure |
CN106777230B (en) * | 2016-12-26 | 2020-01-07 | 东软集团股份有限公司 | Partition system, partition method and device |
CN108628542B (en) * | 2017-03-22 | 2021-08-03 | 华为技术有限公司 | File merging method and controller |
CN107391541B (en) * | 2017-05-16 | 2020-10-20 | 创新先进技术有限公司 | Real-time data merging method and device |
CN107357921A (en) * | 2017-07-21 | 2017-11-17 | 北京奇艺世纪科技有限公司 | A kind of small documents storage localization method and system |
CN110019092B (en) * | 2017-12-27 | 2021-07-09 | 华为技术有限公司 | Data storage method, controller and system |
CN108563698B (en) | 2018-03-22 | 2021-11-23 | 中国银联股份有限公司 | Region merging method and device for HBase table |
CN110399545B (en) * | 2018-04-20 | 2023-06-02 | 伊姆西Ip控股有限责任公司 | Method and apparatus for managing document index |
CN110825794B (en) * | 2018-08-14 | 2022-03-29 | 华为云计算技术有限公司 | Partition merging method and database server |
WO2020034818A1 (en) * | 2018-08-14 | 2020-02-20 | 华为技术有限公司 | Partition merging method and database server |
CN110321349B (en) * | 2019-06-13 | 2021-11-12 | 暨南大学 | Self-adaptive data merging and storing method for data origin system |
CN110888837B (en) * | 2019-11-15 | 2021-01-22 | 星辰天合(北京)数据科技有限公司 | Object storage small file merging method and device |
CN113342813B (en) * | 2021-06-09 | 2024-01-26 | 南京冰鉴信息科技有限公司 | Key value data processing method, device, computer equipment and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101605028A (en) * | 2009-02-17 | 2009-12-16 | 北京安天电子设备有限公司 | A kind of combining log records method and system |
CN102905311A (en) * | 2012-09-29 | 2013-01-30 | 北京傲天动联技术有限公司 | Data-message aggregating device and method |
CN102968503A (en) * | 2012-12-10 | 2013-03-13 | 曙光信息产业(北京)有限公司 | Data processing method for database system, and database system |
Non-Patent Citations (2)
Title |
---|
《云计算实战》;张德丰;《清华大学出版社》;20120731;全文 * |
《分布式存储系统中一致性哈希算法的研究》;杨彧剑等;《Computer Knowledge and Technology 电脑知识与技术》;20110831;第7卷(第22期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN103593436A (en) | 2014-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103593436B (en) | file merging method and device | |
CN108600321A (en) | A kind of diagram data storage method and system based on distributed memory cloud | |
CN106777351B (en) | Computing system and its method are stored based on ART tree distributed system figure | |
CN103488709B (en) | A kind of index establishing method and system, search method and system | |
CN110268394A (en) | KVS tree | |
CN102024045B (en) | Information classification processing method, device and terminal | |
CN110291518A (en) | Merge tree garbage metrics | |
US20090094416A1 (en) | System and method for caching posting lists | |
CN110268399A (en) | Merge tree modifications for maintenance operations | |
CN110162528A (en) | Massive big data retrieval method and system | |
CN104850572A (en) | HBase non-primary key index building and inquiring method and system | |
CN101996250A (en) | Hadoop-based mass stream data storage and query method and system | |
CN106155934B (en) | Caching method based on duplicate data in a cloud environment | |
CN109815234A (en) | Multiple cuckoo filter under a streaming computing model | |
CN101226542B (en) | Method for caching report | |
CN103176754A (en) | Reading and storing method for massive amounts of small files | |
CN102857560A (en) | Multi-service application orientated cloud storage data distribution method | |
CN106407224A (en) | Method and device for file compaction in KV (Key-Value)-Store system | |
CN109767274B (en) | Method and system for carrying out associated storage on massive invoice data | |
Jaiyeoba et al. | Graphtinker: A high performance data structure for dynamic graph processing | |
CN103942301B (en) | Distributed file system oriented to access and application of multiple data types | |
CN108021333A (en) | System, device and method for random reading and writing of data | |
CN106326012A (en) | Web application cluster buffer utilization method and system | |
CN110245129A (en) | Distributed global data deduplication method and device | |
CN105279166B (en) | File management method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |