CN103778255A

CN103778255A - Distributed file system and data distribution method thereof

Info

Publication number: CN103778255A
Application number: CN201410064361.2A
Authority: CN
Inventors: 张勤
Original assignee: SHENZHEN ZHONGBO KECHUANG INFORMATION TECHNOLOGY Co Ltd
Current assignee: SHENZHEN ZHONGBO KECHUANG INFORMATION TECHNOLOGY Co Ltd
Priority date: 2014-02-25
Filing date: 2014-02-25
Publication date: 2014-05-07

Abstract

The invention discloses a distributed file system and a data distribution method thereof. The data distribution method includes the following steps: dividing the distributed file system into a plurality of storage layers, wherein each storage layer comprises a plurality of storage devices; setting composition information for the storage layers and their storage devices, together with replica placement strategy information; and distributing object replicas to the storage devices in the storage layers according to the composition information and the replica placement strategy information. The method is a pseudo-random data distribution method: object replicas can be distributed efficiently and robustly in a structured, hierarchical storage cluster, data distribution can be optimized to make full use of available resources, and data security is protected as far as possible when coincident or correlated failures occur. The system and method have two advantages: data distribution is fully decentralized, so that any party in a large-scale system can independently compute where an object is stored; and the small amount of metadata required is essentially static, changing only when devices are added or removed.

Description

Distributed file system and data distribution method thereof

Technical field

The present invention relates to data management in file systems, and in particular to a distributed file system and a data distribution method thereof.

Background art

In an era of rapid development of information technology, the massive growth of data has brought distributed file systems into a golden period of development; from data sharing to Internet applications, they are now used in every industry. Most distributed, cluster, and parallel file systems separate metadata from data, that is, the control flow from the data flow, in order to obtain better scalability and I/O concurrency. This shows how important and necessary metadata is.

However, as data volumes grow sharply, file systems will face data at the petabyte scale (1 PB = 1,000,000 GB) and upwards of ten million storage nodes. Whether a file system uses a centralized metadata service, a distributed metadata service, or no metadata service at all, it will face great challenges. In the field of big-data storage, the way data is distributed is often the key to file system performance.

Most systems simply write new data to underutilized devices. The main problem with this approach is that data is rarely moved once it has been written. Even a very good distribution becomes unbalanced when the system is expanded, because new disks are either empty or hold only a little data. Either the old or the new disks may be extremely busy, depending on the system workload, but only rarely are the available resources fully used. A better solution is to distribute data randomly across the storage devices available to the system. This yields a probabilistically balanced distribution in which data on new and old devices is uniformly mixed. When new devices are added, a random sample of the existing data migrates to them to restore balance. This approach has the important advantage that the load on all devices is similar even under a very heavy workload, so good performance is maintained. In addition, in a large storage system a single large file is distributed randomly across many available devices, which provides high concurrency and aggregate bandwidth. However, simple hash-based distribution strategies cannot adapt to changes in the set of storage devices: a change causes a massive migration and reshuffling of data. Moreover, existing randomized distribution strategies spread the replicas of each disk across many other devices, so data is likely to be lost when several devices fail at the same time.
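The reshuffling problem of simple hash-based placement can be illustrated with a minimal sketch. It assumes the naive rule "device index = hash(key) mod number of devices"; the function names, the key format, and the choice of SHA-256 are illustrative assumptions and do not come from the patent.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def modulo_placement(key: str, num_devices: int) -> int:
    """Naive placement (assumed for illustration): hash(key) mod device count."""
    return stable_hash(key) % num_devices

def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose device changes when the cluster grows from old_n to new_n."""
    moved = sum(
        1 for k in keys
        if modulo_placement(k, old_n) != modulo_placement(k, new_n)
    )
    return moved / len(keys)

keys = [f"object-{i}" for i in range(10_000)]   # hypothetical object names
fraction = moved_fraction(keys, 10, 11)
print(f"moved when growing from 10 to 11 devices: {fraction:.1%}")
```

With modulo placement, most keys change device when an eleventh device is added, whereas a balanced scheme would only need to move roughly 1/11 of the data onto the new device.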

Summary of the invention

The technical problem to be solved by the present invention is to provide a new distributed file system and a data distribution method thereof.

The technical solution of the present invention is as follows: a data distribution method for a distributed file system, comprising the following steps: dividing the distributed file system into a number of storage layers, each storage layer comprising a number of storage devices; setting the composition information of each storage layer and its storage devices, together with replica placement strategy information; and distributing object replicas to the storage devices in each storage layer according to the composition information and the replica placement strategy information.

Preferably, in the data distribution method, the following step is also carried out before object replicas are distributed: selecting or modifying at least one replica placement strategy in the replica placement strategy information.

Preferably, in the data distribution method, setting the composition information of each storage layer and its storage devices further comprises: setting the composition information as a hierarchical cluster map of the available resources, and establishing in the hierarchical cluster map the logical element connections of the storage devices. Setting the replica placement strategy information further comprises: setting the replica placement strategy information to include data placement rules, which comprise selecting a number of target devices on which to store replicas and the constraints on the replicas; the constraints at least comprise selecting a storage layer.

Preferably, in the data distribution method, the data placement rule further comprises an input step and a generation step. The input step receives an integer X; the generation step generates a list of candidate replica storage locations from the integer X; and the input step is carried out when object replicas are distributed.

Preferably, in the data distribution method, the data placement rules are defined by the method and specify how data is distributed; each data placement rule comprises a sequence of operations applied to the corresponding storage layer.

Preferably, in the data distribution method, setting the composition information of each storage layer and its storage devices further comprises: for each storage device in a storage layer, setting its weight according to the capacity and performance of that storage device. Setting the replica placement strategy information further comprises: distributing data objects to the storage devices in a balanced manner according to their weights.

Preferably, the data distribution method further comprises the following step: the distributed file system uses object pools to manage the storage layers and their storage devices. All data objects in an object pool share the same object information, which comprises the number of replicas and the distribution rule, and the object information is cached in each client.

Preferably, in the data distribution method, each storage device is assigned a number of placement groups. A hash function maps the attribute key of a data object to a placement group, and one placement group holds multiple data objects. A consistent hash function maps the placement group number to the storage devices that actually store the data object, so that the replica location information is generated from the placement group number and the number of replicas. The first replica is the primary replica, and the others are secondary replicas.

Preferably, in the data distribution method, the replicas of a data object are placed on storage devices that differ in power supply system, controller, or physical location.

Another technical solution of the present invention is as follows: a distributed file system comprising an object pool and a number of storage layers. Each storage layer comprises a number of storage devices. The object pool is provided with a control unit, a storage unit, and a connection unit. The control unit is connected to the storage unit and is configured to set the composition information of each storage layer and its storage devices and the replica placement strategy information, which are stored in the storage unit. The control unit is also connected to the connection unit and distributes object replicas to the storage devices in each storage layer according to the composition information and the replica placement strategy information.

With the above solution, the present invention uses replica placement strategy information to distribute object replicas to the storage devices in each storage layer, so that data distribution can be optimized to make full use of available resources. The invention therefore has high market value.

Brief description of the drawings

Fig. 1 is a schematic diagram of one embodiment of the present invention.

Detailed description

To facilitate understanding of the present invention, it is described in more detail below with reference to the drawings and specific embodiments. The drawings show preferred embodiments of the invention. The invention can, however, be implemented in many different forms and is not limited to the embodiments described in this specification. Rather, these embodiments are provided so that the disclosure of the invention will be understood more thoroughly and completely.

It should be noted that when an element is said to be "fixed to" another element, it may be directly on the other element or an intervening element may be present. When an element is said to be "connected to" another element, it may be directly connected to the other element or intervening elements may be present. The terms "vertical", "horizontal", "left", "right", and similar expressions used in this specification are for illustrative purposes only.

Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by those skilled in the art to which the invention belongs. The terms used in this specification are intended only to describe specific embodiments and are not intended to limit the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.

As shown in Fig. 1, one embodiment of the present invention is a data distribution method for a distributed file system, comprising the following steps: dividing the distributed file system into a number of storage layers, each storage layer comprising a number of storage devices; setting the composition information of each storage layer and its storage devices, together with replica placement strategy information; and distributing object replicas to the storage devices in each storage layer according to the composition information and the replica placement strategy information. Preferably, object replicas are distributed to storage devices in different power-supply environments within the same storage layer; as another example, object replicas are distributed to storage devices in different power-supply environments in different storage layers. Preferably, the load of each storage layer is polled before object replicas are distributed; for example, the load of each storage layer is evaluated starting from the highest storage layer. The distributed file system of the embodiments of the invention, that is, the file management system, uses a pseudo-random data distribution method that distributes object replicas efficiently and robustly in a structured, hierarchical storage cluster. Unlike conventional methods, the data distribution method of the present application does not rely on any form of per-file or per-object directory. It needs only a compact hierarchical description of the storage devices that make up the cluster, together with the replica placement strategy information. Data distribution is therefore fully decentralized: any party in a large-scale system can independently compute the storage location of any object. Furthermore, the small amount of metadata required is essentially static and changes only when devices are added or removed.

Preferably, the metadata stores the composition information of each storage layer and its storage devices, together with the replica placement strategy information. In this way, the small amount of metadata required is essentially static and changes only when devices are added or removed. Preferably, when a device is added or removed, the composition information of each storage layer and its storage devices and the replica placement strategy information are regenerated or reset. Preferably, in the data distribution method, the following step is also carried out before object replicas are distributed: selecting or modifying at least one replica placement strategy in the replica placement strategy information. In this way, the user can decide, according to actual usage, which replica placement strategy to adopt, or can modify one or more of the replica placement strategies, thereby obtaining a more flexible replica placement scheme.

Preferably, in the data distribution method, the data placement rule further comprises an input step and a generation step. The input step receives an integer X; the generation step generates a list of candidate replica storage locations from the integer X; and the input step is carried out when object replicas are distributed. In this way, when object replicas are distributed, the user only needs to enter an integer as X, and the generation step automatically produces the list of candidate replica storage locations. For example, X represents the number of replicas to be distributed; as another example, X represents the importance of the data object.

Preferably, in the data distribution method, the data placement rules are defined by the method and specify how data is distributed; each data placement rule comprises a sequence of operations applied to the corresponding storage layer. To accommodate different usage scenarios, data replication strategies, underlying hardware configurations, and so on, the method defines placement rules that allow the administrator or the storage system to specify how data is distributed. For example, one user might select a pair of target devices for 2-way mirroring, another might select three devices in two different data centers for 3-way replication, and another might select six devices for RAID-4.

Each rule consists of a sequence of operations applied to the hierarchy in an execution environment, as shown in the pseudocode of Method 1. The integer input X to the method is typically an object name or another identifier, such as the identifier of a group of objects whose replicas are placed on the same devices. The take(a) operation selects an item (normally a bucket) in the storage hierarchy and assigns it to the vector variable $\vec{i}$, which serves as the input to subsequent operations. The select(n, t) operation iterates over each element $i \in \vec{i}$ and chooses n distinct items of type t in the hierarchy below that element. Storage devices have a known, fixed type, and each bucket has a type field used to distinguish between classes of buckets, for example those representing "rows" and those representing "cabinets". For each $i \in \vec{i}$, the select(n, t) operation iterates over the requested items $r \in 1, \ldots, n$ and descends recursively through any intermediate buckets, pseudo-randomly selecting a nested item in each bucket using the function c(r, x), until it finds an item of the required type t. The resulting distinct items are placed back into the input $\vec{i}$, where they either serve as the input to the next operation or are moved into the result container by an emit operation.

For example, the rule set out in the table distributes three replicas across three different cabinets in the same row.

According to the rule defined in the table, the method starts at the root of the hierarchy and uses select(1, row) to choose a single bucket of type row; suppose it selects row2. The next operation, select(3, cabinet), chooses three distinct cabinets under the previously selected row2, namely cab21, cab23, and cab24. The final select(1, disk) iterates over these three cabinets and chooses one disk under each of them. The final result is three disks spread over three different cabinets, all in the same row. This approach allows replicas to be both separated across and constrained within container types (rows, cabinets, and shelves), which is a useful property for both reliability and performance. Rules that contain multiple take and emit blocks allow storage targets to be drawn from different storage pools, as may be needed for remote replication semantics (for example, one replica at a remote site) or for tiered installations (for example, fast near-line storage combined with slower, high-capacity arrays).
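The walk-through above can be sketched as follows. This is a minimal illustration and not the patent's Method 1: the `Node` class, the SHA-256 based hash, the modulo choice inside each bucket (standing in for the function c(r, x)), the retry bound, and the 3 rows x 4 cabinets x 4 disks hierarchy are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass(eq=False)                 # eq=False: nodes compare by identity
class Node:
    name: str
    type: str                        # "root", "row", "cabinet" or "disk"
    children: list = field(default_factory=list)

def h(*parts) -> int:
    """Deterministic multi-input integer hash (assumed stand-in for c(r, x))."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def descend(node, x, r, wanted_type):
    """Walk down from node, picking one child per level, until wanted_type is reached."""
    while node.type != wanted_type:
        if not node.children:
            return None
        node = node.children[h(x, r, node.name) % len(node.children)]
    return node

def select(working, n, wanted_type, x):
    """For each item in the working vector, choose n distinct descendants of wanted_type."""
    out = []
    for item in working:
        chosen, r = [], 0
        while len(chosen) < n and r < 50 * n:    # bounded retries on collisions
            r += 1
            cand = descend(item, x, r, wanted_type)
            if cand is not None and cand not in chosen:
                chosen.append(cand)
        out.extend(chosen)
    return out

def build():
    """Hypothetical hierarchy: 3 rows, 4 cabinets per row, 4 disks per cabinet."""
    root = Node("root", "root")
    for i in range(1, 4):
        row = Node(f"row{i}", "row")
        for j in range(1, 5):
            cab = Node(f"cab{i}{j}", "cabinet")
            cab.children = [Node(f"disk{i}{j}{k}", "disk") for k in range(1, 5)]
            row.children.append(cab)
        root.children.append(row)
    return root

def place(root, x):
    working = [root]                              # take(root)
    working = select(working, 1, "row", x)        # select(1, row)
    working = select(working, 3, "cabinet", x)    # select(3, cabinet)
    working = select(working, 1, "disk", x)       # select(1, disk)
    return [d.name for d in working]              # emit

root = build()
replicas = place(root, 12345)
print(replicas)
```

Because every choice is a pure function of x, r, and the hierarchy, any client holding the same hierarchy computes the same three disks for the same x without consulting a directory.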

The select(n, t) operation may traverse many levels of the storage hierarchy in order to locate n distinct items of the given type t beneath its starting point. This recursive process is partially parameterized by $r \in 1, \ldots, n$, the number of the replica being chosen. During this process, the method may reject an item and reselect using a modified input r' for three reasons: the item has already been chosen in the current set (a collision, since the results of select(n, t) must be distinct), the device has failed, or the device is overloaded. Failed or overloaded devices are marked as such in the cluster map but are left in the hierarchy in order to avoid unnecessary data migration. The method selectively diverts part of the data of an overloaded device by pseudo-randomly rejecting it with the probability specified in the cluster map, which is typically related to its reported over-utilization. For failed or overloaded devices, the method redistributes items evenly across the storage cluster by restarting the recursion at the beginning of select(n, t). In the case of a collision, an alternate r' is first tried at the inner levels of the recursion, so that the search stays local and the overall data distribution is not skewed away from subtrees in which collisions are more likely.
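The rejection logic can be sketched on a flat device list. This is a simplified illustration under stated assumptions: it does not model the hierarchy, so it does not distinguish between a local retry (collision) and a restart from the top of select(n, t) (failure or overload). The names `choose_n` and `is_rejected`, the formula for r', and the use of SHA-256 are hypothetical.

```python
import hashlib

def h(*parts) -> int:
    """Deterministic multi-input integer hash (illustrative choice)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def is_rejected(dev, x, failed, overload_prob) -> bool:
    """True if dev must be skipped for input x."""
    if dev in failed:                      # failed devices are always rejected
        return True
    p = overload_prob.get(dev, 0.0)        # share of data to divert away from dev
    draw = (h("reject", x, dev) % 10_000) / 10_000   # deterministic value in [0, 1)
    return draw < p

def choose_n(devices, n, x, failed=frozenset(), overload_prob=None, max_tries=100):
    """Choose n distinct, acceptable devices for input x."""
    overload_prob = overload_prob or {}
    chosen = []
    for rep in range(n):
        for attempt in range(max_tries):
            r_prime = rep + attempt * n    # modified input r' after each rejection
            dev = devices[h(x, r_prime) % len(devices)]
            if dev in chosen or is_rejected(dev, x, failed, overload_prob):
                continue                   # collision, failure or overload: retry
            chosen.append(dev)
            break
    return chosen

devices = [f"osd{i}" for i in range(8)]    # hypothetical device names
healthy = choose_n(devices, 3, 42)
degraded = choose_n(devices, 3, 42, failed={"osd2"})
print(healthy, degraded)
```

In this sketch the failed device stays in the list, and inputs whose placement never involved it are mapped exactly as before, so only the data that was on the failed device moves.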

Preferably, the data distribution method further comprises the following step: the distributed file system uses object pools to manage the storage layers and their storage devices. All data objects in an object pool share the same object information, which comprises the number of replicas and the distribution rule, and the object information is cached in each client. The objects of the file management system are managed through object pools (pools). All objects in a pool have the same number of replicas, the same distribution rule, and so on. This information is cached in the client; alternatively, it is cached at the server, and the client connects to the server to obtain it when storing data.

Preferably, the user must specify the name of the pool when accessing an object. Preferably, in the data distribution method, each storage device is assigned a number of placement groups. A hash function maps the attribute key of a data object to a placement group, and one placement group holds multiple data objects. A consistent hash function maps the placement group number to the storage devices that actually store the data object, so that the replica location information is generated from the placement group number and the number of replicas. The first replica is the primary replica, and the others are secondary replicas. Two levels of mapping lead from the key (attribute key) of an object to the server that finally stores the data (a storage node is called an OSD). In the first level, a hash function maps the key to a placement group (PG). A PG is similar to the concept of a virtual partition in other systems: one PG holds multiple objects, and each storage node holds hundreds of PGs. In the second level, a consistent hash function maps the PG ID to the hosts that actually store the data; for a given PG ID and number of replicas, the replica location information can be generated. The first replica is the primary and the others are secondaries. The primary replica receives writes from clients, generates a log, and synchronizes it to the secondary replicas. If several clients write concurrently, the primary replica also acts as coordinator and decides the order of the concurrent writes. When a small number of machines go down, the locations of the PG replicas change very little, as is the case with consistent hashing in general. At the same time, the other replicas of the missing data are scattered across the whole cluster, which ensures that the network bandwidth of the whole cluster can be used when the missing replicas are restored.
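The two-level mapping can be sketched as follows. The patent does not specify the hash functions, so this example makes explicit assumptions: SHA-256 with a modulo for the key-to-PG step, and rendezvous (highest-random-weight) hashing as a stand-in for the consistent hash function of the second step. The function names, pool name, and OSD names are hypothetical.

```python
import hashlib

def h(*parts) -> int:
    """Deterministic multi-input integer hash (illustrative choice)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def object_to_pg(pool: str, key: str, pg_num: int) -> int:
    """Level 1: hash the object key to a placement group (PG) ID."""
    return h(pool, key) % pg_num

def pg_to_osds(pool: str, pg_id: int, osds, replicas: int):
    """Level 2: map a PG ID to an ordered list of OSDs.

    Rendezvous hashing is used here as an assumed stand-in for the
    consistent hash function: each OSD gets a score for this PG and the
    highest-scoring OSDs win.
    """
    ranked = sorted(osds, key=lambda osd: h(pool, pg_id, osd), reverse=True)
    return ranked[:replicas]

def locate(pool: str, key: str, pg_num: int, osds, replicas: int):
    """Return (pg_id, primary, secondaries) for an object key."""
    pg_id = object_to_pg(pool, key, pg_num)
    acting = pg_to_osds(pool, pg_id, osds, replicas)
    return pg_id, acting[0], acting[1:]    # first replica is the primary

osds = [f"osd{i}" for i in range(10)]
pg_id, primary, secondaries = locate("pool-a", "photo.jpg", 128, osds, 3)
print(pg_id, primary, secondaries)
```

With this choice of second-level hash, removing an OSD that holds no replica of a PG leaves that PG's replica locations unchanged, which matches the behaviour described above for a small number of failed machines.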

The method of the file management system distributes data objects to storage devices according to the weight of each device, approximating a probabilistically balanced distribution. Data allocation is controlled by a hierarchical cluster map that represents the available resources and the logical elements from which they are built. For example, the map may describe several rows of cabinets, where each cabinet holds many machines and each machine holds many disks. The data allocation strategy is defined by placement rules, which specify how many target devices are selected from the cluster and what constraints apply to the replicas. For example, a rule may specify that three mirrored replicas are placed on physical devices in different cabinets, so that the devices do not share the same power supply. Given an integer input value X, the method outputs an ordered list R of n distinct storage targets. The method uses a strong multi-input integer hash function whose inputs include X, which makes the mapping fully deterministic and independently computable using only the cluster map, the placement rules, and X. The distribution is pseudo-random, in that there is no apparent correlation between the outputs obtained for similar inputs or between the items stored on any device. The method produces a declustered distribution of replicas, in that the set of devices sharing the replicas of one item appears to be independent of all other items.
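A weighted, deterministic choice of this kind can be sketched as follows. The patent does not give the selection function, so the example assumes a "hash race" in which each device draws ln(u) / weight from a hash of (X, r, device) and the largest value wins; this makes the chance of being chosen proportional to weight. The function name, weights, and device names are hypothetical.

```python
import hashlib
import math
from collections import Counter

def h(*parts) -> int:
    """Deterministic multi-input integer hash (illustrative choice)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def weighted_pick(x, r, weights):
    """Pick one device for input x and replica number r, proportionally to weight."""
    best, best_score = None, -math.inf
    for dev, w in weights.items():
        if w <= 0:
            continue                           # zero weight: never selected
        u = (h(x, r, dev) + 1) / 2**64         # pseudo-random value in (0, 1]
        score = math.log(u) / w                # larger weight => score closer to 0
        if score > best_score:
            best, best_score = dev, score
    return best

# Hypothetical weights, e.g. proportional to capacity.
weights = {"disk-a": 1.0, "disk-b": 1.0, "disk-c": 2.0}
counts = Counter(weighted_pick(x, 0, weights) for x in range(20_000))
share_c = counts["disk-c"] / 20_000
print(counts)
```

The mapping depends only on X, r, and the weight table, so it can be recomputed anywhere, and the device with twice the weight receives about half of the objects in this three-device example.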

The file management system is designed to distribute data evenly across weighted devices, so that storage utilization and device bandwidth are balanced in a probabilistic sense. In a hierarchically organized set of storage devices, the placement of replicas has a very important effect on data security. By reflecting the underlying physical organization of the devices, the cluster method can model the potential sources of correlated device failures. Typical sources include physical proximity, a shared power supply, and a shared network. By encoding this information in the cluster map, the placement rules of the method can spread object replicas across different failure domains while still maintaining the desired data distribution. Preferably, in the data distribution method, the replicas of a data object are placed on storage devices that differ in power supply system, controller, or physical location. For example, to guard against coincident failures, it is desirable to place the data replicas in shelves (cabinets or storage layers) that differ in power supply system, controller, or physical location.

The data distribution method used in the embodiments of the present invention is designed to optimize data distribution so that available resources are fully used, to reorganize data efficiently when devices are added or removed, to enforce flexible constraints on object replica placement, and to maximize data security when coincident failures occur. It supports a variety of data safety mechanisms, including n-way replication, RAID schemes, other forms of erasure coding, and hybrid approaches such as RAID-10.

In combination with any of the methods described above, another embodiment of the present invention is a distributed file system comprising an object pool and a number of storage layers. Each storage layer comprises a number of storage devices. The object pool is provided with a control unit, a storage unit, and a connection unit. The control unit is connected to the storage unit and is configured to set the composition information of each storage layer and its storage devices and the replica placement strategy information, which are stored in the storage unit. The control unit is also connected to the connection unit and distributes object replicas to the storage devices in each storage layer according to the composition information and the replica placement strategy information.

Preferably, the distributed file system, that is, the storage system, performs layering automatically for different storage media according to preset layering rules. When a new storage medium joins the storage system, the system automatically adds it to an existing storage layer or creates a new storage layer for it, and then migrates part of the data of the existing storage layer to it according to the load of that layer and a preset load-balancing condition. Preferably, the method also migrates data objects between storage layers according to their access frequency. Data migration means moving data: data is extracted from a file, partition, hard disk, or disk subsystem and placed on another storage medium, that is, at another physical location. For example, when the access frequency of data reaches a preset threshold, the data can be migrated automatically to a storage medium in a higher or lower layer. If migrations are too frequent or the volume of migrated data is too large, they place a heavy load on the storage system and seriously affect its performance. Preferably, therefore, data migration operations have a lower priority than data access operations. Each data object can be given a data retention period when it first enters storage or has just been migrated to another storage medium. After the retention period ends, the system decides, based on the access frequency of the data and the level of the data, whether to migrate it to another storage medium and, if so, to which layer.
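The migration decision described above can be sketched as a small function. The patent gives no concrete thresholds or data structures, so everything here is an assumption made for illustration: the `ObjectStats` fields, the tier numbering (0 is the fastest layer), moving at most one layer per decision, and the threshold values in the usage lines.

```python
from dataclasses import dataclass

@dataclass
class ObjectStats:
    name: str
    tier: int                  # 0 = fastest layer; larger numbers = slower layers
    accesses_per_day: float
    days_in_tier: int          # time since the object entered its current layer

def next_tier(obj: ObjectStats, promote_at: float, demote_at: float,
              retention_days: int, lowest_tier: int) -> int:
    """Return the layer the object should occupy after this evaluation."""
    if obj.days_in_tier < retention_days:
        return obj.tier                        # still inside the retention period
    if obj.accesses_per_day >= promote_at and obj.tier > 0:
        return obj.tier - 1                    # hot data moves one layer up
    if obj.accesses_per_day <= demote_at and obj.tier < lowest_tier:
        return obj.tier + 1                    # cold data moves one layer down
    return obj.tier

# Hypothetical thresholds: promote at >= 100 accesses/day, demote at <= 1.
hot = ObjectStats("report.db", tier=2, accesses_per_day=250.0, days_in_tier=30)
cold = ObjectStats("archive.tar", tier=0, accesses_per_day=0.2, days_in_tier=30)
fresh = ObjectStats("upload.bin", tier=1, accesses_per_day=500.0, days_in_tier=1)
for obj in (hot, cold, fresh):
    print(obj.name, obj.tier, "->", next_tier(obj, 100.0, 1.0, 7, 2))
```

A scheduler built on such a function would run the resulting moves at lower priority than client reads and writes, in line with the preference stated above.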

As another example, the distributed file system classifies storage media: storage media of different performance are classified and placed in different storage layers, where they are used to store data. As storage demand keeps spiraling upward, it is clearly impractical to keep all data on high-performance storage media; tiered storage keeps the critical data on high-performance media. The ultimate purpose of tiered storage is to save money: according to its access frequency, data is placed on different storage media and in different storage layers at different times, which avoids wasting hardware capacity and performance. Distributing data across multiple layers at the same time also avoids the conflicts that may arise when users and applications access storage, and so avoids degrading the performance of the storage system.

Further, the embodiments of the present invention also include distributed file systems and data distribution methods formed by combining the technical features of the above embodiments with one another.

In summary, the distributed file system and its data distribution method can distribute object replicas to the storage devices in each storage layer according to the composition information and the replica placement strategy information. The technical problem solved is how to protect data security as far as possible when coincident failures occur. The method improves the internal operating performance of a computer system by means of a computer program: it distributes object replicas efficiently and robustly in a structured, hierarchical storage cluster, uses technical means that conform to the laws of nature, and achieves the technical effects of fully decentralized data distribution and avoidance of data loss. The distributed file system and data distribution method of the present invention are therefore a solution that improves the internal and external performance of a computer system by means of a computer program. They constitute a technical solution within the meaning of Article 2, paragraph 2 of the Patent Law and are eligible subject matter for patent protection.

It should be noted that the technical features described above may be further combined with one another to form various embodiments not listed above, all of which are regarded as falling within the scope of this specification. Furthermore, those of ordinary skill in the art may make improvements or variations based on the above description, and all such improvements and variations shall fall within the scope of protection of the appended claims.

Claims

1. A data distribution method for a distributed file system, characterized in that it comprises the following steps:

dividing the distributed file system into a number of storage layers, each storage layer comprising a number of storage devices;

setting the composition information of each storage layer and its storage devices, together with replica placement strategy information;

distributing object replicas to the storage devices in each storage layer according to the composition information and the replica placement strategy information.

2. The data distribution method according to claim 1, characterized in that the following step is also carried out before object replicas are distributed: selecting or modifying at least one replica placement strategy in the replica placement strategy information.

3. The data distribution method according to claim 2, characterized in that setting the composition information of each storage layer and its storage devices further comprises: setting the composition information of each storage layer and its storage devices as a hierarchical cluster map of the available resources, and establishing in the hierarchical cluster map the logical element connections of the storage devices; and

setting the replica placement strategy information further comprises: setting the replica placement strategy information to include data placement rules, which comprise selecting a number of target devices on which to store replicas and the constraints on the replicas; wherein the constraints at least comprise selecting a storage layer.

4. The data distribution method according to claim 3, characterized in that the data placement rule further comprises an input step and a generation step;

the input step receives an integer X;

the generation step generates a list of candidate replica storage locations from the integer X;

and the input step is carried out when object replicas are distributed.

5. The data distribution method according to claim 4, characterized in that the data placement rules are defined by the method and specify how data is distributed; each data placement rule comprises a sequence of operations applied to the corresponding storage layer.

6. The data distribution method according to claim 5, characterized in that setting the composition information of each storage layer and its storage devices further comprises: for each storage device in a storage layer, setting its weight according to the capacity and performance of that storage device;

and setting the replica placement strategy information further comprises: distributing data objects to the storage devices in a balanced manner according to their weights.

7. The data distribution method according to claim 6, characterized in that it further comprises the following step: the distributed file system uses object pools to manage the storage layers and their storage devices;

and all data objects in an object pool share the same object information, which comprises the number of replicas and the distribution rule, and the object information is cached in each client.

8. The data distribution method according to claim 7, characterized in that

each storage device is assigned a number of placement groups;

a hash function maps the attribute key of a data object to a placement group, and one placement group holds multiple data objects;

a consistent hash function maps the placement group number to the storage devices that actually store the data object, so that the replica location information is generated from the placement group number and the number of replicas;

wherein the first replica is the primary replica and the others are secondary replicas.

9. The data distribution method according to claim 8, characterized in that the replicas of the data object are placed on storage devices that differ in power supply system, controller, or physical location.

10. A distributed file system, characterized in that it comprises an object pool and a number of storage layers;

each storage layer comprises a number of storage devices;

the object pool is provided with a control unit, a storage unit, and a connection unit;

the control unit is connected to the storage unit and is configured to set the composition information of each storage layer and its storage devices and the replica placement strategy information, which are stored in the storage unit;

the control unit is also connected to the connection unit and distributes object replicas to the storage devices in each storage layer according to the composition information and the replica placement strategy information.