CN110019082A

CN110019082A - Distributed multi-replica storage method for file data

Info

Publication number: CN110019082A
Application number: CN201710636934.8A
Authority: CN
Inventors: 刘哲; 胡伦良; 张海斌
Original assignee: Putian Information Technology Co Ltd
Current assignee: Potevio Information Technology Co Ltd; Putian Information Technology Co Ltd
Priority date: 2017-07-31
Filing date: 2017-07-31
Publication date: 2019-07-16

Abstract

This application proposes a distributed multi-replica storage method for file data. The method includes: presetting multiple file access frequency levels and the number of file replicas corresponding to each level, where a higher file access frequency level corresponds to a larger number of file replicas; receiving a newly uploaded file and setting its file access frequency level to the highest level; using a distributed file system client to split the file into fragments and, according to the number of file replicas corresponding to the highest file access frequency level, performing multi-replica storage on the fragments of the file; maintaining the access frequency of the file; and, when the file access frequency level of the file is found to have dropped, determining the number of replicas to be deleted from the number of file replicas corresponding to the lowered level, and deleting that number of replicas of every fragment of the file. The application reduces the cost of distributed multi-replica storage of file data.

Description

Distributed multi-replica storage method for file data

Technical field

The present invention relates to the technical field of distributed file storage, and in particular to a distributed multi-replica storage method for file data.

Background technique

Hadoop is a tool for processing massive data in parallel. HDFS (the Hadoop Distributed File System) is mainly used for analyzing large data files. Its main feature is that it splits a very large file into multiple small files, which are deployed on multiple low-specification machines for storage and analysis.

A distributed multi-replica strategy means that, for each of the small files obtained by splitting a very large file, multiple replicas are created and stored on different machines.

The drawbacks of the distributed multi-replica strategy are mainly as follows:

First, the overall storage cost of HDFS is high. Taking the common 3-replica strategy as an example, the storage space actually required is 3 times the capacity of the stored data, which directly increases storage hardware cost.

Second, the processing efficiency of HDFS is reduced. The multi-replica strategy increases the time needed to build the file index and increases the memory consumption of the NameNode (file record node), which creates and stores the description information of each replica.

Third, load balancing capability is insufficient. The multi-replica strategy keeps the same number of replicas for all data in the system rather than treating data differently, so the system cannot dynamically adjust the number of replicas according to demand.

Summary of the invention

The present invention provides a distributed multi-replica storage method for file data, so as to reduce the cost of distributed multi-replica storage of file data.

The technical solution of the present invention is realized as follows:

A distributed multi-replica storage method for file data, in which multiple file access frequency levels are preset and the number of file replicas corresponding to each file access frequency level is set, where a higher file access frequency level corresponds to a larger number of file replicas, the method comprising:

receiving a newly uploaded file, and setting the file access frequency level of the file to the highest level;

splitting the file into fragments using a distributed file system client, and performing multi-replica storage on the fragments of the file according to the number of file replicas corresponding to the highest file access frequency level;

maintaining the file access frequency of the file;

when the file access frequency level of the file is found to have dropped, determining, according to the number of file replicas corresponding to the lowered level, the number of replicas of the file to be deleted, and deleting that number of replicas of every fragment of the file.

A file description queue is set in advance for each level;

Setting the file access frequency level of the file to the highest level further comprises:

placing the file description information of the file into the file description queue corresponding to the highest file access frequency level, the file description information including the file name and the time at which the file was received;

Performing multi-replica storage on the fragments of the file comprises:

for each fragment of the file, allocating the data nodes that store each replica of the fragment, and storing each replica of each fragment on its corresponding data node.

The method further comprises:

receiving a read-file instruction input by a user, the instruction carrying a time period, and looking up the corresponding file name in the file description queues according to the time period;

using the distributed file system client to query the data nodes on which the replicas of each fragment of the file are located, selecting one data node from the data nodes holding the replicas of each fragment, and reading one replica of the fragment from the selected data node;

merging the replicas of all fragments of the file read by the distributed file system client into one complete file and providing it to the user.

Deleting that number of replicas of every fragment of the file according to the number of replicas to be deleted comprises:

determining, according to the file name of the file, the data nodes on which the replicas of each fragment of the file are located; selecting, from the data nodes holding the replicas of each fragment, as many data nodes as there are replicas to be deleted; and deleting the replicas on the selected data nodes.

Determining the number of replicas of the file to be deleted comprises:

looking up the file description information of the file by its file name; taking the access frequency level of the file description queue that holds the found file description information as the current access frequency level of the file; and calculating the difference between the number of replicas corresponding to the current access frequency level and the number of replicas corresponding to the lowered level, this difference being the number of replicas of the file to be deleted.
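
The calculation described above can be illustrated with a minimal Python sketch. The level names, the replica counts (3, 2, 1), the dictionary layout of the queues and the function names are all hypothetical; the source only requires one description queue per level and more replicas at higher levels.

```python
# Hypothetical replica counts per level; the source does not fix these values.
REPLICAS_PER_LEVEL = {"high": 3, "medium": 2, "low": 1}


def current_level(queues, filename):
    """Return the level whose file description queue holds the file."""
    for level, queue in queues.items():
        if any(entry["filename"] == filename for entry in queue):
            return level
    raise KeyError(f"no description found for {filename!r}")


def replicas_to_delete(queues, filename, lowered_level):
    """Replica count of the current level minus that of the lowered level."""
    level = current_level(queues, filename)
    return REPLICAS_PER_LEVEL[level] - REPLICAS_PER_LEVEL[lowered_level]


# One description queue per level; each entry holds the file name and the
# time the file was received (an assumed Unix timestamp here).
queues = {
    "high": [{"filename": "a.log", "received_at": 1501459200}],
    "medium": [],
    "low": [],
}

print(replicas_to_delete(queues, "a.log", "low"))  # 3 - 1
```

With these assumed counts, a file that drops from the highest level to the lowest one has two replicas of every fragment removed.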

An access frequency maintenance period is preset,

Maintaining the access frequency of the file comprises:

at the start of each access frequency maintenance period, resetting the access count and access frequency of all files to 0; when a read-file instruction for a file is received, incrementing the access count of that file by 1; and at the end of the current access frequency maintenance period, calculating the access frequency of each file as: access frequency = access count of the file within the current access frequency maintenance period / length of the access frequency maintenance period.
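
As a rough illustration of this bookkeeping, the sketch below keeps one counter per file and applies the formula at the end of a period. The class name, the method names and the choice of seconds as the unit of period length are assumptions made for the example, not details given in the source.

```python
class AccessFrequencyTracker:
    """Tracks per-file access counts within one maintenance period."""

    def __init__(self, period_seconds):
        self.period_seconds = period_seconds  # assumed unit: seconds
        self.access_counts = {}
        self.frequencies = {}

    def start_period(self):
        # Reset access counts and access frequencies of all files to 0.
        self.access_counts = {}
        self.frequencies = {}

    def record_read(self, filename):
        # Called whenever a read-file instruction for the file is received.
        self.access_counts[filename] = self.access_counts.get(filename, 0) + 1

    def end_period(self, filenames):
        # access frequency = access count in the period / period length
        for name in filenames:
            count = self.access_counts.get(name, 0)
            self.frequencies[name] = count / self.period_seconds
        return dict(self.frequencies)


tracker = AccessFrequencyTracker(period_seconds=3600)
tracker.start_period()
tracker.record_read("a.log")
tracker.record_read("a.log")
frequencies = tracker.end_period(["a.log", "b.log"])
```

Here `a.log` was read twice in a one-hour period, so its frequency is 2/3600 accesses per second, while `b.log` was never read and stays at 0.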

Presetting multiple file access frequency levels and setting the number of file replicas corresponding to each file access frequency level is: a file manager presets multiple file access frequency levels and sets the number of file replicas corresponding to each level;

Receiving a newly uploaded file and setting the file access frequency level of the file to the highest level comprises:

the file manager receives the newly uploaded file, sets the file access frequency level of the file to the highest level, determines that the number of replicas of the file is the number of file replicas corresponding to the highest file access frequency level, and sends the file and its number of replicas to the distributed file system client;

Splitting the file into fragments is: the distributed file system client splits the file into fragments;

Performing multi-replica storage on the fragments of the file according to the number of file replicas corresponding to the highest file access frequency level comprises:

the distributed file system client sends all fragment identifiers of the file and the number of replicas to the file record node and, according to the identifiers of all data nodes allocated to each fragment as returned by the file record node, stores each replica of each fragment on its corresponding data node;

Maintaining the access frequency of the file is: the file manager maintains the access frequency of the file;

When the file access frequency level of the file is found to have dropped, determining the number of replicas of the file to be deleted according to the number of file replicas corresponding to the lowered level comprises:

the file manager finds that the file access frequency level of the file has dropped, determines the number of replicas of the file to be deleted according to the number of file replicas corresponding to the lowered level, and sends the file name of the file and the number of replicas to be deleted to the distributed file system client;

the distributed file system client determines all fragment identifiers of the file, sends them together with the number of replicas to be deleted to the file record node and, according to the identifiers, returned by the file record node, of the data nodes holding the replicas selected for deletion for each fragment, deletes the replicas on the corresponding data nodes.

The distributed file system is the Hadoop Distributed File System (HDFS).

By maintaining the access frequency of files and dynamically deleting file replicas as that access frequency changes, the present invention reduces the cost of distributed multi-replica storage of file data and improves the processing efficiency and load balancing capability of the distributed file system.

Brief description of the drawings

Fig. 1 is a flowchart of the distributed multi-replica storage method for file data provided by an embodiment of the present application;

Fig. 2 is a flowchart of the file writing method for distributed multi-replica storage of file data provided by an embodiment of the present application;

Fig. 3 is a flowchart of the file reading method for distributed multi-replica storage of file data provided by an embodiment of the present application;

Fig. 4 is a flowchart of the method for dynamically maintaining replicas in distributed multi-replica storage of file data provided by an embodiment of the present application.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flowchart of the distributed multi-replica storage method for file data provided by an embodiment of the present application. The specific steps are as follows:

Step 100: Multiple file access frequency levels are preset, and the number of file replicas corresponding to each file access frequency level is set, where a higher level corresponds to a larger number of file replicas.

For example, three file access frequency levels may be set, called high frequency, medium frequency and low frequency.

Step 101: A newly uploaded file is received, and its file access frequency level is set to the highest level.

Step 102: The file is split into fragments using the distributed file system client, and multi-replica storage is performed on the fragments of the file according to the number of file replicas corresponding to the highest file access frequency level.

Step 103: The access frequency of the file is maintained.

Step 104: When the file access frequency level of the file is found to have dropped, the number of replicas of the file to be deleted is determined according to the number of file replicas corresponding to the lowered level, and that number of replicas of every fragment of the file is deleted.

Fig. 2 is a flowchart of the file writing method for distributed multi-replica storage of file data provided by an embodiment of the present application. The specific steps are as follows:

Step 200: Multiple file access frequency levels are set on the file manager in advance, the file access frequency range of each level is set, and the number of file replicas corresponding to each level is set; in addition, a file description queue is created on the file manager in advance for each file access frequency level.

For example, three file access frequency levels may be set, called high frequency, medium frequency and low frequency; the frequency range corresponding to each of the high, medium and low levels is set, and the number of file replicas corresponding to each of these levels is set.

Clearly, the higher the file access frequency level, the higher the corresponding access frequency and the larger the corresponding number of file replicas. That is, the higher the access frequency of a file, the more replicas it has.
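
The configuration from Step 200 might be represented as in the following sketch. The thresholds, the replica counts and every name in it (`FrequencyLevel`, `LEVELS`, `classify`, `check_monotonic`) are hypothetical values chosen for illustration; the source specifies only that each level has a frequency range, a replica count and its own description queue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyLevel:
    name: str
    min_frequency: float  # inclusive lower bound of the range (assumed unit: accesses/second)
    replica_count: int


# Ordered from highest to lowest level; all numbers are illustrative only.
LEVELS = [
    FrequencyLevel("high", 0.01, 3),
    FrequencyLevel("medium", 0.001, 2),
    FrequencyLevel("low", 0.0, 1),
]


def check_monotonic(levels):
    """A higher level must have a higher range and more replicas."""
    for higher, lower in zip(levels, levels[1:]):
        if higher.min_frequency <= lower.min_frequency:
            raise ValueError("frequency ranges must decrease with the level")
        if higher.replica_count <= lower.replica_count:
            raise ValueError("replica counts must decrease with the level")


def classify(frequency, levels=LEVELS):
    """Return the first (highest) level whose range contains the frequency."""
    for level in levels:
        if frequency >= level.min_frequency:
            return level
    return levels[-1]


check_monotonic(LEVELS)
# One file description queue per level, created in advance.
description_queues = {level.name: [] for level in LEVELS}
# A newly uploaded file always starts at the highest level.
initial_level = LEVELS[0]
```

Keeping the levels in a single ordered list makes the "higher level, more replicas" rule easy to check once at startup rather than on every classification.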

Step 201: The file manager receives a newly uploaded file, stamps it with a timestamp, determines that its access frequency level is the highest level, and places the file description information of the file into the file description queue corresponding to the highest access frequency level.

The timestamp here is the time at which the file was received.

The file description information here includes the file name, the timestamp, and so on.

Step 202: The file manager sends the file and its corresponding number of replicas (i.e., the number of replicas corresponding to the highest file access frequency level) to the HDFS Client.

Step 203: The HDFS Client sends a file record creation instruction to the NameNode; the instruction carries the file name, timestamp, etc. of the file.

Step 204: The NameNode receives the file record creation instruction, creates a file record according to the file name, timestamp, etc. carried in the instruction, and returns to the HDFS Client a file record creation result message indicating that creation succeeded.

Step 205: The HDFS Client receives the file record creation result message, splits the file into multiple fragments, and sends the description information of each fragment and the number of replicas corresponding to the file to the NameNode in a DataNode (data node) allocation instruction.

The description information of each fragment includes the fragment identifier, the fragment size, and so on.

Step 206: The NameNode receives the DataNode allocation instruction, allocates to each fragment, according to the fragment description information and the number of replicas carried in the instruction, the DataNodes that store each replica of the fragment, returns the DataNode allocation result to the HDFS Client, and saves the DataNode allocation result.

The DataNode allocation result contains the DataNode information allocated to each replica of each fragment, i.e., the correspondence between the fragment identifier of each fragment and the identifiers of the DataNodes allocated to all replicas of that fragment.

Step 207: The HDFS Client receives the DataNode allocation result and, according to the identifiers of the DataNodes allocated to each replica of each fragment of the file as indicated in the result, stores each replica of each fragment on its corresponding DataNode.

Step 208: The HDFS Client saves the fragment description information of the file.

Here, the fragment description information of a file includes at least the file name of the file and the fragment identifiers of all its fragments.
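
The write path of Steps 205 to 208 can be approximated in memory as follows. This is a simplified sketch, not HDFS code: the round-robin placement policy, the `name#index` fragment identifiers, the tiny fragment size and all function names are assumptions, since the source does not say how the NameNode chooses DataNodes.

```python
def split_into_fragments(data, fragment_size):
    """Client side: cut the file content into fixed-size fragments."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [data[i:i + fragment_size] for i in range(0, len(data), fragment_size)]


def allocate_datanodes(fragment_ids, replica_count, datanodes):
    """NameNode side: pick replica_count distinct DataNodes per fragment.

    Round-robin placement is an assumption made for this example.
    """
    if replica_count > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replica count")
    allocation = {}
    for index, fragment_id in enumerate(fragment_ids):
        allocation[fragment_id] = [
            datanodes[(index + k) % len(datanodes)] for k in range(replica_count)
        ]
    return allocation


def store_replicas(fragments_by_id, allocation, storage):
    """Client side: write every replica to its allocated DataNode."""
    for fragment_id, nodes in allocation.items():
        for node in nodes:
            storage[node][fragment_id] = fragments_by_id[fragment_id]


filename = "report.txt"
datanodes = ["dn1", "dn2", "dn3", "dn4"]
fragments = split_into_fragments(b"abcdefghij", fragment_size=4)
fragment_ids = [f"{filename}#{i}" for i in range(len(fragments))]
allocation = allocate_datanodes(fragment_ids, replica_count=3, datanodes=datanodes)

storage = {node: {} for node in datanodes}
store_replicas(dict(zip(fragment_ids, fragments)), allocation, storage)

# Fragment description information kept by the client (Step 208).
fragment_descriptions = {filename: fragment_ids}
```

The `allocation` mapping plays the role of the DataNode allocation result: it relates each fragment identifier to the identifiers of the DataNodes holding its replicas.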

Fig. 3 is a flowchart of the file reading method for distributed multi-replica storage of file data provided by an embodiment of the present application. The specific steps are as follows:

Step 301: The file manager receives a read-file instruction input by a user; the instruction carries a time period.

Step 302: The file manager looks up, in each file description queue, the file description information corresponding to the time period carried in the read-file instruction.

Step 303: The file manager sends a read-file instruction carrying the file name to the HDFS Client, according to the file name in the file description information it found.

Step 304: The HDFS Client looks up, in the fragment description information it has saved for each file, all fragment identifiers corresponding to the file name, and sends a read-replica acquisition message carrying all fragment identifiers of the file to the NameNode.

Step 305: The NameNode receives the read-replica acquisition message. For each fragment identifier carried in the message, it looks up all DataNode identifiers corresponding to that fragment identifier in the DataNode allocation result it has saved and, according to a preset read-replica selection principle, selects one DataNode identifier from those found. Once a DataNode identifier has been selected for every fragment identifier carried in the message, the NameNode returns the selected DataNode identifiers to the HDFS Client in a read-replica acquisition response message.

Because each fragment of a file may have multiple replicas, each stored on its own DataNode, the NameNode must select one of the DataNode identifiers found for each fragment identifier. The selection principle (i.e., the read-replica selection principle above) may be the shortest route: the route between the DataNode corresponding to the selected identifier and the HDFS Client is the shortest, so that the HDFS Client can read the replica in the shortest time. The selection principle may of course be some other principle, defined in advance.

Step 306: The HDFS Client receives the read-replica acquisition response message, reads the replica of each fragment from the corresponding DataNode according to the DataNode identifiers selected for all fragment identifiers of the file as carried in the message and, once reading is complete, merges the replicas of all fragments into one complete file and returns it to the file manager.

Step 307: The file manager provides the file returned by the HDFS Client to the user.
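
A minimal sketch of the shortest-route read selection in Steps 305 and 306 is given below. The `route_length` table (a number per DataNode standing for its route length to the client), the in-memory `storage` dictionary and the function names are hypothetical; the source does not define how route length is measured.

```python
def select_read_replicas(allocation, route_length):
    """NameNode side: for each fragment, pick the DataNode with the shortest route."""
    return {
        fragment_id: min(nodes, key=lambda node: route_length[node])
        for fragment_id, nodes in allocation.items()
    }


def read_file(fragment_ids, chosen, storage):
    """Client side: read one replica per fragment and merge them in order."""
    return b"".join(storage[chosen[fid]][fid] for fid in fragment_ids)


# Saved DataNode allocation result: fragment identifier -> DataNode identifiers.
allocation = {"f#0": ["dn1", "dn2"], "f#1": ["dn2", "dn3"]}
# Assumed route lengths between each DataNode and the HDFS Client.
route_length = {"dn1": 3, "dn2": 1, "dn3": 2}
# In-memory stand-in for the replicas held by each DataNode.
storage = {
    "dn1": {"f#0": b"hello "},
    "dn2": {"f#0": b"hello ", "f#1": b"world"},
    "dn3": {"f#1": b"world"},
}

chosen = select_read_replicas(allocation, route_length)
content = read_file(["f#0", "f#1"], chosen, storage)
```

Because `dn2` has the shortest assumed route, both fragments are read from it, and merging them in fragment order yields the complete file.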

Fig. 4 is a flowchart of the method for dynamically maintaining replicas in distributed multi-replica storage of file data provided by an embodiment of the present application. The specific steps are as follows:

Step 401: The file manager presets an access frequency maintenance period and, at the start of each access frequency maintenance period, resets the access count and access frequency of all files to 0.

Step 402: When a read-file instruction for a file is received, the file manager increments the access count of that file by 1.

Step 403: At the end of the current access frequency maintenance period, the file manager calculates the access frequency of each file as: access frequency = access count of the file within the current access frequency maintenance period / length of the access frequency maintenance period.

Step 404: For any file, if at the end of the current access frequency maintenance period the file manager confirms, from the frequency range in which the calculated access frequency of the file falls, that the access frequency level of the file has dropped, it calculates the difference between the number of replicas corresponding to the file's current access frequency level and the number of replicas corresponding to the level it is to be lowered to, and takes this difference as the number of replicas of the file to be deleted.

The corresponding file description queue can be found by the file name of the file, and the access frequency level of that queue is taken as the current access frequency level of the file.

Step 405: The file manager issues a replica deletion instruction to the HDFS Client; the instruction carries the file name of the file and the number of replicas to be deleted.

Step 406: The HDFS Client receives the replica deletion instruction, looks up, by the file name carried in the instruction, all fragment identifiers of the file in the fragment description information it has saved for each file, and sends a delete-replica acquisition message to the NameNode; the message carries all fragment identifiers of the file and the number of replicas to be deleted.

Step 407: The NameNode receives the delete-replica acquisition message. For each fragment identifier carried in the message, it looks up all DataNode identifiers corresponding to that fragment identifier in the DataNode allocation result it has saved and, according to a preset delete-replica selection principle, selects from those found as many DataNode identifiers as there are replicas to be deleted. Once DataNode identifiers have been selected for all fragment identifiers, the NameNode returns them to the HDFS Client in a delete-replica acquisition response message and, at the same time, updates the DataNode allocation result it has saved for all fragment identifiers of the file.

That is, if the number of replicas to be deleted is m, the NameNode needs to select m DataNode identifiers for each fragment identifier. The selection principle (i.e., the delete-replica selection principle above) may be the longest-route principle: the route between the DataNode corresponding to the selected identifier and the HDFS Client is the longest. If m > 1, the NameNode, following the longest-route principle, successively selects from the DataNode identifiers corresponding to the fragment identifier the one whose DataNode has the longest route to the HDFS Client, until m DataNodes have been selected.

Step 408: The HDFS Client receives the delete-replica acquisition response message and, according to the DataNode identifiers selected for all fragment identifiers as carried in the message, sends a delete-replica instruction to each of those DataNodes.

Step 409: After receiving the replica deletion completion messages sent by all the DataNodes, the HDFS Client sends a replica deletion completion message carrying the file name to the file manager.

Step 410: The file manager receives the replica deletion completion message, finds the file description information of the file, by its file name, in the file description queue of the corresponding access frequency level, and moves that file description information into the file description queue of the access frequency level to which the file has been lowered.
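
The deletion side of Fig. 4 (Steps 407 and 410) could look roughly like the sketch below. The `route_length` table, the queue layout and the function names are hypothetical, and the guard that refuses to remove every replica of a fragment is an added safety assumption that the source does not state.

```python
def select_replicas_to_delete(allocation, route_length, m):
    """NameNode side: per fragment, pick the m DataNodes with the longest routes."""
    selected = {}
    for fragment_id, nodes in allocation.items():
        if m >= len(nodes):
            # Assumption: at least one replica of each fragment must survive.
            raise ValueError(f"cannot delete {m} of {len(nodes)} replicas")
        farthest_first = sorted(nodes, key=lambda n: route_length[n], reverse=True)
        selected[fragment_id] = farthest_first[:m]
    return selected


def update_allocation(allocation, selected):
    """NameNode side: drop the deleted replicas from the saved allocation result."""
    return {
        fragment_id: [n for n in nodes if n not in selected[fragment_id]]
        for fragment_id, nodes in allocation.items()
    }


def move_description(queues, filename, old_level, new_level):
    """File manager side: move the file description to the lowered level's queue."""
    entry = next(e for e in queues[old_level] if e["filename"] == filename)
    queues[old_level].remove(entry)
    queues[new_level].append(entry)


allocation = {"f#0": ["dn1", "dn2", "dn3"]}
route_length = {"dn1": 1, "dn2": 5, "dn3": 3}  # assumed route lengths to the client
selected = select_replicas_to_delete(allocation, route_length, m=2)
remaining = update_allocation(allocation, selected)

queues = {"high": [{"filename": "f", "received_at": 1501459200}], "low": []}
move_description(queues, "f", "high", "low")
```

Deleting the farthest replicas first keeps the copy that is cheapest to read, which mirrors the shortest-route principle used on the read path.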

The beneficial effects of the present application are as follows:

By initially setting and then maintaining the access frequency of files, and dynamically deleting file replicas as that access frequency changes, the application reduces the cost of distributed multi-replica storage of file data and improves the processing efficiency and load balancing capability of the distributed file system.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A distributed multi-replica storage method for file data, characterized in that multiple file access frequency levels are preset and the number of file replicas corresponding to each file access frequency level is set, where a higher file access frequency level corresponds to a larger number of file replicas, the method comprising:

receiving a newly uploaded file, and setting the file access frequency level of the file to the highest level;

splitting the file into fragments using a distributed file system client, and performing multi-replica storage on the fragments of the file according to the number of file replicas corresponding to the highest file access frequency level;

maintaining the file access frequency of the file;

when the file access frequency level of the file is found to have dropped, determining, according to the number of file replicas corresponding to the lowered level, the number of replicas of the file to be deleted, and deleting that number of replicas of every fragment of the file.

2. The method according to claim 1, characterized in that a file description queue is set in advance for each level;

the file description information of the file is placed into the file description queue corresponding to the highest file access frequency level, the file description information including the file name and the time at which the file was received;

performing multi-replica storage on the fragments of the file comprises:

for each fragment of the file, allocating the data nodes that store each replica of the fragment, and storing each replica of each fragment on its corresponding data node.

3. The method according to claim 2, characterized in that the method further comprises:

receiving a read-file instruction input by a user, the instruction carrying a time period, and looking up the corresponding file name in the file description queues according to the time period;

using the distributed file system client to query the data nodes on which the replicas of each fragment of the file are located, selecting one data node from the data nodes holding the replicas of each fragment, and reading one replica of the fragment from the selected data node;

merging the replicas of all fragments of the file read by the distributed file system client into one complete file and providing it to the user.

4. The method according to claim 2, characterized in that deleting that number of replicas of every fragment of the file according to the number of replicas to be deleted comprises:

determining, according to the file name of the file, the data nodes on which the replicas of each fragment of the file are located; selecting, from the data nodes holding the replicas of each fragment, as many data nodes as there are replicas to be deleted; and deleting the replicas on the selected data nodes.

5. The method according to claim 2, characterized in that determining the number of replicas of the file to be deleted comprises:

looking up the file description information of the file by its file name; taking the access frequency level of the file description queue that holds the found file description information as the current access frequency level of the file; and calculating the difference between the number of replicas corresponding to the current access frequency level and the number of replicas corresponding to the lowered level, this difference being the number of replicas of the file to be deleted.

6. The method according to claim 1, characterized in that an access frequency maintenance period is preset,

maintaining the access frequency of the file comprises:

at the start of each access frequency maintenance period, resetting the access count and access frequency of all files to 0; when a read-file instruction for a file is received, incrementing the access count of that file by 1; and at the end of the current access frequency maintenance period, calculating the access frequency of each file as: access frequency = access count of the file within the current access frequency maintenance period / length of the access frequency maintenance period.

7. The method according to claim 1, characterized in that presetting multiple file access frequency levels and setting the number of file replicas corresponding to each file access frequency level is: a file manager presets multiple file access frequency levels and sets the number of file replicas corresponding to each level;

the file manager receives the newly uploaded file, sets the file access frequency level of the file to the highest level, determines that the number of replicas of the file is the number of file replicas corresponding to the highest file access frequency level, and sends the file and its number of replicas to the distributed file system client;

performing multi-replica storage on the fragments of the file according to the number of file replicas corresponding to the highest file access frequency level comprises:

the distributed file system client sends all fragment identifiers of the file and the number of replicas to the file record node and, according to the identifiers of all data nodes allocated to each fragment as returned by the file record node, stores each replica of each fragment on its corresponding data node;

when the file access frequency level of the file is found to have dropped, determining the number of replicas of the file to be deleted according to the number of file replicas corresponding to the lowered level comprises:

the file manager finds that the file access frequency level of the file has dropped, determines the number of replicas of the file to be deleted according to the number of file replicas corresponding to the lowered level, and sends the file name of the file and the number of replicas to be deleted to the distributed file system client;

deleting that number of replicas of every fragment of the file according to the number of replicas to be deleted comprises:

the distributed file system client determines all fragment identifiers of the file, sends them together with the number of replicas to be deleted to the file record node and, according to the identifiers, returned by the file record node, of the data nodes holding the replicas selected for deletion for each fragment, deletes the replicas on the corresponding data nodes.

8. The method according to claim 7, characterized in that the distributed file system is the Hadoop Distributed File System (HDFS).