CN109634911A

CN109634911A - A storage method based on an HDFS CD server

Info

Publication number: CN109634911A
Application number: CN201811443283.1A
Authority: CN
Inventors: 王子炫; 张育平
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2019-04-16

Abstract

The present invention discloses a storage method based on an HDFS CD server, comprising the following steps: Step 1, read the files newly uploaded by the user and label each file; Step 2, build a file information directory from the file labels and store the files on disk; Step 3, periodically scan the files on disk and update each file's temperature; Step 4, evaluate the file temperature: when the temperature is less than 0, migrate the file according to its modification flag, disk flag and CD server flag; when the temperature is greater than 0, keep the file where it is on disk. The method combines the advantages of disk and the HDFS CD server: rarely used cold data in the system is migrated to the HDFS CD server, and user response time is reduced.

Description

A storage method based on an HDFS CD server

Technical field

The invention belongs to the technical field of storage systems, and in particular relates to a storage method based on an HDFS CD server.

Background art

With the rapid development and widespread adoption of the Internet, the total volume of global data has grown explosively. According to an IDC (International Data Corporation) survey report, the data generated worldwide in 2013 alone reached 4.4 ZB, and this figure is doubling roughly every two years; the global total is expected to reach 44 ZB by 2020. This growth not only raises the storage cost of data centers in terms of storage equipment, but also poses major challenges for data maintenance cost and data security. Moreover, 80% of user access requests concentrate on 20% of the data; keeping the remaining 80% of the data in disk arrays increases storage cost.

Among current big-data storage systems based on optical storage media, the Hadoop distributed file system built on a CD server (HDFS CD server) is one of the most widely used. Compared with a conventional CD server (optical disc library), the HDFS CD server offers much greater storage capacity and transfer speed. However, because of the storage structure of the distributed system and the physical structure of the CD server, when a user accesses a file, the time spent querying the storage location of the file's data blocks and the time the CD server's mechanical arm spends fetching and loading discs both add to the user response time and seriously degrade the user experience.

Summary of the invention

The purpose of the present invention is to provide a storage method based on an HDFS CD server that combines the advantages of disk and the HDFS CD server, migrates rarely used cold data in the system to the HDFS CD server, and reduces user response time.

To achieve the above purpose, the solution of the invention is:

A storage method based on an HDFS CD server, comprising the following steps:

Step 1, read the files newly uploaded by the user and label each file;

Step 2, build a file information directory from the file labels and store the files on disk;

Step 3, periodically scan the files on disk and update each file's temperature;

Step 4, evaluate the file temperature: when the temperature is less than 0, migrate the file according to its modification flag, disk flag and CD server flag; when the temperature is greater than 0, keep the file where it is on disk.

In step 3 above, the file temperature is calculated as follows:

Here, fileHeat1 is the updated temperature of the file, fileHeat0 is the initial temperature of the file, t_scan is the file's previous scan time, t_visit is the file's last physical access time, t_now is the file's current scan time, and visitNum is the number of times the file on disk has been accessed.

In step 4 above, when the file temperature is less than 0, the file is migrated according to its file information:

When changFlag=1, hddFlag=1 and bdFlag=0, the file is stored in the disk array and has turned into cold data after not being accessed for a long time; it is handed to the HDFS CD library module for migration;

When changFlag=0, hddFlag=1 and bdFlag=1, the file is data that was restored from the HDFS CD server to the disk array, and it does not need to be burned again;

When changFlag=1, hddFlag=1 and bdFlag=1, the file is data that was restored from the HDFS CD server to the disk array; it is burned again by the HDFS CD library module, and the original file information is overwritten;

Here, changFlag is the modification flag, hddFlag is the disk flag, and bdFlag is the CD server flag.
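
The three flag combinations above can be read as a small lookup table. The following Python sketch is only an illustration under stated assumptions: the `Action` names and the `decide_migration` function are hypothetical, and any flag combination the text does not describe is reported as unspecified rather than guessed.

```python
from enum import Enum


class Action(Enum):
    FIRST_BURN = "hand to the HDFS CD library module for migration"
    NO_REBURN = "already on disc and unmodified, no repeated burning"
    REBURN = "burn again and overwrite the original file information"
    UNSPECIFIED = "combination not described in the text"


def decide_migration(chang_flag: int, hdd_flag: int, bd_flag: int) -> Action:
    """Map the flags of a cold file (temperature < 0) to a migration action."""
    table = {
        (1, 1, 0): Action.FIRST_BURN,  # on disk only, never burned
        (0, 1, 1): Action.NO_REBURN,   # restored from disc, not modified
        (1, 1, 1): Action.REBURN,      # restored from disc, then modified
    }
    return table.get((chang_flag, hdd_flag, bd_flag), Action.UNSPECIFIED)
```

For example, `decide_migration(1, 1, 1)` yields `Action.REBURN`, matching the third case.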

The above storage method further includes step 5: evaluate the size of the file to be migrated. If the file is larger than 128 MB, it is called a big file and is uploaded to the HDFS CD server for burning; if the file is smaller than 128 MB, it is called a small file and is migrated by small-file merging.

Step 5 above further includes inserting the small-file information into the file information directory in the HDFS CD library module.

Step 5 above further includes evaluating the total size of all files under the level-2 label directory into which the small-file information was newly inserted: if the total is larger than 128 MB, small-file merging is performed; if it is smaller than 128 MB, nothing is done.
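
The two size checks in step 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, "128M" is read as 128 MiB, and a size of exactly 128M (which the text does not cover) is treated as small.

```python
from typing import Iterable

SIZE_THRESHOLD = 128 * 1024 * 1024  # "128M" read as 128 MiB (assumption)


def route_cold_file(size_bytes: int) -> str:
    """Big files are burned directly; small files wait in a merge queue."""
    return "burn_directly" if size_bytes > SIZE_THRESHOLD else "merge_queue"


def label_ready_to_merge(file_sizes: Iterable[int]) -> bool:
    """True once the files under one level-2 label directory exceed 128M in total."""
    return sum(file_sizes) > SIZE_THRESHOLD
```

Keeping the threshold in one constant makes it easy to align it with the block size configured for the underlying HDFS cluster.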

Step 5 above further includes: uploading the big-file index directory and the big-file name to the NameNode node in the HDFS CD server, where the NameNode node saves them; querying the usage status and record information of the current DataNode nodes, and selecting the optimal DataNode node to establish communication with the HDFS CD library module and carry out the actual file burning operation.

The above storage method further includes: when a small file is accessed, or when another small file in the same big file is accessed, the NameNode node is queried for the file name, index directory and storage location of the big file that the small file belongs to; communication is established between the DataNode node and the HDFS CD library module, and the big file and its index list are restored to disk space.

The above storage method further includes step 10: the big file is split back into small files according to the big-file index directory, and the temperature value of each small file is reset to its initial value.

With the above scheme, the invention has the following advantages:

(1) The invention improves data storage speed. Data entering the system is first stored in the disk cache and is burned to disc only after processing such as data classification and small-file merging; data travels faster from disk to disk than from disk to the HDFS CD server;

(2) The invention reduces the number of disc fetches by the CD server. Files with the same label are merged into big files suited to HDFS CD server access, and big files with the same label are burned onto the same disc. Files on the same disc are strongly related, so by the principle of spatial locality, several consecutive accesses by the system will very likely fall on the same disc, which reduces the number of times the mechanical arm fetches discs. Burning big files in batches also avoids frequent disc fetches by the CD server;

(3) The invention reduces user response time. On the one hand, when a user stores a file, the file only needs to be written to disk, and the user does not need to be concerned with the subsequent burning; on the other hand, caching and file prefetching bring the files the system is likely to access next onto disk in advance, which reduces the number of times the system accesses the HDFS CD server and thus reduces user response time.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the invention;

Fig. 2 is a work flow diagram of the invention;

Fig. 3 is a schematic structural diagram of the file labels in the invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of the invention and the features in the embodiments may be combined with one another.

In addition, all directions or positional relationships mentioned in the embodiments of the invention are based on the positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the invention, and do not indicate or imply that the device or element referred to must have a specific orientation; they should not be understood as limiting the invention.

The invention is described in detail below with reference to the drawings and embodiments.

As shown in Fig. 1, a storage system based on an HDFS CD server in one embodiment of the invention comprises a memory, a disk and an HDFS CD server. The disk is used to store files and includes a disk management module, an HDFS CD library module, a file classification module and a file migration module. The disk management module manages the files on disk and handles communication between the storage system and the user; the HDFS CD library module handles communication between the disk and the HDFS CD server; the file classification module divides the files on disk into cold data and hot data; the HDFS CD server stores the cold data; and the file migration module migrates files between the HDFS CD server and the disk.

This arrangement reduces user response time. On the one hand, when a user stores a file, the file only needs to be written to disk, and the user does not need to be concerned with the subsequent burning; on the other hand, caching and file prefetching bring the files the system is likely to access next onto disk in advance, which reduces the number of times the system accesses the HDFS CD server and thus reduces user response time.

Because the HDFS CD server has clear advantages in unit storage cost, data security and service life, this embodiment uses it as a tertiary storage device for the disk. The file classification module periodically scans the disk and hands the cold data on disk to the file migration module, which migrates files between the HDFS CD server and the disk.

The HDFS CD library module handles communication between the disk and the HDFS CD server, including task forwarding, small-file merging, file burning and file recovery. When the system accesses a file, it first checks through the disk management module whether the file is on disk; if the file is not found there, it is looked up through the HDFS CD server.
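
The lookup order described above (disk first, HDFS CD server second) can be sketched as below. The two catalogs are modeled as plain dictionaries mapping a file name to a location string; that representation and the name `locate_file` are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def locate_file(
    name: str,
    disk_catalog: Dict[str, str],
    cd_catalog: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (tier, location) for a file, checking the disk tier first."""
    if name in disk_catalog:
        return ("disk", disk_catalog[name])
    if name in cd_catalog:
        # Not on disk: fall back to the catalog of burned files.
        return ("hdfs_cd_server", cd_catalog[name])
    return None  # unknown to both tiers
```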

The storage directory in the HDFS CD library module records all files that have been burned to disc; these are mainly files stored in the system for the first time and files that need to be burned again.

Further, the storage system also includes a directory generation module, which builds file storage directories in the disk management module and in the HDFS CD library module. The storage directory in the disk management module records the information of all files on disk, which makes it convenient to manage those files; file information can be queried quickly, and related files can be found through the file label information. The virtual storage module records the information of all files that have been burned, which helps with merging small files and with establishing relationships between files. The module records all files under the root directory: each level-1 label holds the information of all level-2 labels under it, and each level-2 label holds a linked list of the information of all files under that label.

Note that when a file is stored in the system, the system first applies natural language processing to the file content and file name and counts the most frequent words; it then assigns each file a level-1 label and a level-2 label according to a term database. A level-1 label covers a wider range than a level-2 label and contains multiple level-2 labels. Files with the same label are related to one another, and the narrower the label's range, the stronger the relationship. Both the virtual HDFS CD library module and the disk management module are designed around file labels, as shown in Fig. 3.
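
A two-level label directory of this kind can be sketched as a nested mapping. In this illustration the class name, method names and example labels are hypothetical, and a Python list stands in for the linked list of file information mentioned in the text.

```python
from collections import defaultdict
from typing import Dict, List


class LabelDirectory:
    """Root -> level-1 label -> level-2 label -> list of file records."""

    def __init__(self) -> None:
        self._tree = defaultdict(lambda: defaultdict(list))

    def insert(self, level1: str, level2: str, file_info: Dict) -> None:
        # Missing label directories are created on first insert.
        self._tree[level1][level2].append(file_info)

    def files_under(self, level1: str, level2: str) -> List[Dict]:
        # Read-only lookup: .get avoids creating empty directories.
        return list(self._tree.get(level1, {}).get(level2, []))

    def level2_labels(self, level1: str) -> List[str]:
        return sorted(self._tree.get(level1, {}))
```

Files that share a level-2 label end up in the same list, which is the grouping that later small-file merging relies on.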

Further, the storage system also includes a file merging unit, which merges small files with the same label on disk into a big file. The file merging unit caches the cold data files produced over a recent period, merges small files with the same label into a big file suited to HDFS CD server storage, stamps them with the same timestamp and stores them in the HDFS CD server. Merging small files largely solves the inefficiency of HDFS and the CD server in handling small files. Files under the same label and the same timestamp are also strongly related: when the system accesses a file in the HDFS CD server, the big file containing it is prefetched to the disk array and the file temperature is reset to its initial value. Because the small files within one big file are strongly related, this reduces the number of times the system accesses the HDFS CD server and improves the system's access speed.

This arrangement improves data storage speed. Data entering the system is first stored in the disk cache and is burned to disc only after processing such as data classification and small-file merging; data travels faster from disk to disk than from disk to the HDFS CD server. It also reduces the number of disc fetches by the CD server: files with the same label are merged into big files suited to HDFS CD server access, and big files with the same label are burned onto the same disc. Files on the same disc are strongly related, so by the principle of spatial locality, several consecutive accesses by the system will very likely fall on the same disc, which reduces the number of times the mechanical arm fetches discs. Burning big files in batches also avoids frequent disc fetches by the CD server.

For a new file to be burned, the storage system first searches the root directory of the HDFS CD library module for the corresponding label directory and creates it if it is not found. The file information is then inserted under the label directory, and the system checks whether the label's capacity meets the burning condition. When it does, all files under the label are merged into a big file, an index table from the big file to its small files is built, a timestamp is set from the creation time of the current big file, and the big file and the index information table are stored in the HDFS CD server. After the big file has been burned, the corresponding files on disk are deleted and the related file information is updated.

In general, frequently used data is called hot data and rarely used data is called cold data. The traditional classification method determines a file's state from the interval between the current scan time and the last physical access time, or from the interval between two consecutive physical accesses. The file classification module handles the conversion of a file between cold data and hot data.

However, the cost of migrating files between the disk array and the HDFS CD server must also be considered. For example, suppose a file is stored in the HDFS CD server in the cold state. If it is accessed once, a plain hot/cold model converts it to the hot state and migrates it back to the disk array; if it is then not accessed for some time, it is classified as cold again and migrated back to the CD server. This increases the migration burden on the system and wastes space in the HDFS CD server. The method therefore computes a file temperature value from the file's scan time and physical access time, determines the file's temperature state from that value, and chooses the migration strategy according to changFlag so that files are not burned repeatedly. In this way the timing of a file's conversion between cold and hot is set more reasonably.

Further, the classification algorithm used by the file classification module is:

Here, fileHeat1 is the updated temperature of the file, fileHeat0 is the initial temperature of the file, t_scan is the file's previous scan time, t_visit is the file's last physical access time, t_now is the file's current scan time, and visitNum is the number of times the file on disk has been accessed. When the fileHeat1 of a file on disk is less than or equal to 0, the file is classified as cold data; when the fileHeat1 of a file on disk is greater than 0, the file is classified as hot data. The advantage of this arrangement is that an occasional single access, followed by no further access in the short term, changes the file's temperature value only slightly, which is not enough to change its temperature state, so thrashing is avoided.
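
The cold/hot decision on an already computed fileHeat1 value can be sketched as below. The sketch takes the temperature values as given inputs and does not attempt to compute them; the function names are hypothetical, and the boundary follows this passage (a value of exactly 0 counts as cold).

```python
from typing import Dict, List, Tuple


def classify(file_heat: float) -> str:
    """fileHeat1 <= 0 means cold data, fileHeat1 > 0 means hot data."""
    return "cold" if file_heat <= 0 else "hot"


def split_by_state(heats: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Partition file names into (cold, hot), preserving input order."""
    cold = [name for name, heat in heats.items() if classify(heat) == "cold"]
    hot = [name for name, heat in heats.items() if classify(heat) == "hot"]
    return cold, hot
```

Only the files in the cold list would be handed to the file migration module; the hot files stay on disk.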

changFlag is the modification flag; it indicates whether the file has been modified on disk. changFlag equal to 0 means the file has not been modified; changFlag equal to 1 means the file has been modified. When a file is uploaded for the first time, the modification flag is 1; when a file is restored from the HDFS CD server to disk, the modification flag is 0; when a file is modified during use, the modification flag is 1.
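
The three events that set changFlag can be sketched as simple state transitions. The `TrackedFile` record and the function names are hypothetical; only the flag values come from the text.

```python
from dataclasses import dataclass


@dataclass
class TrackedFile:
    name: str
    chang_flag: int  # 1 = modified (or never burned), 0 = unmodified copy


def on_first_upload(name: str) -> TrackedFile:
    # A newly uploaded file has never been burned, so the flag starts at 1.
    return TrackedFile(name, chang_flag=1)


def on_restore_from_cd_server(name: str) -> TrackedFile:
    # A restored file is identical to the burned copy, so the flag is 0.
    return TrackedFile(name, chang_flag=0)


def on_modify(record: TrackedFile) -> TrackedFile:
    # Any modification during use sets the flag back to 1.
    record.chang_flag = 1
    return record
```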

When the file temperature is less than 0, the file is migrated according to its file information, in one of the following three ways:

When changFlag=1, hddFlag=1 and bdFlag=0, the file is stored in the disk array and has turned into cold data after not being accessed for a long time; it is handed to the virtual HDFS CD library module for migration.

When changFlag=0, hddFlag=1 and bdFlag=1, the file is data that was restored from the HDFS CD server to the disk array and was not modified while being accessed there. Only the corresponding file information needs to be modified: changFlag is updated to 1 and the file and file path in the disk array are deleted. The file does not need to be burned again.

When changFlag=1, hddFlag=1 and bdFlag=1, the file is data that was restored from the HDFS CD server to the disk array and whose content was modified while being accessed there. Because optical discs cannot be modified, the file must be handed to the virtual HDFS CD library module to be burned again, and the original file information is overwritten. Here, hddFlag is the disk flag and indicates whether the file exists on disk: when hddFlag=0, the file is not on disk; when hddFlag=1, the file is on disk. Similarly, bdFlag is the CD server flag and indicates whether the file exists in the HDFS CD server: when bdFlag=0, the file is not in the HDFS CD server; when bdFlag=1, the file is in the HDFS CD server.

With reference to Fig. 2, the invention provides a control method for a storage system based on an HDFS CD server, built on the storage system described in the above embodiment and comprising the following steps:

S1: read the files newly uploaded by the user and label each file;

S2: build a file information directory from the file labels and store the files on disk;

S3: periodically scan the files on disk and update each file's temperature;

S4: evaluate the file temperature: when the temperature is less than 0, migrate the file according to information such as its modification flag, disk flag and CD server flag; when the temperature is greater than 0, keep the file where it is on disk.

S5: evaluate the size of the file to be migrated. If the file is larger than 128 MB, it is called a big file and is uploaded to the HDFS CD server for burning; if the file is smaller than 128 MB, it is called a small file and is migrated by small-file merging.

The small-file name is uploaded to the NameNode node in the HDFS CD server, and the small-file information is saved in the namespace under the NameNode node; the usage status and record information of the current DataNode nodes are queried, and the optimal DataNode node is selected to establish communication with the HDFS CD library module and carry out the actual file burning operation.

S6: insert the small-file information into the management directory of the HDFS CD library module

The management directory of the HDFS CD library module is searched for the corresponding level-1 and level-2 label directories. If they exist, the small-file information is inserted directly under the corresponding level-2 label directory; if they do not exist, the corresponding label directories are created in the management directory and the small-file information is inserted under the newly created level-2 label directory.

S7: evaluate the total size of all files under the level-2 label directory into which the small-file information was newly inserted. If the total is larger than 128 MB, small-file merging is performed; if it is smaller than 128 MB, no processing is done for the time being.

Merging small files creates index information, as shown in Table 1 below:

Table 1

FileMD5	FileSize	Offset
Small file 1	120K	0
Small file 2	86K	206
……	……	……
Small file n	……	……

FileMD5 is the MD5 value of the file, computed from the file path, file label and file creation time; it guarantees that the file is unique within the system.

FileSize is the size of the file.

Offset is the offset of the small file within the big file.
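
Building a big file together with its index can be sketched as follows. Several details are assumptions made for illustration: the way path, label and creation time are joined before hashing, the use of bytes as the unit for FileSize and Offset, and the reading of Offset as the starting position of each small file inside the big file. The function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

SmallFile = Tuple[str, str, str, bytes]  # (path, label, created, data)


def file_md5(path: str, label: str, created: str) -> str:
    """MD5 over path, label and creation time (the join format is assumed)."""
    key = f"{path}|{label}|{created}".encode("utf-8")
    return hashlib.md5(key).hexdigest()


def merge_small_files(files: List[SmallFile]) -> Tuple[bytes, List[Dict]]:
    """Concatenate small files and record FileMD5, FileSize, Offset for each."""
    blob = bytearray()
    index = []
    for path, label, created, data in files:
        index.append({
            "FileMD5": file_md5(path, label, created),
            "FileSize": len(data),
            "Offset": len(blob),  # start position inside the big file
        })
        blob.extend(data)
    return bytes(blob), index
```

The index is stored next to the big file so that individual small files can be cut out again without scanning the whole blob.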

S8: big-file burning

S81: upload the big-file index directory and the big-file name to the NameNode node in the HDFS CD server; the NameNode node saves the big-file index directory and the big-file name;

S82: query the usage status and record information of the current DataNode nodes, and select the optimal DataNode node to establish communication with the HDFS CD library module and carry out the actual file burning operation.

Files uploaded from disk are cached in the DataNode node before the actual burning is carried out. The DataNode node first caches the files uploaded from disk in the CD server buffer; data blocks of big files that share a label but carry different timestamps are then burned in different optical drives, so that when these files are accessed the drives can work in parallel, which raises the data transfer speed between the HDFS CD server and the disk. Data blocks belonging to the same big file are burned in the same optical drive, so a big file spans at most two discs; this reduces disc-swapping operations in the CD server and improves its read/write speed.
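
One way to spread big files of the same label across drives is a per-label round-robin, sketched below. The round-robin policy, the function name and the `(label, timestamp)` key are assumptions made for illustration; the text only states that big files with the same label and different timestamps go to different drives, and that all blocks of one big file stay on one drive.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BigFileKey = Tuple[str, int]  # (label, timestamp)


def assign_drives(big_files: List[BigFileKey], drive_count: int) -> Dict[BigFileKey, int]:
    """Give each big file one drive; same-label files rotate over the drives."""
    if drive_count < 1:
        raise ValueError("at least one optical drive is required")
    next_slot = defaultdict(int)  # per-label round-robin counter
    placement = {}
    for label, timestamp in sorted(big_files):
        placement[(label, timestamp)] = next_slot[label] % drive_count
        next_slot[label] += 1
    return placement
```

If a label has more big files than there are drives, the assignment wraps around, so the "different drives" property holds only up to `drive_count` files per label.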

S9: file reading

When a small file is accessed, or when another small file in the same big file is accessed, the NameNode node is queried for the file name, index directory and storage location of the big file that the small file belongs to; communication is established between the DataNode node and the HDFS CD library module, and the big file and its index list are restored to disk space.

S10: small-file extraction. The big file is split back into small files according to the big-file index directory, and the temperature value of each small file is reset to its initial value.

The big file is split into small files according to the FileMD5, FileSize and Offset information in the big-file index directory, and the small-file information is inserted into the disk management module directory; the label information of the small files is the same as that of the big file.
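
Splitting a restored big file by its index can be sketched as below. The index entries are assumed to be dictionaries with byte-based `FileSize` and `Offset` fields, and `initial_heat` is a hypothetical parameter because the text does not state the initial temperature value.

```python
from typing import Dict, List, Tuple


def split_big_file(
    blob: bytes,
    index: List[Dict],
    initial_heat: float,
) -> Dict[str, Tuple[bytes, float]]:
    """Return {FileMD5: (data, heat)} with every heat reset to initial_heat."""
    restored = {}
    for entry in index:
        start = entry["Offset"]
        end = start + entry["FileSize"]
        if start < 0 or end > len(blob):
            raise ValueError(f"index entry {entry['FileMD5']} is out of range")
        restored[entry["FileMD5"]] = (blob[start:end], initial_heat)
    return restored
```

Resetting every extracted file to the same temperature means the whole group starts on disk as equally hot, which fits the prefetching rationale given earlier.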

Note that in step S1 each small file is labeled by natural language processing: every file corresponds to a related type, that is, to keywords related to the file. For example, unnecessary words in the file are removed and the keywords that occur in the file are counted. A small-file information directory is then built from the small-file labels; it includes directories built according to creation time, access time, file temperature and so on.
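
The keyword-counting step can be sketched with a word counter. Everything specific here is an assumption made for illustration: the stop-word list, the English-only tokenizer, and the `term_db` mapping from a keyword to a (level-1, level-2) label pair stand in for the term database and natural language processing that the text mentions without detailing.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # illustrative


def top_keywords(text: str, k: int = 3) -> List[str]:
    """Drop unnecessary words and return the k most frequent remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]


def label_for(text: str, term_db: Dict[str, Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the (level-1, level-2) label of the first keyword found in term_db."""
    for word in top_keywords(text):
        if word in term_db:
            return term_db[word]
    return None
```

A file whose keywords match no entry gets no label in this sketch; how such files are handled is not covered by the text.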

The above are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A storage method based on an HDFS CD server, characterized by comprising the following steps:

Step 1, read the files newly uploaded by the user and label each file;

Step 2, build a file information directory from the file labels and store the files on disk;

Step 3, periodically scan the files on disk and update each file's temperature;

Step 4, evaluate the file temperature: when the temperature is less than 0, migrate the file according to its modification flag, disk flag and CD server flag; when the temperature is greater than 0, keep the file where it is on disk.

2. The storage method based on an HDFS CD server according to claim 1, characterized in that in step 3 the file temperature is calculated as follows:

Here, fileHeat1 is the updated temperature of the file, fileHeat0 is the initial temperature of the file, t_scan is the file's previous scan time, t_visit is the file's last physical access time, t_now is the file's current scan time, and visitNum is the number of times the file on disk has been accessed.

3. The storage method based on an HDFS CD server according to claim 1, characterized in that in step 4, when the file temperature is less than 0, the file is migrated according to its file information:

When changFlag=1, hddFlag=1 and bdFlag=0, the file is stored in the disk array and has turned into cold data after not being accessed for a long time; it is handed to the HDFS CD library module for migration;

When changFlag=0, hddFlag=1 and bdFlag=1, the file is data that was restored from the HDFS CD server to the disk array, and it does not need to be burned again;

When changFlag=1, hddFlag=1 and bdFlag=1, the file is data that was restored from the HDFS CD server to the disk array; it is burned again by the HDFS CD library module, and the original file information is overwritten.

4. The storage method based on an HDFS CD server according to claim 3, characterized in that the storage method further includes step 5: evaluate the size of the file to be migrated; if the file is larger than 128 MB, it is called a big file and is uploaded to the HDFS CD server for burning; if the file is smaller than 128 MB, it is called a small file and is migrated by small-file merging.

5. The storage method based on an HDFS CD server according to claim 4, characterized in that step 5 further includes inserting the small-file information into the file information directory in the HDFS CD library module.

6. The storage method based on an HDFS CD server according to claim 5, characterized in that step 5 further includes evaluating the total size of all files under the level-2 label directory into which the small-file information was newly inserted: if the total is larger than 128 MB, small-file merging is performed; if it is smaller than 128 MB, nothing is done.

7. The storage method based on an HDFS CD server according to claim 4, characterized in that step 5 further includes: uploading the big-file index directory and the big-file name to the NameNode node in the HDFS CD server, where the NameNode node saves them; querying the usage status and record information of the current DataNode nodes, and selecting the optimal DataNode node to establish communication with the HDFS CD library module and carry out the actual file burning operation.

8. The storage method based on an HDFS CD server according to claim 7, characterized in that the storage method further includes: when a small file is accessed, or when another small file in the same big file is accessed, the NameNode node is queried for the file name, index directory and storage location of the big file that the small file belongs to; communication is established between the DataNode node and the HDFS CD library module, and the big file and its index list are restored to disk space.

9. The storage method based on an HDFS CD server according to claim 8, characterized in that the storage method further includes step 10: the big file is split back into small files according to the big-file index directory, and the temperature value of each small file is reset to its initial value.