CN109271361B

CN109271361B - Distributed storage method and system for massive small files

Info

Publication number: CN109271361B
Application number: CN201810918747.3A
Authority: CN
Inventors: 唐鹏; 谢彬; 解维; 居晓清; 张楠; 侯亮
Original assignee: CETC 32 Research Institute
Current assignee: CETC 32 Research Institute
Priority date: 2018-08-13
Filing date: 2018-08-13
Publication date: 2020-07-24
Anticipated expiration: 2038-08-13
Also published as: CN109271361A

Abstract

The invention provides a distributed storage method and system for massive small files. Physical disks are logically partitioned into a plurality of virtual disks; the files are classified according to their file naming rule, and a directory index tree is created from the classification. When a file is stored, its name is parsed according to the naming rule to obtain its storage position in the directory index tree; the file is stored there, and the corresponding storage directory is marked as the original directory. The file is also stored redundantly, and the corresponding storage directory is marked as the redundant directory. When the file data in the original directory and the redundant directory are inconsistent, data synchronization and recovery operations are triggered. For the distributed storage of large numbers of small files, the invention locates information quickly within massive data by building an index and positioning files by name, ensures reliability through data redundancy, has a simple mechanism and strong fault tolerance, and reduces the storage space needed for metadata while ensuring correctness.

Description

Distributed storage method and system for massive small files

Technical Field

The invention relates to the field of distributed storage, in particular to a distributed storage method and a distributed storage system for massive small files, and particularly relates to a distributed data organization method applied to a typhoon analysis system.

Background

With the progress of science and technology, human society is entering an era of digital information explosion. The spread of the internet has brought new growth to traditional industries, industry after industry is moving toward informatization, and the total amount of data is growing geometrically. Faced with this explosion of data, the first problem to solve is how to store data efficiently. The conventional stand-alone storage model clearly falls far short of practical requirements, and distributed storage has become a necessity for modern information storage.

The meteorological field is also undergoing informatization. Typhoon analysis is an important branch of meteorology, and it too faces the problems of optimizing storage and locating information quickly when data are stored and analyzed. Unlike in most industries, the raw typhoon data consist of a large number of satellite cloud images captured by various satellites, together with manually compiled typhoon path information. The individual files are small, usually less than 10 MB, but after many years of accumulation the number of cloud images has reached tens of millions and the data volume has reached the terabyte or even petabyte level. How to store these data effectively and how to locate information in them quickly has become an urgent problem in the informatization of the meteorological field.

When the data of a meteorological station system are large in volume, heterogeneous and noisy, they should first be stored quickly and efficiently so that the utilization of storage resources improves, and the time needed for data retrieval should be reduced so that specific information can be located quickly in massive data. Secondly, because the data are of great significance and are valuable for predicting and locating future typhoons, the storage system must ensure not only the speed of data storage and retrieval but also disaster resistance and timely recovery. Common distributed file systems such as HDFS and GFS perform well in these respects by building a cluster environment out of a large number of individual machines. However, these distributed file systems share a common problem: they are inefficient at storing small files. HDFS, for example, uses the block as its basic storage unit, with a default block size of 64 MB. If a file is larger than the block size, the system splits it so that it can be stored across several blocks; if a file is much smaller than the block size, HDFS still occupies an entire block when storing it. As a result, HDFS wastes resources heavily when storing a large number of small files. Common distributed file systems therefore cannot adequately solve the data storage and data positioning problems of a typhoon analysis system.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a distributed storage method and a distributed storage system for massive small files.

The distributed storage method of the massive small files provided by the invention comprises the following steps:

a disk partitioning step: forming a plurality of virtual disks from the physical disks through logical partitioning;

a directory index tree establishing step: based on the virtual disks, classifying the massive small files according to their file naming rule, and creating a directory index tree according to the classification;

a file storage step: parsing the file names of the massive small files to be stored according to the file naming rule, obtaining the storage positions of the massive small files in the directory index tree, storing the massive small files and recording the corresponding storage directories as original directories, and storing the massive small files redundantly and recording the corresponding storage directories as redundant directories;

a data synchronization and recovery step: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operations.

Preferably, the disk partitioning step includes:

a disk numbering step: numbering the physical disks and recording them as numbered physical disks, wherein the number is i, i = 1, 2, …, N;

a logical partitioning step: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into N/m (rounded up) virtual disks, which are recorded as n logical partitions.
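As an illustrative, non-limiting sketch of the disk numbering and logical partitioning steps above, the following Python fragment groups N numbered disks into N/m (rounded up) virtual disks. The function name `group_disks` and the choice to group consecutively numbered disks are assumptions made for illustration; the text only specifies that every m disks form one virtual disk.

```python
import math


def group_disks(num_disks: int, disks_per_partition: int) -> list[list[int]]:
    """Group physical disks numbered 1..N into ceil(N/m) virtual disks.

    Assumption: consecutively numbered disks share a virtual disk, and the
    last virtual disk may hold fewer than m disks when m does not divide N.
    """
    if num_disks < 1 or disks_per_partition < 1:
        raise ValueError("both arguments must be positive integers")
    n = math.ceil(num_disks / disks_per_partition)  # number of logical partitions
    return [
        list(range(k * disks_per_partition + 1,
                   min((k + 1) * disks_per_partition, num_disks) + 1))
        for k in range(n)
    ]


partitions = group_disks(num_disks=10, disks_per_partition=3)
print(len(partitions), partitions)
```

With N = 10 and m = 3 this yields n = 4 logical partitions, the last of which contains only disk 10.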

Preferably, the step of creating the directory index tree includes:

a hierarchical node establishing step: recording the number of classification categories as P and the number of class members of the i-th classification as Pᵢ, wherein i is a positive integer; the directory index tree then has P+1 layers, the 0th layer is the root node, the other layers are defined according to the logical membership relationship of the classifications, and the number of nodes in each layer is equal to the corresponding Pᵢ;

a step of determining the original directory storage partition: calculating the storage partition number from the file names of the massive small files, wherein the file naming rule is defined as a combination of a characteristic value A, a characteristic value B and a characteristic value C, and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{characteristic value A}) \mathbin{\%} n + \text{characteristic value B}/n + \text{characteristic value C}/n\bigr) / P_1$$

wherein f(characteristic value A) is the sum of the code values of the category name of characteristic value A, % is the modulo operation, P₁ is the number of nodes in the first layer, and n is the number of virtual disks;

a step of determining the redundant directory storage partition: excluding the logical partition of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:

$$F' = \bigl(f(\text{characteristic value A}) \mathbin{\%} (n-1) + \text{characteristic value B}/(n-1) + \text{characteristic value C}/(n-1)\bigr) / P_1$$

wherein f(characteristic value A) is the sum of the code values of the category name of characteristic value A, % is the modulo operation, P₁ is the number of nodes in the first layer, and n-1 is the number of virtual disks remaining after the partition of the original directory is excluded.
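The two partition formulas above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say whether the divisions are integer divisions, how a result outside the range of valid partition numbers is handled, or whether partitions are counted from 0 or 1. The sketch assumes floor division, folds the result into range with a final modulo, and numbers partitions from 0. The function names are hypothetical.

```python
def code_sum(name: str) -> int:
    """f(A): sum of the character code values of a category name."""
    return sum(ord(ch) for ch in name)


def raw_partition(category: str, b: int, c: int, k: int, p1: int) -> int:
    """(f(A) % k + B/k + C/k) / P1 with floor division (assumption)."""
    return (code_sum(category) % k + b // k + c // k) // p1


def original_partition(category: str, b: int, c: int, n: int, p1: int) -> int:
    """Partition number F of the original directory, folded into 0..n-1.

    The final modulo is an assumption; the source does not state how an
    out-of-range result is mapped to a partition.
    """
    return raw_partition(category, b, c, n, p1) % n


def redundant_partition(category: str, b: int, c: int, n: int, p1: int) -> int:
    """Partition number F' of the redundant directory.

    The original partition is excluded, the remaining n-1 partitions are
    renumbered 0..n-2, and the same formula is applied with n-1.
    """
    if n < 2:
        raise ValueError("redundancy needs at least two partitions")
    original = original_partition(category, b, c, n, p1)
    remaining = [p for p in range(n) if p != original]  # renumbered list
    index = raw_partition(category, b, c, n - 1, p1) % (n - 1)
    return remaining[index]


f = original_partition("MWHS", 2017, 8, n=4, p1=3)
f_prime = redundant_partition("MWHS", 2017, 8, n=4, p1=3)
print(f, f_prime)
```

Because the redundant partition is chosen from the list that excludes the original partition, the two numbers can never coincide, which is the property the renumbering step is meant to guarantee.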

The invention also provides a distributed storage system of the mass small files, which comprises the following components:

a disk partitioning module: forming a plurality of virtual disks from the physical disks through logical partitioning;

a directory index tree establishing module: based on the virtual disks, classifying the massive small files according to their file naming rule, and creating a directory index tree according to the classification;

a file storage module: parsing the file names of the massive small files to be stored according to the file naming rule, obtaining the storage positions of the massive small files in the directory index tree, storing the massive small files and recording the corresponding storage directories as original directories, and storing the massive small files redundantly and recording the corresponding storage directories as redundant directories;

a data synchronization and recovery module: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operations.

Preferably, the disk partitioning module includes:

a disk numbering module: numbering the physical disks and recording them as numbered physical disks, wherein the number is i, i = 1, 2, …, N;

a logical partitioning module: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into N/m (rounded up) virtual disks, which are recorded as n logical partitions.

Preferably, the directory index tree establishing module includes:

a hierarchical node establishing module: recording the number of classification categories as P and the number of class members of the i-th classification as Pᵢ, wherein i is a positive integer; the directory index tree then has P+1 layers, the 0th layer is the root node, the other layers are defined according to the logical membership relationship of the classifications, and the number of nodes in each layer is equal to the corresponding Pᵢ;

an original directory storage partition determining module: calculating the storage partition number from the file names of the massive small files, wherein the file naming rule is defined as a combination of a characteristic value A, a characteristic value B and a characteristic value C, and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{characteristic value A}) \mathbin{\%} n + \text{characteristic value B}/n + \text{characteristic value C}/n\bigr) / P_1$$

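a redundant directory storage partition determining module: excluding the logical partition of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:

$$F' = \bigl(f(\text{characteristic value A}) \mathbin{\%} (n-1) + \text{characteristic value B}/(n-1) + \text{characteristic value C}/(n-1)\bigr) / P_1$$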

Preferably, characteristic value A is the data type of the file, the data type being a first characteristic value, a second characteristic value or a third characteristic value; characteristic value B is year information; characteristic value C is month information; and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{category}) \mathbin{\%} n + \text{year}/n + \text{month}/n\bigr) / 3,$$

the storage partition number F' is calculated according to the following formula:

$$F' = \bigl(f(\text{category}) \mathbin{\%} (n-1) + \text{year}/(n-1) + \text{month}/(n-1)\bigr) / 3,$$

wherein f(category) is the sum of the ASCII code values of the letters of the category name, the category being the first characteristic value, the second characteristic value or the third characteristic value.

Preferably, data storage comprises: first parsing the names of the massive small files to obtain their data type, year information and month information, finding the original directories of the massive small files in the directory index tree and storing the files there, and storing duplicate copies of the massive small files in the redundant directories;

data search comprises: first selecting the data type and date of the file to be searched, parsing the date to obtain the year and month of the data to be searched, and then, with the data type, year and month as query conditions, finding the storage directory of the file in the directory index tree and obtaining the file.
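The storage and search procedures above both reduce to mapping a (data type, year, month) triple to a directory in the index tree. The following Python sketch illustrates this under assumptions that are not in the source: the file name pattern `<TYPE>_<YYYYMMDD>_<suffix>`, the root name `root`, and the leaf directory names `original` and `redundant` are all hypothetical.

```python
import re
from pathlib import PurePosixPath

# Hypothetical naming rule: data type, underscore, 8-digit date, underscore, rest.
NAME_PATTERN = re.compile(
    r"^(?P<kind>[A-Z]+)_(?P<year>\d{4})(?P<month>\d{2})\d{2}_"
)


def parse_name(file_name: str) -> tuple[str, int, int]:
    """Extract data type, year and month from a file name."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"file name does not follow the naming rule: {file_name}")
    month = int(match["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in file name: {file_name}")
    return match["kind"], int(match["year"]), month


def storage_dir(kind: str, year: int, month: int,
                redundant: bool = False) -> PurePosixPath:
    """Directory in the index tree for a data type, year and month."""
    leaf = "redundant" if redundant else "original"
    return PurePosixPath("root", kind, f"{year:04d}", f"{month:02d}", leaf)


# Storing: the directory is derived from the file name alone.
kind, year, month = parse_name("MWHS_20170815_0001.dat")
print(storage_dir(kind, year, month))
print(storage_dir(kind, year, month, redundant=True))

# Searching: the same lookup is driven by data type and date instead.
print(storage_dir("VIRR", 2016, 3))
```

No separate metadata table is consulted in either direction, which reflects the point made in the text that the file name itself carries the information needed to locate a file.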

Preferably, the triggering of data synchronization and recovery is performed by a monitor device. The monitor device periodically checks whether the data of the redundant directory are consistent with the data of the original directory; if the redundant directory holds less data than the original directory, the original directory data are synchronized to the redundant directory; otherwise, the monitor device continues its periodic checks. The monitor device also periodically checks whether original directory data have been lost; if the original directory holds less data than the redundant directory, the redundant directory data are restored to the original directory; otherwise, the monitor device continues its periodic checks.
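One pass of such a monitor can be sketched in Python as below. The sketch assumes that "less data" means file names missing from a directory, so it compares directories by file name only; the source does not specify how consistency is determined. The function names `copy_missing` and `check_once` are hypothetical, and the periodic scheduling (for example a loop with a sleep interval) is left out.

```python
import shutil
import tempfile
from pathlib import Path


def copy_missing(source: Path, target: Path) -> list[str]:
    """Copy files that exist in source but not in target; return their names."""
    target.mkdir(parents=True, exist_ok=True)
    source_names = {p.name for p in source.iterdir() if p.is_file()}
    target_names = {p.name for p in target.iterdir() if p.is_file()}
    missing = sorted(source_names - target_names)
    for name in missing:
        shutil.copy2(source / name, target / name)
    return missing


def check_once(original: Path, redundant: Path) -> dict[str, list[str]]:
    """One monitor pass: synchronize, then recover."""
    synced = copy_missing(original, redundant)     # original -> redundant
    recovered = copy_missing(redundant, original)  # redundant -> original
    return {"synced": synced, "recovered": recovered}


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original"
    redundant = Path(tmp) / "redundant"
    original.mkdir()
    redundant.mkdir()
    (original / "a.dat").write_bytes(b"a")   # not yet replicated
    (redundant / "b.dat").write_bytes(b"b")  # lost from the original
    report = check_once(original, redundant)
    print(report)
```

A name-only comparison cannot detect a file that exists in both directories with different contents; comparing sizes or checksums would be a natural extension, but that goes beyond what the text describes.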

Compared with the prior art, the invention has the following beneficial effects:

1. a distributed storage system is built on the principle of block storage to store the data, and an index built from the file names of the cloud image data allows information to be located quickly in massive data;

2. for disaster resistance, the invention adopts a multi-group redundancy strategy that divides the distributed cluster into groups; when data are stored, they are stored in the directory specified by the index tree and also in another group, which gives the storage system strong fault tolerance;

3. the method provides a solution for storing massive small files: the related information of a specific file is obtained directly from its file name, which reduces the metadata storage space while ensuring correctness.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a diagram of a directory index tree structure according to the present invention;

FIG. 2 is an overall architecture diagram of the system of the present invention.

The figures show that:

Root: first-layer root node; MWHS: second-layer humidity data node; MWTS: second-layer temperature data node; VIRR: second-layer infrared data node; Node 1: block 1; Node 2: block 2; Node 3: block 3; Node 4: block 4; Node 5: block 5; Node 6: block 6; Node 7: block 7; Node 8: block 8; Node 9: block 9.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, and all such changes and modifications fall within the scope of the present invention.

The invention discloses a distributed storage method for massive small files, comprising: a disk partitioning step: forming a plurality of virtual disks from the physical disks through logical partitioning; a directory index tree establishing step: based on the virtual disks, classifying the massive small files according to their file naming rule, and creating a directory index tree according to the classification; a file storage step: parsing the file names of the massive small files to be stored according to the file naming rule, obtaining the storage positions of the massive small files in the directory index tree, storing the massive small files and recording the corresponding storage directories as original directories, and storing the massive small files redundantly and recording the corresponding storage directories as redundant directories; and a data synchronization and recovery step: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operations.

Specifically, the disk partitioning step includes: a disk numbering step: numbering the physical disks and recording them as numbered physical disks, wherein the number is i, i = 1, 2, …, N; and a logical partitioning step: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into N/m (rounded up) virtual disks, which are recorded as n logical partitions.

Specifically, the step of establishing the directory index tree includes: a hierarchical node establishing step: recording the number of classification categories as P and the number of class members of the i-th classification as Pᵢ, wherein i is a positive integer; the directory index tree then has P+1 layers, the 0th layer is the root node, the other layers are defined according to the logical membership relationship of the classifications, and the number of nodes in each layer is equal to the corresponding Pᵢ; and a step of determining the original directory storage partition: calculating the storage partition number from the file names of the massive small files, wherein the file naming rule is defined as a combination of a characteristic value A, a characteristic value B and a characteristic value C, and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{characteristic value A}) \mathbin{\%} n + \text{characteristic value B}/n + \text{characteristic value C}/n\bigr) / P_1$$

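Specifically, characteristic value A is the data type of the file, the data type being a first characteristic value, a second characteristic value or a third characteristic value; characteristic value B is year information; characteristic value C is month information; and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{category}) \mathbin{\%} n + \text{year}/n + \text{month}/n\bigr) / 3,$$

and the storage partition number F' is calculated according to the following formula:

$$F' = \bigl(f(\text{category}) \mathbin{\%} (n-1) + \text{year}/(n-1) + \text{month}/(n-1)\bigr) / 3.$$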

Specifically, data storage comprises first parsing the names of the massive small files to obtain their data type, year information and month information, finding the original directories of the massive small files in the directory index tree and storing the files there, and storing duplicate copies of the massive small files in the redundant directories. Data search comprises first selecting the data type and date of the file to be searched, parsing the date to obtain the year and month of the data to be searched, and then, with the data type, year and month as query conditions, finding the storage directory of the file in the directory index tree and obtaining the file. The triggering of data synchronization and recovery is performed by a monitor device. The monitor device periodically checks whether the data of the redundant directory are consistent with the data of the original directory; if the redundant directory holds less data than the original directory, the original directory data are synchronized to the redundant directory; otherwise, the monitor device continues its periodic checks. The monitor device also periodically checks whether original directory data have been lost; if the original directory holds less data than the redundant directory, the redundant directory data are restored to the original directory; otherwise, the monitor device continues its periodic checks.

The invention also discloses a distributed storage system for massive small files, comprising: a disk partitioning module: forming a plurality of virtual disks from the physical disks through logical partitioning; a directory index tree establishing module: based on the virtual disks, classifying the massive small files according to their file naming rule, and creating a directory index tree according to the classification; a file storage module: parsing the file names of the massive small files to be stored according to the file naming rule, obtaining the storage positions of the massive small files in the directory index tree, storing the massive small files and recording the corresponding storage directories as original directories, and storing the massive small files redundantly and recording the corresponding storage directories as redundant directories; and a data synchronization and recovery module: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operations.

Specifically, the disk partitioning module includes: a disk numbering module: numbering the physical disks and recording them as numbered physical disks, wherein the number is i, i = 1, 2, …, N; and a logical partitioning module: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into N/m (rounded up) virtual disks, which are recorded as n logical partitions.

Specifically, the directory index tree establishing module includes: a hierarchical node establishing module: recording the number of classification categories as P and the number of class members of the i-th classification as Pᵢ, wherein i is a positive integer; the directory index tree then has P+1 layers, the 0th layer is the root node, the other layers are defined according to the logical membership relationship of the classifications, and the number of nodes in each layer is equal to the corresponding Pᵢ; and an original directory storage partition determining module: calculating the storage partition number from the file names of the massive small files, wherein the file naming rule is defined as a combination of a characteristic value A, a characteristic value B and a characteristic value C, and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{characteristic value A}) \mathbin{\%} n + \text{characteristic value B}/n + \text{characteristic value C}/n\bigr) / P_1$$

The distributed storage system of the mass small files can be realized through the step flow of the distributed storage method of the mass small files. Those skilled in the art can understand the distributed storage method of the massive small files as a preferred example of the distributed storage system of the massive small files.

When the distributed storage method is applied to typhoon analysis of typhoon cloud images, the file name of a cloud image already carries a large amount of information, such as the data source and the specific acquisition time. The related information of a specific file can therefore be obtained directly from its file name, which greatly reduces the metadata that must be stored and reduces the metadata storage space while ensuring correctness.

First, the disks of the machines are organized using block storage technology, and n virtual disks that provide services externally are formed as logical partitions. All disks are numbered from 1 to N, and every m disks are grouped into one partition, so that the number of logical partitions is N/m rounded up, which is denoted n. This logical partitioning of the disks provides the basis for the subsequent data redundancy.

Next, the directory index tree is established. The typhoon cloud images are divided into three categories according to the different imaging satellites: humidity data (MWHS), temperature data (MWTS) and infrared data (VIRR). An index tree of the storage directories is built from the three data names and their year information. As shown in FIG. 1, the index tree has 5 levels. The first level is the root node, and the second level consists of three nodes: MWHS, MWTS and VIRR. Each second-level node has 18 child nodes, namely the data directories for the years 2000 to 2017, which form the third level. Each third-level node has 12 child nodes, one for each of the 12 months, which form the fourth level. Under each month directory are a storage directory and a redundant directory, which directly store the typhoon cloud image information.

The storage partition of the file directory is calculated as:

$$F = \bigl(f(\text{category}) \mathbin{\%} n + \text{year}/n + \text{month}/n\bigr) / 3$$

where f(category) is the sum of the ASCII code values of the letters of the category name. The partition of the redundant directory is calculated as:

$$F' = \bigl(f(\text{category}) \mathbin{\%} (n-1) + \text{year}/(n-1) + \text{month}/(n-1)\bigr) / 3$$

that is, after the partition of the original directory is removed, the remaining n-1 partitions are renumbered and the partition is then solved for in the same way. This ensures that the original data and the redundant data are not in the same partition; because different partitions are made up of different disks, the original data and the redundant data are not on the same disk, which ensures the disaster resistance of the system.
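The five-level directory index tree of this embodiment can be enumerated with a few lines of Python. The sketch below is illustrative only: the root name `root` and the leaf names `original` and `redundant` are assumptions, while the three categories, the years 2000 to 2017 and the 12 months come from the description.

```python
from itertools import product

CATEGORIES = ("MWHS", "MWTS", "VIRR")   # second level: data type
YEARS = range(2000, 2018)               # third level: 18 year directories
MONTHS = range(1, 13)                   # fourth level: 12 month directories
LEAVES = ("original", "redundant")      # fifth level (hypothetical names)


def index_tree_paths(root: str = "root") -> list[str]:
    """Enumerate every leaf directory of the index tree as a path string."""
    return [
        f"{root}/{category}/{year}/{month:02d}/{leaf}"
        for category, year, month, leaf in product(CATEGORIES, YEARS, MONTHS, LEAVES)
    ]


paths = index_tree_paths()
print(len(paths))   # 3 categories * 18 years * 12 months * 2 leaves
print(paths[0])
print(paths[-1])
```

The tree is fixed by the naming rule rather than by the files it holds, so it can be created once in advance and extended by one year directory per category as new data arrive.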

In the file storage process for typhoon cloud images, a file is stored in its designated directory by parsing its name and year information; at the same time, the data are made redundant according to a fixed rule, and a copy of the data is stored in the designated redundant directory. First, the name of the file is parsed to obtain its data type, year and month; the storage position of the file is then found in the established index tree and the file is stored there; at the same time, the copy directory of the file is obtained and the copy is stored in it. The search process likewise locates a file through the index tree on the basis of the information parsed from its name. The search conditions are the data category and the date. The date is first parsed to obtain the year and month of the required data, and the storage directory of the data is then found in the directory tree with the data type, year and month as conditions, so that the data can be obtained. A monitor device is provided in the system for data synchronization and data recovery. When the monitor device finds that the redundant directory lacks data that are present in the original directory, the synchronization operation of the system is triggered and the missing data are copied to the redundant directory. When the monitor device finds that original directory data have been lost, the data recovery function of the system is triggered and the data of the redundant directory are copied to the original directory.

In typhoon cloud image data processing, the method can be used to store a large number of cloud images: block storage technology is first used to organize the disks of different machines and provide logical partitions for storing data, then an index tree is established and the specific storage directories and redundant directories of the data are planned, so that storage, search, synchronization and recovery of the cloud image data are realized. In small-file processing generally, the method can be used to store massive small files: block storage provides the distributed environment, and an index tree is then established according to specific rules, so that storage, search, synchronization and recovery of the small files are realized.

Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A distributed storage method for massive small files is characterized by comprising the following steps:

data synchronization and recovery step: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation;

the step of establishing the directory index tree comprises the following steps:

$$F = \bigl(f(\text{characteristic value A}) \mathbin{\%} n + \text{characteristic value B}/n + \text{characteristic value C}/n\bigr) / P_1$$

2. The distributed storage method of the massive small files according to claim 1, wherein the disk partitioning step includes:

3. The distributed storage method of a large number of small files according to claim 1, wherein characteristic value A is the data type of a file, the data type being a first characteristic value, a second characteristic value or a third characteristic value, characteristic value B is year information, characteristic value C is month information, and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{category}) \mathbin{\%} n + \text{year}/n + \text{month}/n\bigr) / 3,$$

and the storage partition number F' is calculated according to the following formula:

$$F' = \bigl(f(\text{category}) \mathbin{\%} (n-1) + \text{year}/(n-1) + \text{month}/(n-1)\bigr) / 3,$$

4. The distributed storage method of the massive small files according to claim 3, wherein the data storage for storing the massive small files is to firstly analyze the names of the massive small files, obtain the data types, the year information and the month information of the massive small files, find the original directories of the massive small files in the directory index tree for storage, and store the duplicate files of the massive small files in the redundant directories;

the data searching for the massive small files comprises first selecting the data type and the date of the file to be searched, parsing the date to obtain the year and the month of the data to be searched, and then, with the data type, the year and the month as query conditions, finding the storage directory of the file to be searched in the directory index tree and obtaining the file to be searched.

5. The distributed storage method for massive small files according to claim 1, wherein the data synchronization and recovery are triggered by a monitor device; the monitor device periodically checks whether the data in the redundant directory is consistent with the data in the original directory, and if the redundant directory holds less data than the original directory, synchronizes the original directory data to the redundant directory, and otherwise continues the periodic consistency check; the monitor device also periodically checks whether data in the original directory has been lost, and if the original directory holds less data than the redundant directory, restores the redundant directory data to the original directory, and otherwise continues the periodic loss check.
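
One pass of the monitor's check can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed monitor device: "less data" is interpreted here as files that are present in one directory tree and missing from the other, and the names `relative_files`, `copy_missing` and `check_once` are hypothetical. A timer or scheduler, not shown, would call `check_once` periodically.

```python
import shutil
from pathlib import Path


def relative_files(root: Path) -> set[Path]:
    """Return the relative paths of all regular files below root."""
    if not root.is_dir():
        return set()
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}


def copy_missing(src_root: Path, dst_root: Path) -> list[Path]:
    """Copy every file that exists under src_root but not under dst_root."""
    missing = sorted(relative_files(src_root) - relative_files(dst_root))
    for rel in missing:
        target = dst_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_root / rel, target)
    return missing


def check_once(original_root: Path, redundant_root: Path) -> dict[str, list[Path]]:
    """Run one monitoring pass over the two directory trees.

    Files missing from the redundant directory are synchronized from the
    original directory; files missing from the original directory are
    restored from the redundant directory.
    """
    synced = copy_missing(original_root, redundant_root)
    restored = copy_missing(redundant_root, original_root)
    return {"synced": synced, "restored": restored}
```

Comparing by relative path only detects missing files. Detecting files whose contents differ would need an additional check, such as comparing sizes or checksums, which the claim does not describe.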

6. A distributed storage system for massive small files, comprising:

the data synchronization and recovery module: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation;

the module for establishing the directory index tree comprises:

$$F = \bigl(f(\text{feature value } A) \,\%\, n + \text{feature value } B / n + \text{feature value } C / n\bigr) / P_1$$

wherein f(feature value A) is the sum of the encoded values of the category name of feature value A, % is the modulo operation, P₁ is the number of nodes in the first layer, and n-1 is the number of virtual disks after the original directory is removed.
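
As an illustration only, the partition number formula can be written as a short Python sketch. Several points are assumptions, because the claim text does not settle them: the "encoded values" of the category name are taken to be its UTF-8 byte values, every division is treated as integer division, and the function names `name_code_sum` and `partition_number` are hypothetical.

```python
def name_code_sum(category: str) -> int:
    """f(A): sum of the encoded values of the category name.

    UTF-8 byte values are an assumption; the claim only says "encoded values".
    """
    return sum(category.encode("utf-8"))


def partition_number(category: str, b: int, c: int, n: int, p1: int) -> int:
    """Evaluate F = (f(A) % n + B / n + C / n) / P1.

    b and c stand for feature values B and C (for example year and month).
    Integer division is an assumption made for this sketch.
    """
    if n <= 0 or p1 <= 0:
        raise ValueError("n and p1 must be positive")
    return (name_code_sum(category) % n + b // n + c // n) // p1
```

For example, with category `"radar"`, B = 2018, C = 8, n = 8 and P₁ = 3, this sketch gives f(A) = 522 and F = (2 + 252 + 1) // 3 = 85. Passing n - 1 in place of n evaluates the same expression over the virtual disks that remain after the original directory is removed.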

7. The distributed storage system for massive small files according to claim 6, wherein the disk partitioning module comprises:

8. The distributed storage system for massive small files according to claim 6, wherein the feature value A is the data type of a file, the data type is a first feature value, a second feature value, and a third feature value, the feature value B is year information, the feature value C is month information, and the storage partition number F is calculated according to the following formula:

$$F = \bigl(f(\text{category}) \,\%\, n + \text{year} / n + \text{month} / n\bigr) / 3,$$

$$F' = \bigl(f(\text{category}) \,\%\, (n-1) + \text{year} / (n-1) + \text{month} / (n-1)\bigr) / 3,$$

9. The distributed storage system for massive small files according to claim 8, wherein storing the massive small files comprises first parsing the names of the massive small files to obtain their data type, year information and month information, finding the original directory of the massive small files in the directory index tree and storing the files there, and storing duplicate copies of the massive small files in the redundant directory;

10. The distributed storage system for massive small files according to claim 6, wherein the data synchronization and recovery are triggered by a monitor device; the monitor device periodically checks whether the data in the redundant directory is consistent with the data in the original directory, and if the redundant directory holds less data than the original directory, synchronizes the original directory data to the redundant directory, and otherwise continues the periodic consistency check; the monitor device also periodically checks whether data in the original directory has been lost, and if the original directory holds less data than the redundant directory, restores the redundant directory data to the original directory, and otherwise continues the periodic loss check.