CN109271361B - Distributed storage method and system for massive small files - Google Patents


Info

Publication number
CN109271361B
Authority
CN
China
Prior art keywords
data
directory
small files
storage
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810918747.3A
Other languages
Chinese (zh)
Other versions
CN109271361A (en)
Inventor
唐鹏
谢彬
解维
居晓清
张楠
侯亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 32 Research Institute
Original Assignee
CETC 32 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 32 Research Institute
Priority to CN201810918747.3A
Publication of CN109271361A
Application granted
Publication of CN109271361B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a distributed storage method and system for massive small files. Physical disks are logically partitioned to form a plurality of virtual disks. The small files are classified according to their file naming rule, and a directory index tree is created from the classification. When files are stored, their names are parsed according to the naming rule to obtain their storage positions in the directory index tree; the files are stored there, and the corresponding storage directories are marked as original directories. The files are also stored redundantly, and the corresponding storage directories are marked as redundant directories. When the file data in an original directory and its redundant directory are inconsistent, data synchronization and recovery are triggered. For the distributed storage of large numbers of small files, the invention locates information quickly in mass data by building an index and positioning by file name, ensures reliability through data redundancy, has a simple mechanism and strong fault tolerance, and reduces metadata storage space while ensuring correctness.

Description

Distributed storage method and system for massive small files
Technical Field
The invention relates to the field of distributed storage, in particular to a distributed storage method and a distributed storage system for massive small files, and particularly relates to a distributed data organization method applied to a typhoon analysis system.
Background
With the progress of science and technology, human society is entering an era of digital information explosion. The spread of the internet has brought new growth to traditional industries, industries of all kinds are turning to informatization, and the total amount of data is growing geometrically. Faced with this explosion of data, the first problem to solve is how to store data efficiently. The conventional stand-alone storage model clearly falls far short of practical requirements, and distributed storage has become a necessity for modern information storage.
The meteorological field is also undergoing informatization. Typhoon analysis is an important branch of meteorology, and it too faces problems such as optimizing storage and locating information quickly when storing and analyzing data. Unlike in most industries, the raw typhoon data consists of large numbers of satellite cloud images captured by various satellites, together with manually compiled typhoon track information. The data items are small files, usually less than 10 MB each; after many years of accumulation the number of cloud images has reached tens of millions, and the data volume has reached the terabyte or even petabyte level. How to store this data effectively and how to locate information in it quickly have become urgent problems in the informatization of the meteorological field.
When meteorological station data is large in volume, heterogeneous, and noisy, it should be stored quickly and efficiently so that storage resources are better used; at the same time, retrieval time should be reduced so that specific information can be located quickly in mass data. In addition, because the data is highly significant and valuable for predicting and locating future typhoons, the storage system should ensure not only fast storage and retrieval but also disaster tolerance and timely recovery. Common distributed file systems such as HDFS and GFS perform well in these respects by building a cluster from a large number of individual machines. However, these distributed file systems share a common problem: they store small files inefficiently. For example, HDFS uses the block as its basic storage unit, with a default block size of 64 MB. When a file is stored and its size exceeds the block size, the system splits the file so that it can be stored in several blocks; when a file is much smaller than a block, HDFS still occupies an entire block for it. As a result, HDFS wastes resources severely when storing large numbers of small files. Therefore, common distributed file systems cannot adequately solve the problems of data storage and data positioning in a typhoon analysis system.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a distributed storage method and a distributed storage system for massive small files.
The distributed storage method of the massive small files provided by the invention comprises the following steps:
partitioning a disk: forming a plurality of virtual disks by the physical disks through logical partitioning;
establishing a directory index tree: classifying according to file naming rules of the massive small files based on the virtual disk, and creating a directory index tree according to the classification;
a file storage step: analyzing the file names of the stored massive small files according to a file naming rule, acquiring the storage positions of the massive small files in the directory index tree, storing the massive small files, recording the corresponding storage directories as original directories, performing redundant storage on the massive small files, and recording the corresponding storage directories as redundant directories;
data synchronization and recovery step: and when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation.
Preferably, the disk partitioning step includes:
a disk numbering step: numbering the physical disks and recording them as numbered physical disks, each number being i, where i = 1, 2, …, N;
a logical partitioning step: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into ⌈N/m⌉ virtual disks (N/m rounded up), which are recorded as n logical partitions.
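The grouping in the logical partitioning step can be sketched as follows. This is a minimal illustration only: the function name `group_disks` is hypothetical, and the choice to group consecutively numbered disks (leaving a smaller last partition when m does not divide N) is an assumption, since the text states only the number of resulting partitions.

```python
import math


def group_disks(num_disks: int, disks_per_partition: int) -> list[list[int]]:
    """Group physical disks numbered 1..N into ceil(N/m) logical partitions.

    Assumption: consecutive disk numbers are grouped together; the source
    only states how many partitions result, not which disks go where.
    """
    if num_disks < 1 or disks_per_partition < 1:
        raise ValueError("both arguments must be positive")
    disk_ids = list(range(1, num_disks + 1))  # physical disks numbered 1..N
    partitions = [
        disk_ids[start:start + disks_per_partition]
        for start in range(0, num_disks, disks_per_partition)
    ]
    # The number of logical partitions n equals N/m rounded up.
    assert len(partitions) == math.ceil(num_disks / disks_per_partition)
    return partitions


print(group_disks(10, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

With N = 10 and m = 4 the sketch yields n = 3 logical partitions, the last of which holds only two disks.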
Preferably, the step of creating the directory index tree includes:
a hierarchical node establishing step: recording the number of classification categories as P, and recording the number of class members of the i-th classification as Pi, where i is a positive integer; the directory index tree has P + 1 layers, layer 0 being the root node; the other layers are defined according to the logical containment relationship of the classifications, and the number of nodes in each layer equals the corresponding Pi;
an original directory storage partition determining step: calculating the storage partition number from the file name of the massive small files, the file naming rule being defined as the combination of a feature value A, a feature value B and a feature value C, and the storage partition number F being calculated according to the following formula:
$$F = \bigl(f(A) \mathbin{\%} n + B/n + C/n\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n is the number of virtual disks;
a redundant directory storage partition determining step: excluding the logical partition of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:
$$F' = \bigl(f(A) \mathbin{\%} (n-1) + B/(n-1) + C/(n-1)\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n-1 is the number of virtual disks remaining after the partition of the original directory is excluded.
The invention also provides a distributed storage system of the mass small files, which comprises the following components:
a disk partitioning module: forming a plurality of virtual disks by the physical disks through logical partitioning;
a directory index tree establishing module: classifying according to file naming rules of the massive small files based on the virtual disk, and creating a directory index tree according to the classification;
a file storage module: analyzing the file names of the stored massive small files according to a file naming rule, acquiring the storage positions of the massive small files in the directory index tree, storing the massive small files, recording the corresponding storage directories as original directories, performing redundant storage on the massive small files, and recording the corresponding storage directories as redundant directories;
the data synchronization and recovery module: and when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation.
Preferably, the disk partitioning module includes:
a disk numbering module: numbering the physical disks and recording them as numbered physical disks, each number being i, where i = 1, 2, …, N;
a logical partitioning module: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into ⌈N/m⌉ virtual disks (N/m rounded up), which are recorded as n logical partitions.
Preferably, the module for establishing the directory index tree includes:
a hierarchical node establishing module: recording the number of classification categories as P, and recording the number of class members of the i-th classification as Pi, where i is a positive integer; the directory index tree has P + 1 layers, layer 0 being the root node; the other layers are defined according to the logical containment relationship of the classifications, and the number of nodes in each layer equals the corresponding Pi;
an original directory storage partition determining module: calculating the storage partition number from the file name of the massive small files, the file naming rule being defined as the combination of a feature value A, a feature value B and a feature value C, and the storage partition number F being calculated according to the following formula:
$$F = \bigl(f(A) \mathbin{\%} n + B/n + C/n\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n is the number of virtual disks;
a redundant directory storage partition determining module: excluding the logical partition of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:
$$F' = \bigl(f(A) \mathbin{\%} (n-1) + B/(n-1) + C/(n-1)\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n-1 is the number of virtual disks remaining after the partition of the original directory is excluded.
Preferably, feature value A is the data type of the file, the data type being one of a first feature value, a second feature value, and a third feature value; feature value B is the year information; feature value C is the month information; and the storage partition number F is calculated according to the following formula:
$$F = \bigl(f(\text{category}) \mathbin{\%} n + \text{year}/n + \text{month}/n\bigr) / 3,$$
the storage partition number F' is calculated according to the following formula:
$$F' = \bigl(f(\text{category}) \mathbin{\%} (n-1) + \text{year}/(n-1) + \text{month}/(n-1)\bigr) / 3,$$
where f(category) is the sum of the ASCII code values of the letters of the category name, the category being the first, second, or third feature value.
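The two partition formulas can be sketched in code as follows. This is a hedged illustration, not the patented implementation. The text does not say whether the divisions are integer divisions, how the result is mapped into the range of valid partition numbers, or whether partitions are numbered from 0 or 1. The sketch assumes integer division, a final reduction modulo the number of available partitions, and 0-based partition numbers. All function names are hypothetical.

```python
def name_code_sum(category: str) -> int:
    """f(category): sum of the ASCII code values of the letters in the name."""
    return sum(ord(ch) for ch in category)


def raw_value(category: str, year: int, month: int, n: int, p1: int = 3) -> int:
    """Evaluate (f(category) % n + year/n + month/n) / P1.

    Assumption: every division is an integer division.
    """
    return (name_code_sum(category) % n + year // n + month // n) // p1


def original_partition(category: str, year: int, month: int, n: int) -> int:
    """Partition of the original directory, numbered 0..n-1.

    Assumption: the raw value is reduced modulo n to obtain a valid number.
    """
    return raw_value(category, year, month, n) % n


def redundant_partition(category: str, year: int, month: int, n: int) -> int:
    """Partition of the redundant directory, never equal to the original one."""
    if n < 2:
        raise ValueError("redundancy needs at least two partitions")
    original = original_partition(category, year, month, n)
    # Exclude the original partition; list positions renumber the rest 0..n-2.
    remaining = [p for p in range(n) if p != original]
    index = raw_value(category, year, month, n - 1) % (n - 1)
    return remaining[index]


print(name_code_sum("MWHS"))                 # 319
print(original_partition("MWHS", 2017, 8, 4))   # 1
print(redundant_partition("MWHS", 2017, 8, 4))  # 0
```

Because the redundant partition is drawn from a list that excludes the original one, the two directories can never share a partition, whichever reduction rule is actually used.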
Preferably, data storage comprises first parsing the names of the massive small files to obtain their data types, year information and month information, finding the original directories of the massive small files in the directory index tree and storing the files there, and storing duplicate copies of the massive small files in the redundant directories;
data searching comprises first selecting the data type and date of the file to be searched, parsing the date to obtain the year and month of the data to be searched, and then looking up the storage directory of the file in the directory index tree with the data type, year and month as query conditions, thereby obtaining the file to be searched.
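The storage and search lookups described above can be sketched as follows. The concrete file naming rule is not given in the text, so the pattern `CATEGORY_YYYYMMDD_...` used here is purely hypothetical, as are the directory names `original` and `redundant`, the zero-padded month, and the ISO date format of the query.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Hypothetical naming rule, e.g. "VIRR_20170823_0600.dat".
NAME_PATTERN = re.compile(r"^(MWHS|MWTS|VIRR)_(\d{4})(\d{2})\d{2}_")


def parse_name(file_name: str) -> tuple[str, int, int]:
    """Extract (data type, year, month) from a file name."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"name does not follow the assumed rule: {file_name}")
    category, year, month = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in file name: {file_name}")
    return category, year, month


def month_directory(category: str, year: int, month: int) -> PurePosixPath:
    """Path root/<type>/<year>/<month> in the directory index tree."""
    return PurePosixPath("root") / category / f"{year:04d}" / f"{month:02d}"


def storage_paths(file_name: str) -> tuple[PurePosixPath, PurePosixPath]:
    """Return the original and redundant locations for a file."""
    directory = month_directory(*parse_name(file_name))
    return directory / "original" / file_name, directory / "redundant" / file_name


def search_directory(category: str, iso_date: str) -> PurePosixPath:
    """Directory to search, given a data type and a date such as '2005-03-14'."""
    day = date.fromisoformat(iso_date)
    return month_directory(category, day.year, day.month) / "original"


print(storage_paths("VIRR_20170823_0600.dat")[0])
print(search_directory("MWTS", "2005-03-14"))
```

Both operations derive the directory from the name or query alone, which is why no separate metadata lookup is needed in this scheme.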
Preferably, data synchronization and recovery are triggered by a monitor device. The monitor device periodically checks whether the data in the redundant directory is consistent with the data in the original directory; if the redundant directory holds less data than the original directory, the original directory's data is synchronized to the redundant directory; otherwise, the periodic check continues. The monitor device also periodically checks whether data in the original directory has been lost; if the original directory holds less data than the redundant directory, the redundant directory's data is restored to the original directory; otherwise, the periodic check continues.
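One check cycle of such a monitor can be sketched as follows. This is an illustration under stated assumptions: the text says only that a directory holding "less data" triggers a copy, so the sketch assumes that directories are compared by file name. The function names `copy_missing` and `check_once` are hypothetical.

```python
import shutil
from pathlib import Path


def copy_missing(source: Path, target: Path) -> list[str]:
    """Copy files present in source but absent from target; return their names.

    Assumption: two directories are compared by file name only.
    """
    source.mkdir(parents=True, exist_ok=True)
    target.mkdir(parents=True, exist_ok=True)
    source_names = {p.name for p in source.iterdir() if p.is_file()}
    target_names = {p.name for p in target.iterdir() if p.is_file()}
    missing = sorted(source_names - target_names)
    for name in missing:
        shutil.copy2(source / name, target / name)
    return missing


def check_once(original: Path, redundant: Path) -> dict[str, list[str]]:
    """Run one monitor cycle: synchronize, then recover."""
    synced = copy_missing(original, redundant)    # original -> redundant
    restored = copy_missing(redundant, original)  # redundant -> original
    return {"synced": synced, "restored": restored}
```

A periodic monitor would call `check_once` for each pair of directories at a fixed interval. A production monitor would probably also compare sizes or checksums, which the text does not describe.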
Compared with the prior art, the invention has the following beneficial effects:
1. a distributed storage system is built on the principle of block storage to store the data, and an index built from the file names of the cloud image data allows information to be located quickly in mass data;
2. for disaster tolerance, the invention adopts a multi-group redundancy strategy that divides the distributed cluster into groups; when data is stored in the directory designated by the index tree, it is also stored in another group, which gives the storage system stronger fault tolerance;
3. the method provides a solution for storing massive small files that obtains the relevant information of a specific file directly from its file name, reducing metadata storage space while ensuring correctness.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a diagram of a directory index tree structure according to the present invention;
FIG. 2 is an overall architecture diagram of the system of the present invention.
The figures show that:
root: first-layer root node; MWHS: second-layer humidity data node; MWTS: second-layer temperature data node; VIRR: second-layer infrared data node; node 1: block 1; node 2: block 2; node 3: block 3; node 4: block 4; node 5: block 5; node 6: block 6; node 7: block 7; node 8: block 8; node 9: block 9.
Detailed Description
The present invention is described in detail below with reference to specific examples. The following examples will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention discloses a distributed storage method of massive small files, which comprises the following steps: partitioning a disk: forming a plurality of virtual disks by the physical disks through logical partitioning; establishing a directory index tree: classifying according to file naming rules of the massive small files based on the virtual disk, and creating a directory index tree according to the classification; a file storage step: analyzing the file names of the stored massive small files according to a file naming rule, acquiring the storage positions of the massive small files in the directory index tree, storing the massive small files, recording the corresponding storage directories as original directories, performing redundant storage on the massive small files, and recording the corresponding storage directories as redundant directories; data synchronization and recovery step: and when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation.
Specifically, the disk partitioning step includes: a disk numbering step: numbering the physical disks and recording them as numbered physical disks, each number being i, where i = 1, 2, …, N; a logical partitioning step: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into ⌈N/m⌉ virtual disks (N/m rounded up), which are recorded as n logical partitions;
specifically, the step of creating the directory index tree includes: a hierarchical node establishing step: recording the number of classification categories as P, and recording the number of class members of the i-th classification as Pi, where i is a positive integer; the directory index tree has P + 1 layers, layer 0 being the root node; the other layers are defined according to the logical containment relationship of the classifications, and the number of nodes in each layer equals the corresponding Pi; an original directory storage partition determining step: calculating the storage partition number from the file name of the massive small files, the file naming rule being defined as the combination of a feature value A, a feature value B and a feature value C, and the storage partition number F being calculated according to the following formula:
$$F = \bigl(f(A) \mathbin{\%} n + B/n + C/n\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n is the number of virtual disks;
a redundant directory storage partition determining step: excluding the logical partition of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:
$$F' = \bigl(f(A) \mathbin{\%} (n-1) + B/(n-1) + C/(n-1)\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n-1 is the number of virtual disks remaining after the partition of the original directory is excluded.
Specifically, feature value A is the data type of the file, the data type being one of a first feature value, a second feature value, and a third feature value; feature value B is the year information; feature value C is the month information; and the storage partition number F is calculated according to the following formula:
$$F = \bigl(f(\text{category}) \mathbin{\%} n + \text{year}/n + \text{month}/n\bigr) / 3,$$
the storage partition number F' is calculated according to the following formula:
$$F' = \bigl(f(\text{category}) \mathbin{\%} (n-1) + \text{year}/(n-1) + \text{month}/(n-1)\bigr) / 3,$$
where f(category) is the sum of the ASCII code values of the letters of the category name, the category being the first, second, or third feature value.
Specifically, data storage comprises first parsing the names of the massive small files to obtain their data types, year information and month information, finding the original directories of the massive small files in the directory index tree and storing the files there, and storing duplicate copies of the massive small files in the redundant directories. Data searching comprises first selecting the data type and date of the file to be searched, parsing the date to obtain the year and month of the data to be searched, and then looking up the storage directory of the file in the directory index tree with the data type, year and month as query conditions, thereby obtaining the file to be searched. Data synchronization and recovery are triggered by a monitor device. The monitor device periodically checks whether the data in the redundant directory is consistent with the data in the original directory; if the redundant directory holds less data than the original directory, the original directory's data is synchronized to the redundant directory; otherwise, the periodic check continues. The monitor device also periodically checks whether data in the original directory has been lost; if the original directory holds less data than the redundant directory, the redundant directory's data is restored to the original directory; otherwise, the periodic check continues.
The invention also discloses a distributed storage system of the mass small files, which comprises the following components: a disk partitioning module: forming a plurality of virtual disks by the physical disks through logical partitioning; the module for establishing the directory index tree comprises the following steps: classifying according to file naming rules of the massive small files based on the virtual disk, and creating a directory index tree according to the classification; a file storage module: analyzing the file names of the stored massive small files according to a file naming rule, acquiring the storage positions of the massive small files in the directory index tree, storing the massive small files, recording the corresponding storage directories as original directories, performing redundant storage on the massive small files, and recording the corresponding storage directories as redundant directories; the data synchronization and recovery module: and when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation.
Specifically, the disk partitioning module includes: a disk numbering module: numbering the physical disks and recording them as numbered physical disks, each number being i, where i = 1, 2, …, N; a logical partitioning module: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into ⌈N/m⌉ virtual disks (N/m rounded up), which are recorded as n logical partitions.
Specifically, the module for establishing the directory index tree includes: a hierarchical node establishing module: recording the number of classification categories as P, and recording the number of class members of the i-th classification as Pi, where i is a positive integer; the directory index tree has P + 1 layers, layer 0 being the root node; the other layers are defined according to the logical containment relationship of the classifications, and the number of nodes in each layer equals the corresponding Pi; an original directory storage partition determining module: calculating the storage partition number from the file name of the massive small files, the file naming rule being defined as the combination of a feature value A, a feature value B and a feature value C, and the storage partition number F being calculated according to the following formula:
$$F = \bigl(f(A) \mathbin{\%} n + B/n + C/n\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n is the number of virtual disks;
a redundant directory storage partition determining module: excluding the logical partition of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:
$$F' = \bigl(f(A) \mathbin{\%} (n-1) + B/(n-1) + C/(n-1)\bigr) / P_1$$
where f(A) is the sum of the code values of the category name of feature value A, % is the modulo operation, P_1 is the number of nodes in the first layer, and n-1 is the number of virtual disks remaining after the partition of the original directory is excluded.
The distributed storage system of the mass small files can be realized through the step flow of the distributed storage method of the mass small files. Those skilled in the art can understand the distributed storage method of the massive small files as a preferred example of the distributed storage system of the massive small files.
When the distributed storage method is applied to typhoon analysis of typhoon cloud images, the file name of a cloud image carries a great deal of information, such as the data source and the exact acquisition time. Much less metadata therefore needs to be stored when files are saved, and the relevant information of a specific file is obtained directly from its file name, which reduces metadata storage space while ensuring correctness. First, the disks of the machines are organized using block storage techniques, forming n virtual disks that provide services externally in the form of logical partitions. All disks are numbered 1 through N, and every m disks are grouped into one partition, so the number of logical partitions is N/m rounded up, denoted n. The logical partitioning of the disks provides the basis for subsequent data redundancy. Next, a directory index tree is established. The typhoon cloud images are divided into three categories according to the satellite instrument that captured them: humidity data (MWHS), temperature data (MWTS) and infrared data (VIRR). An index tree of storage directories is built from these three data names and their year information. As shown in FIG. 1, the index tree has 5 levels. The first level is the root node, and the second level consists of three nodes: MWHS, MWTS and VIRR. Each second-level node has 18 child nodes, namely the data directories for the years 2000 to 2017, which form the third level. Each third-level node has 12 child nodes, one for each of the 12 months, which form the fourth level.
Under each month directory are a storage directory and a redundant directory, which directly hold the typhoon cloud image files. The partition of the storage directory is calculated as $F = \bigl(f(\text{category}) \mathbin{\%} n + \text{year}/n + \text{month}/n\bigr) / 3$, where f(category) is the sum of the ASCII code values of the letters of the category name. The partition of the redundant directory is calculated as $F' = \bigl(f(\text{category}) \mathbin{\%} (n-1) + \text{year}/(n-1) + \text{month}/(n-1)\bigr) / 3$; that is, after the partition of the original directory is removed, the remaining n-1 partitions are renumbered and the same calculation is applied. This ensures that the original data and the redundant data are never in the same partition. Because different partitions consist of different disks, the original and redundant data are never on the same disk, which ensures the disaster tolerance of the system.
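The directory index tree of this embodiment (three data categories, the years 2000 to 2017, and twelve months) can be enumerated with a short sketch. The root name `root`, the `/` separator and the zero-padded month are assumptions made for illustration; the function name `month_directories` is hypothetical.

```python
from itertools import product

CATEGORIES = ("MWHS", "MWTS", "VIRR")  # second level: humidity, temperature, infrared
YEARS = range(2000, 2018)              # third level: 2000..2017, 18 nodes per category
MONTHS = range(1, 13)                  # fourth level: 12 nodes per year


def month_directories(root: str = "root") -> list[str]:
    """List every fourth-level (month) directory of the index tree."""
    return [
        f"{root}/{category}/{year}/{month:02d}"
        for category, year, month in product(CATEGORIES, YEARS, MONTHS)
    ]


directories = month_directories()
print(len(directories))  # 648 = 3 * 18 * 12
print(directories[0])    # root/MWHS/2000/01
```

Each of these month directories would then receive its storage directory and redundant directory as the fifth level.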
In the file storage process for typhoon cloud images, a file is stored in the designated directory by parsing its name and year information; at the same time, the data is made redundant according to a fixed rule, and a copy is stored in the designated redundant directory. The name of the file is first parsed to obtain its data type, year and month; the storage position of the file is then found in the established index tree, and the file is stored there. At the same time, the copy directory of the file is obtained and the copy is stored. The search process locates data through the index tree using the same information that is parsed from file names. The search conditions are the data category and the date. The date is first parsed to obtain the year and month of the required data, and the storage directory is then found in the directory tree using the data type, year and month as conditions, yielding the data. A monitor device is provided in the system for data synchronization and data recovery. When the monitor device finds that the redundant directory lacks data present in the original directory, it triggers the synchronization operation of the system, and the missing data is copied to the redundant directory. When the monitor device finds that data in the original directory has been lost, it triggers the data recovery function of the system, and the data in the redundant directory is copied to the original directory.
In typhoon cloud image data processing, the method can be used to store large numbers of cloud images: block storage techniques are first used to organize the disks of different machines and provide logical partitions for storing data; an index tree is then built, and the specific storage directories and redundant directories of the data are planned, so that cloud image data can be stored, searched, synchronized and recovered. In small-file processing generally, the method can be used to store massive small files: block storage provides the distributed environment, and an index tree built according to specific rules enables small files to be stored, searched, synchronized and recovered.
Those skilled in the art will appreciate that, in addition to implementing the system, the device and their modules provided by the present invention purely as computer-readable program code, the same functions can be implemented entirely by logically programming the method steps, so that the system, the device and their modules take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and their modules provided by the present invention can be regarded as a hardware component, and the modules they include for implementing various programs can also be regarded as structures within the hardware component; modules for performing various functions may also be regarded both as software programs implementing the method and as structures within the hardware component.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A distributed storage method for massive small files is characterized by comprising the following steps:
a disk partitioning step: forming a plurality of virtual disks from the physical disks through logical partitioning;
a directory index tree establishing step: based on the virtual disks, classifying according to the file naming rule of the massive small files, and creating a directory index tree according to the classification;
a file storage step: analyzing the file names of the stored massive small files according to a file naming rule, acquiring the storage positions of the massive small files in the directory index tree, storing the massive small files, recording the corresponding storage directories as original directories, performing redundant storage on the massive small files, and recording the corresponding storage directories as redundant directories;
data synchronization and recovery step: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation;
wherein the directory index tree establishing step comprises:
a hierarchical node establishing step: recording the number of classification categories as P and the number of class members of the i-th classification as Pi, where i is a positive integer; the directory index tree is determined to have P+1 layers, layer 0 being the root node and the other layers being defined according to the logical membership relationship of the classifications, with the number of nodes in each layer equal to the corresponding Pi;
determining an original directory storage partition step: calculating the number of the storage partition according to the file name of the mass small files, defining the file name rule as the combination of a characteristic value A, a characteristic value B and a characteristic value C, and calculating the number F of the storage partition according to the following formula:
F = (f(characteristic value A) % n + characteristic value B / n + characteristic value C / n) / P1
wherein f(characteristic value A) is the sum of the code values of the category name of characteristic value A, % is the modulo operation, P1 is the number of nodes in the first layer, and n is the number of virtual disks;
determining a redundant directory storage partition: excluding the logical partitions of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:
F' = (f(characteristic value A) % (n-1) + characteristic value B / (n-1) + characteristic value C / (n-1)) / P1
wherein f(characteristic value A) is the sum of the code values of the category name of characteristic value A, % is the modulo operation, P1 is the number of nodes in the first layer, and n-1 is the number of virtual disks remaining after the logical partition of the original directory is excluded.
2. The distributed storage method for the massive small files according to claim 1, wherein the disk partitioning step comprises:
a disk numbering step: numbering the physical disks and recording them as numbered physical disks, the number being i, i = 1, 2, …, N;
a logical partitioning step: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into N/m (rounded up) virtual disks, which are recorded as n logical partitions.
3. The distributed storage method for the massive small files according to claim 1, wherein the characteristic value A is the data type of a file, the data type comprising a first characteristic value, a second characteristic value and a third characteristic value, the characteristic value B is year information, the characteristic value C is month information, and the storage partition number F is calculated according to the following formula:
F = (f(category) % n + year / n + month / n) / 3,
the storage partition number F' is calculated according to the following formula:
F' = (f(category) % (n-1) + year / (n-1) + month / (n-1)) / 3,
wherein f(category) is the sum of the ASCII code values of each letter of the first characteristic value, the second characteristic value and the third characteristic value, respectively.
4. The distributed storage method for the massive small files according to claim 3, wherein data storage of the massive small files comprises first parsing the names of the massive small files to obtain their data type, year information and month information, finding the original directory of the massive small files in the directory index tree and storing them there, and storing the duplicate files of the massive small files in the redundant directory;
data search of the massive small files comprises first selecting the data type and the date of the file to be searched, parsing the date to obtain the year and the month of the data to be searched, and finding the storage directory of the file to be searched in the directory index tree using the data type, the year and the month as query conditions, thereby obtaining the file to be searched.
5. The distributed storage method for the massive small files according to claim 1, wherein the data synchronization and recovery are triggered by a monitor device; the monitor device periodically checks whether the data of the redundant directory is consistent with the data of the original directory, and if the redundant directory has less data than the original directory, the original directory data is synchronized to the redundant directory, otherwise the monitor device continues the periodic check; the monitor device also periodically checks whether data of the original directory has been lost, and if the original directory has less data than the redundant directory, the redundant directory data is restored to the original directory, otherwise the monitor device continues the periodic check.
6. A distributed storage system for a large number of small files, comprising:
a disk partitioning module: forming a plurality of virtual disks by the physical disks through logical partitioning;
a directory index tree establishing module: based on the virtual disks, classifying according to the file naming rule of the massive small files, and creating a directory index tree according to the classification;
a file storage module: analyzing the file names of the stored massive small files according to a file naming rule, acquiring the storage positions of the massive small files in the directory index tree, storing the massive small files, recording the corresponding storage directories as original directories, performing redundant storage on the massive small files, and recording the corresponding storage directories as redundant directories;
the data synchronization and recovery module: when the file data in the original directory and the redundant directory are inconsistent, triggering data synchronization and recovery operation;
wherein the directory index tree establishing module comprises:
a hierarchical node establishing module: recording the number of classification categories as P and the number of class members of the i-th classification as Pi, where i is a positive integer; the directory index tree is determined to have P+1 layers, layer 0 being the root node and the other layers being defined according to the logical membership relationship of the classifications, with the number of nodes in each layer equal to the corresponding Pi;
determining an original directory storage partition module: calculating the number of the storage partition according to the file name of the mass small files, defining the file name rule as the combination of a characteristic value A, a characteristic value B and a characteristic value C, and calculating the number F of the storage partition according to the following formula:
F = (f(characteristic value A) % n + characteristic value B / n + characteristic value C / n) / P1
wherein f(characteristic value A) is the sum of the code values of the category name of characteristic value A, % is the modulo operation, P1 is the number of nodes in the first layer, and n is the number of virtual disks;
determining a redundant directory storage partition module: excluding the logical partitions of the original directory, renumbering the remaining n-1 logical partitions, and calculating the storage partition number F' according to the following formula:
F' = (f(characteristic value A) % (n-1) + characteristic value B / (n-1) + characteristic value C / (n-1)) / P1
wherein f(characteristic value A) is the sum of the code values of the category name of characteristic value A, % is the modulo operation, P1 is the number of nodes in the first layer, and n-1 is the number of virtual disks remaining after the logical partition of the original directory is excluded.
7. The distributed storage system for the mass small files according to claim 6, wherein the disk partitioning module comprises:
a disk numbering module: numbering the physical disks and recording them as numbered physical disks, the number being i, i = 1, 2, …, N;
a logical partitioning module: grouping every m numbered physical disks into one virtual disk, so that the physical disks are divided into N/m (rounded up) virtual disks, which are recorded as n logical partitions.
8. The distributed storage system for the massive small files according to claim 6, wherein the characteristic value A is the data type of a file, the data type comprising a first characteristic value, a second characteristic value and a third characteristic value, the characteristic value B is year information, the characteristic value C is month information, and the storage partition number F is calculated according to the following formula:
F = (f(category) % n + year / n + month / n) / 3,
the storage partition number F' is calculated according to the following formula:
F' = (f(category) % (n-1) + year / (n-1) + month / (n-1)) / 3,
wherein f(category) is the sum of the ASCII code values of each letter of the first characteristic value, the second characteristic value and the third characteristic value, respectively.
9. The distributed storage system for the massive small files according to claim 8, wherein data storage of the massive small files comprises first parsing the names of the massive small files to obtain their data type, year information and month information, finding the original directory of the massive small files in the directory index tree and storing them there, and storing the duplicate files of the massive small files in the redundant directory;
data search of the massive small files comprises first selecting the data type and the date of the file to be searched, parsing the date to obtain the year and the month of the data to be searched, and finding the storage directory of the file to be searched in the directory index tree using the data type, the year and the month as query conditions, thereby obtaining the file to be searched.
10. The distributed storage system for the massive small files according to claim 6, wherein the data synchronization and recovery are triggered by a monitor device; the monitor device periodically checks whether the data of the redundant directory is consistent with the data of the original directory, and if the redundant directory has less data than the original directory, the original directory data is synchronized to the redundant directory, otherwise the monitor device continues the periodic check; the monitor device also periodically checks whether data of the original directory has been lost, and if the original directory has less data than the redundant directory, the redundant directory data is restored to the original directory, otherwise the monitor device continues the periodic check.
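The partition-number formulas recited in claims 1 to 3 (and 6 to 8) can be evaluated with a short Python sketch. Several points are assumptions rather than statements of the claims: every division is treated as integer (floor) division, P1 defaults to 3 as in claims 3 and 8, the category name `IR` is a made-up example, and all function names are hypothetical. The claims do not state how the computed number is mapped onto an actual partition index, so the sketch returns the raw value.

```python
def code_sum(name: str) -> int:
    """f(category): sum of the character code values of the category name."""
    return sum(ord(ch) for ch in name)


def virtual_disk_count(num_physical: int, disks_per_virtual: int) -> int:
    """N/m rounded up: number of virtual disks built from N physical disks."""
    return (num_physical + disks_per_virtual - 1) // disks_per_virtual


def original_partition(category: str, year: int, month: int,
                       n: int, p1: int = 3) -> int:
    """F = (f(A) % n + B/n + C/n) / P1, with integer division assumed."""
    return (code_sum(category) % n + year // n + month // n) // p1


def redundant_partition(category: str, year: int, month: int,
                        n: int, p1: int = 3) -> int:
    """F' uses n-1: the partitions left after excluding the original one."""
    if n < 2:
        raise ValueError("a redundant partition needs at least two virtual disks")
    m = n - 1
    return (code_sum(category) % m + year // m + month // m) // p1


if __name__ == "__main__":
    n = virtual_disk_count(10, 3)
    print(n, original_partition("IR", 2018, 8, n),
          redundant_partition("IR", 2018, 8, n))
```

Under these assumptions, 10 physical disks grouped in threes give n = 4 virtual disks, and the category `IR` with year 2018 and month 8 gives F = 169 and F' = 225, which shows that some further reduction is needed before the value can select one of n partitions.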
CN201810918747.3A 2018-08-13 2018-08-13 Distributed storage method and system for massive small files Active CN109271361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810918747.3A CN109271361B (en) 2018-08-13 2018-08-13 Distributed storage method and system for massive small files

Publications (2)

Publication Number Publication Date
CN109271361A CN109271361A (en) 2019-01-25
CN109271361B true CN109271361B (en) 2020-07-24

Family

ID=65153772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810918747.3A Active CN109271361B (en) 2018-08-13 2018-08-13 Distributed storage method and system for massive small files

Country Status (1)

Country Link
CN (1) CN109271361B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110515909B (en) * 2019-08-29 2022-05-13 北京字节跳动网络技术有限公司 File storage method and device, electronic equipment and computer storage medium
CN110795284B (en) * 2019-10-25 2022-03-22 浪潮电子信息产业股份有限公司 Data recovery method, device and equipment and readable storage medium
CN112084250A (en) * 2020-09-15 2020-12-15 深圳市宝能投资集团有限公司 Data storage method, data query method and electronic equipment
CN114442937B (en) * 2021-12-31 2023-03-28 北京云宽志业网络技术有限公司 File caching method and device, computer equipment and storage medium
CN114647388B (en) * 2022-05-24 2022-08-12 杭州优云科技有限公司 Distributed block storage system and management method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant