CN117807045A - Multistage file system and construction method thereof

Info

Publication number: CN117807045A
Application number: CN202410232886.6A
Authority: CN
Inventors: 郑前武
Original assignee: Sco Digital Technology Co ltd
Current assignee: Sco Digital Technology Co ltd
Priority date: 2024-03-01
Filing date: 2024-03-01
Publication date: 2024-04-02
Anticipated expiration: 2044-03-01

Abstract

The invention discloses a multi-level file system and a construction method thereof, belonging to the technical field of file storage. The system comprises: a file heat service module, which collects metadata information of files written into the primary storage device together with capacity information of the primary storage device, calculates a file heat value, and divides the files into hot files and cold files according to that value; a primary storage device, which handles new file data writes and hot file reads; a secondary storage device, which stores cold files; and a file migration synchronization service module, which migrates cold files from the primary storage device to the secondary storage device and, if the remaining capacity of the primary storage device is still smaller than a preset reserved capacity threshold after the cold files have been migrated, continues migrating the file with the smallest file heat value to the secondary storage device. The invention separates hot and cold files and stores them separately, keeps the primary storage device running efficiently over long periods with stable capacity, and balances file access performance against economy.

Description

Multistage file system and construction method thereof

Technical Field

The invention belongs to the technical field of file storage, and particularly relates to a multi-level file system and a construction method thereof.

Background

A file storage system is a system for storing, organizing, and managing file data, and mainly includes a storage medium, a file system, and a file access interface. It provides a structured way to store and access files so that users and applications can conveniently manage data. The file system is a mechanism used by computers to organize and store data, defines the storage format, access mode and management rules of the data, is used for organizing, storing, accessing and managing the data, and has the functions of security and authority control, fault tolerance and recovery.

In service scenarios that demand high file read and write speeds, the only option is to select better storage devices to improve storage capability. Because high disk I/O write performance is generally required, the storage devices must be equipped with high-speed SSD hard disks, and several storage devices must be deployed to ensure high reliability. This storage approach is costly and has limited expansion and bearing capacity, so it struggles to meet the needs of such scenarios.

Therefore, a file system which is convenient to expand and can adapt to service scenes with high requirements on file reading and writing speeds is needed.

Disclosure of Invention

In view of this, the present invention provides a multi-level file system and a method for constructing the same, which are used for solving the problem that the existing file storage system cannot be well adapted to the service scenario with high requirement on the file read-write speed.

In a first aspect of the present invention, a multi-level file system is disclosed, the system comprising:

file heat service module: used for collecting metadata information of files written into the primary storage device, calculating file heat values according to the metadata information of the files and capacity information of the primary storage device, and dividing the files into hot files and cold files according to the file heat values;

primary storage device: for performing new file data writes and hot file reads;

secondary storage device: for storing cold files;

file migration synchronization service module: used for migrating cold files on the primary storage device to the secondary storage device; if, after the cold files are migrated, the remaining capacity on the primary storage device is smaller than a preset reserved capacity threshold, continuing to migrate the file with the smallest file heat value to the secondary storage device; and deleting the corresponding files on the primary storage device after the migration is completed;

a file access module: for providing write capability of the primary storage device through NFS services, and read capability of the primary and secondary storage devices using the nginx service.

On the basis of the above technical solution, preferably, the system further includes:

asynchronous compression module: used for, while the file migration synchronization service module is executing, asynchronously compressing a cold file to be migrated and then storing it in the secondary storage device when its file size is larger than a preset file occupation threshold and its file type is within a preset file type range.

On the basis of the technical scheme, preferably, the primary storage device adopts SSD hard disk device, and the capacity of the secondary storage device is larger than that of the primary storage device;

the primary storage device and the secondary storage device adopt a dual-computer mutual standby mode.

On the basis of the above technical solution, preferably, the metadata information of the file includes: file size, file write time stamp, file modification time stamp, file access time stamp, file read frequency and preset file main path weight;

the capacity information of the primary storage device includes: the current remaining capacity and the total storage capacity of the primary storage device.

On the basis of the above technical solution, preferably, calculating the file heat value according to metadata information of the file and capacity information of the primary storage device, and dividing the file into a hot file and a cold file according to the file heat value specifically includes:

the file heat value H is calculated using the following formula:

wherein w₁, w₂, w₃, w₄, w₅ and w₆ are all weight coefficients of the two-dimensional space model, η_f is the file read frequency, η_max is a preset maximum read frequency, μ_f is the file main path weight, μ_max is the preset maximum file main path weight, t is the current time, T_r is the file access time, T_m is the file modification time, T_w is the file write time, c_t1 is the current remaining capacity of the primary storage device, and C₁ is the total storage capacity of the primary storage device;

and when the file heat value H is larger than a preset heat threshold, the file is a hot file, and otherwise, the file is a cold file.

On the basis of the above technical solution, preferably, the preset file type range is composed of a plurality of preset file types;

the file type of the cold file to be migrated is determined according to the suffix name;

and for the file without the file suffix, performing file type identification through a pre-trained neural network model.

On the basis of the technical scheme, preferably, the file access module reads the target file from the primary storage device preferentially;

if the target file cannot be found on the primary storage device, automatically transferring to a corresponding directory of the secondary storage device to find the target file;

if the searched target file is a file which is already compressed, triggering decompression processing, and reading the decompressed file in a file stream mode.

On the basis of the technical scheme, preferably, the multi-level file system is applied to a Kubernetes cluster scene;

when a user requests to write file data, forwarding the user request to POD service in the Kubernetes cluster;

the POD service forwards the user request to the file access module, and the content written by the user request is written into the primary storage device through the NFS service;

the file access module asynchronously informs the metadata information of the file to the file heat service module, and updates the metadata information of the file;

when a user requests to read a target file, forwarding the user request to POD service in a Kubernetes cluster;

the POD service forwards the user request to a file access module of a multi-level file system, and searches a target file of the user request through the file access module;

the file access module asynchronously informs the metadata information of the file to the file heat service module, and updates the metadata information of the file.

In a second aspect of the present invention, a construction method of a multi-level file system is disclosed, the method comprising the following steps:

dividing the file system into a two-level storage structure, wherein the two-level storage structure comprises a primary storage device and a secondary storage device;

writing new file data into the primary storage device, collecting metadata information of the written file and capacity information of the primary storage device, calculating a file heat value according to the metadata information of the file and the capacity information of the primary storage device, and dividing the written file into a hot file and a cold file according to the file heat value;

migrating the cold file on the primary storage device to the secondary storage device, if the residual capacity on the primary storage device is smaller than the preset reserved capacity threshold after the cold file migration is completed, continuously migrating the file with the smallest file heat value to the secondary storage device, and deleting the corresponding file on the primary storage device after the migration is completed;

in the process of migrating the cold files on the primary storage device to the secondary storage device, when the file size of the cold files to be migrated is larger than a preset file occupation threshold value and the file type is in a preset file type range, asynchronously compressing the cold files to be migrated and then storing the cold files to the secondary storage device.

Compared with the prior art, the invention has the following beneficial effects:

1) The invention collects metadata information of files written into the primary storage device and capacity information of the primary storage device, calculates file heat values from this information, and divides the written files into hot files and cold files accordingly, so that hot and cold files are separated and stored separately. Reads and writes go preferentially to the primary storage device, which has high-speed access capability, giving high-speed reading and writing of files, logs and other data. Once a hot file turns cold, it is migrated in time, taking the remaining capacity into account, to the larger-capacity secondary storage device. The primary storage device can therefore run efficiently for a long time with stable capacity, balancing file access performance and economy.

2) While the file migration synchronization service module migrates files, the file type is identified automatically. When the size of a cold file to be migrated exceeds the preset file occupation threshold and its type is one of the file types preset in the system, the file storage space optimization strategy is triggered: the cold file is compressed asynchronously and then stored in the secondary storage device. When the file type cannot be identified directly from the file suffix, it is identified by the pre-trained neural network model. This realizes migration and compressed storage of special files, reduces the space that rarely accessed special files occupy on the secondary storage device, and further improves the space utilization of the secondary storage device.

3) The file heat value is calculated from the file read frequency, the file main path weight, the current time, the file access, modification and write times, the current remaining capacity of the primary storage device, and other information. It thus accounts for both file access frequency and file importance, as well as time-factor decay and a remaining-capacity penalty, and different weight coefficients can be configured according to actual conditions. The file heat value can therefore be estimated more accurately and its changes tracked in time. Files whose heat value has dropped below the heat threshold, and files that must be moved because the remaining capacity of the primary storage device is insufficient, are migrated to the secondary storage device in time, which ensures efficient operation of the primary storage device and complete storage of the migrated files, avoiding file loss.

4) When a file is looked up, the target file is read preferentially from the primary storage device, which serves both reads and writes; if it cannot be found there, the lookup automatically falls back to the corresponding directory of the secondary storage device. For a compressed target file, decompression is triggered automatically and the file is read as a file stream while it is being decompressed, which improves file access efficiency.

5) The multi-level file system is easy to expand and maintain. New secondary storage devices can be added at any time, making it possible to store massive numbers of data files. After a new secondary storage device is added, the primary storage device is configured to synchronize with it; only the file service configuration needs to be modified by adding one configuration section, which can be generated automatically from the meta-information maintained by the management end and takes effect automatically on loading. The original file data does not need to be moved, which is highly convenient.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of the working principle of a multi-level file system according to the present invention in a Kubernetes cluster scenario.

Detailed Description

The following description of the embodiments of the present invention will clearly and fully describe the technical aspects of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.

The invention takes a Kubernetes cluster scene as an example to describe a multi-level file system provided by the invention.

Referring to fig. 1, fig. 1 is a schematic diagram of a working principle of a multi-level file system provided in the Kubernetes cluster scenario according to the present invention.

The invention provides a multi-level file system comprising a file heat service module, a primary storage device, a secondary storage device, a file migration synchronization service module, an asynchronous compression module and a file access module. The Kubernetes cluster and each storage device are communicatively connected through optical fiber.

The file heat service module is used for collecting metadata information of files written into the primary storage device and capacity information of the primary storage device, calculating file heat values according to the metadata information of the files and the capacity information of the primary storage device, and dividing the files into hot files and cold files according to the file heat values.

The metadata information of the file mainly comprises: file size, file write time stamp, file modification time stamp, file access time stamp, file read frequency and preset file main path weight; the capacity information of the primary storage device includes: the current remaining capacity and the total storage capacity of the primary storage device.

The formula for calculating the file heat value H according to the metadata information of the file and the capacity information of the primary storage device is as follows:

wherein w₁, w₂, w₃, w₄, w₅ and w₆ are all weight coefficients of the two-dimensional space model, η_f is the file read frequency, η_max is a preset maximum read frequency, μ_f is the file main path weight, μ_max is the preset maximum file main path weight, t is the current time, T_r is the file access time, T_m is the file modification time, T_w is the file write time, c_t1 is the current remaining capacity of the primary storage device, and C₁ is the total storage capacity of the primary storage device.

The calculation formula of the file heat value H takes into account the file read frequency, the file main path weight, the current time, the file access, modification and write times, the current remaining capacity of the primary storage device, and other information. The first term on the right-hand side divides the read frequency of the file by the maximum read frequency, normalizing the file read frequency to the range [0, 1]. In the second term, the file main path weight μ_f is set according to the importance of the file and is divided by the maximum file main path weight, normalizing it to the range [0, 1]. The third part takes into account the access time, modification time and write time of the file: each is subtracted from the current time to introduce a time-factor decay that weakens the influence of the past on the file heat value, and the greater the decay, the smaller the influence of the past on the heat value. The last term is a penalty on the remaining storage capacity: the lower the remaining capacity, the larger the penalty.

This way of calculating the file heat value accounts for both file access frequency and importance, as well as time-factor decay and a remaining-capacity penalty, and allows different weight coefficients to be configured according to actual conditions, so that the file heat value can be estimated more accurately, its changes tracked in time, and cold and hot files separated.
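The heat value calculation described above can be illustrated with a minimal Python sketch. The text does not spell out the functional form of the time decay or of the capacity penalty, so the sketch assumes an exponential decay with a hypothetical time constant `tau` and a linear penalty `1 - remaining / total`; the names `FileMeta`, `heat_value`, `is_hot` and all default values are illustrative assumptions rather than part of the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class FileMeta:
    read_freq: float    # eta_f: file read frequency
    path_weight: float  # mu_f: preset file main path weight
    access_time: float  # T_r, seconds since the epoch
    modify_time: float  # T_m, seconds since the epoch
    write_time: float   # T_w, seconds since the epoch


def heat_value(meta, now, remaining, total, weights,
               max_read_freq=100.0, max_path_weight=10.0, tau=86400.0):
    """Weighted heat value H; decay and penalty forms are assumptions."""
    w1, w2, w3, w4, w5, w6 = weights

    def decay(ts):
        # Assumed exponential decay of the elapsed time (now - ts).
        return math.exp(-max(now - ts, 0.0) / tau)

    freq_term = min(meta.read_freq / max_read_freq, 1.0)      # in [0, 1]
    path_term = min(meta.path_weight / max_path_weight, 1.0)  # in [0, 1]
    penalty = 1.0 - remaining / total  # grows as remaining capacity shrinks
    return (w1 * freq_term + w2 * path_term
            + w3 * decay(meta.access_time)
            + w4 * decay(meta.modify_time)
            + w5 * decay(meta.write_time)
            - w6 * penalty)


def is_hot(heat, threshold):
    # A file is hot when H exceeds the preset heat threshold, otherwise cold.
    return heat > threshold
```

With this form, a frequently read, recently touched file on a mostly empty device scores close to the sum of the first five weights, while an old, unread file on a nearly full device scores low and is classified as cold.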

And the primary storage device is used for executing new file data writing and hot file reading. In order to improve the reading and writing speed and reduce the cost, the primary storage device adopts the SSD hard disk device with small capacity and high-speed access capability, and all new file data writing and hot file reading are born by the primary storage device so as to ensure the access performance.

And the secondary storage device is used for storing the cold files. Secondary storage devices may employ relatively inexpensive mass storage devices to provide sufficient storage space while reducing costs.

The primary storage device and the secondary storage device adopt a dual-machine mutual backup mode. The dual-machine hot backup tasks are synchronized through the Async, and two or more identical devices back each other up and support each other's work, ensuring high availability and reliability of the system.

And the file migration synchronous service module is used for migrating the cold files on the primary storage device to the secondary storage device, and after the cold files are migrated, if the residual capacity on the primary storage device is smaller than a preset reserved capacity threshold value, continuously migrating the file with the smallest file heat value to the secondary storage device, and deleting the corresponding file on the primary storage device after the migration is completed.

The file heat service module continuously calculates file heat values with the file heat value calculation formula and updates the heat value of each file in real time to monitor its changes. Once the heat value of a file falls below the preset heat threshold, the file becomes a cold file; it is migrated in time to the secondary storage device and the corresponding file on the primary storage device is deleted, releasing storage space on the primary storage device and achieving dynamic separate storage of cold and hot files.

In addition, to ensure the performance of the storage device, the primary storage device generally needs to reserve a certain capacity. A reserved capacity threshold X is therefore preset for the primary storage device. If, after all cold files on the primary storage device have been migrated to the secondary storage device, the remaining capacity on the primary storage device is still smaller than the preset reserved capacity threshold, files are migrated to the secondary storage device one by one, starting from the file with the smallest file heat value on the primary storage device, until the remaining capacity on the primary storage device is greater than or equal to the preset reserved capacity threshold.
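The two-phase migration rule described above can be sketched in Python as follows. This is an illustrative sketch only: the `FileEntry` type and the `plan_migration` function are hypothetical names, sizes and capacities are in arbitrary units, and the sketch only plans the order of migration rather than copying or deleting any data.

```python
from typing import List, NamedTuple, Tuple


class FileEntry(NamedTuple):
    name: str
    size: int
    heat: float


def plan_migration(files: List[FileEntry], heat_threshold: float,
                   remaining: int, reserved: int) -> Tuple[List[str], int]:
    """Return the migration order and the resulting remaining capacity."""
    # A file is cold when its heat value does not exceed the threshold.
    cold = [f for f in files if f.heat <= heat_threshold]
    hot = sorted((f for f in files if f.heat > heat_threshold),
                 key=lambda f: f.heat)

    plan = []
    # Phase 1: every cold file is migrated off the primary storage device.
    for f in cold:
        plan.append(f.name)
        remaining += f.size
    # Phase 2: while the reserve is not met, evict the least hot file next.
    for f in hot:
        if remaining >= reserved:
            break
        plan.append(f.name)
        remaining += f.size
    return plan, remaining
```

For example, with a reserve of 50 units and only 25 units free after the cold files are moved, the hot file with the lowest heat value is migrated next, and migration stops as soon as the reserve is satisfied.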

Asynchronous compression module: while the file migration synchronization service module is executing, when the file size of a cold file to be migrated is larger than a preset file occupation threshold and its file type is within a preset file type range, a file storage space optimization strategy is triggered, under which the cold file is asynchronously compressed and then stored in the secondary storage device.

The asynchronous compression module is mainly used for migrating special files. The file types of special files are preset; that is, the preset file type range consists of a plurality of preset file types.

When the file migration synchronization service module performs file migration, the file type is identified from the file's suffix. If a file being migrated occupies too much space, has a heat value below the heat threshold, and is a special file, an asynchronous compression instruction is triggered to compress it. This realizes migration and compressed storage of special files, reduces the space that rarely accessed special files occupy on the secondary storage device, and further improves the space utilization of the secondary storage device.

For a file without a file suffix, the file type is identified by a pre-trained neural network model.
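The compression decision described above (size above a threshold, type within a preset range, with a fallback classifier for suffix-less files) might look like the following Python sketch. The suffix set, the size threshold, and the `classify` callback that stands in for the pre-trained model are all hypothetical placeholders chosen for illustration.

```python
from pathlib import PurePath
from typing import Callable, Optional

# Hypothetical preset file type range and file occupation threshold.
COMPRESSIBLE_SUFFIXES = {".log", ".txt", ".csv"}
SIZE_THRESHOLD = 64 * 1024 * 1024  # 64 MiB, an assumed value


def should_compress(path: str, size: int,
                    classify: Optional[Callable[[str], str]] = None) -> bool:
    """Decide whether a cold file should be compressed during migration."""
    suffix = PurePath(path).suffix.lower()
    if not suffix and classify is not None:
        # No suffix: defer to a classifier (e.g. a pre-trained model) that
        # returns a suffix-like type label such as ".csv".
        suffix = classify(path)
    return size > SIZE_THRESHOLD and suffix in COMPRESSIBLE_SUFFIXES
```

Passing the classifier as a callback keeps the rule itself independent of how the model is implemented or deployed.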

Specifically, the neural network model may be a CNN, RNN, LSTM or similar model. Samples of different file types are collected in advance and converted into a data format suitable for the neural network model, for example the corresponding binary data or feature vectors. Features are then extracted from the file data so that the neural network can learn and identify the characteristics of different file types, and the model is trained on the collected file data so that it learns the features of, and differences between, file types. Optimizers such as gradient descent and Adam are used during training to speed up training and improve model performance. The trained model is evaluated on test data, and is tuned and optimized according to the evaluation results to improve the accuracy and robustness of file type identification.

Finally, the trained neural network model is deployed in the practical application; whenever the multi-level file system needs to identify a file type, it calls the pre-trained neural network model.
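As an illustration of the step that converts a file into a feature vector, the sketch below computes a normalized byte histogram. This particular feature is an assumption chosen for simplicity; the text does not state which features are extracted, and `byte_histogram` is a hypothetical helper name.

```python
from typing import List


def byte_histogram(data: bytes) -> List[float]:
    """Map raw file bytes to a 256-dimensional relative-frequency vector."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data)
    if total == 0:
        # An empty file has no byte distribution; return an all-zero vector.
        return [0.0] * 256
    return [c / total for c in counts]
```

A fixed-length vector of this kind could be fed to any of the model families mentioned above, since it does not depend on the file's length or name.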

A file access module: used for providing services within the Kubernetes cluster with high-speed file writing to the primary storage device through NFS or other service programs, providing file reading from the primary and secondary storage devices using nginx or other services, and exposing file service capabilities to application programs or cluster nodes.

The file access module reads the target file from the primary storage device preferentially, and if the target file cannot be found on the primary storage device, the file access module automatically transfers to the corresponding directory of the secondary storage device to find the target file; if the searched target file is a compressed file, triggering decompression processing, reading the decompressed file in a file stream mode, and returning while decompressing.
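The read path described above can be sketched in Python as a generator that streams chunks while decompressing. The sketch assumes gzip as the compression format and a `.gz` suffix on compressed files in the secondary storage directory; neither detail is specified in the text, and `read_file` and its parameters are hypothetical names.

```python
import gzip
import os
from typing import Iterator


def read_file(rel_path: str, primary_root: str, secondary_root: str,
              chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield file contents, preferring primary storage over secondary."""
    primary = os.path.join(primary_root, rel_path)
    plain = os.path.join(secondary_root, rel_path)
    packed = plain + ".gz"  # assumed naming convention for compressed files

    if os.path.isfile(primary):
        opener, path = open, primary
    elif os.path.isfile(plain):
        opener, path = open, plain
    elif os.path.isfile(packed):
        # gzip.open decompresses incrementally as the stream is read.
        opener, path = gzip.open, packed
    else:
        raise FileNotFoundError(rel_path)

    with opener(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            yield chunk
```

Because the function is a generator, the caller can start returning data to the client before the whole file has been decompressed, which matches the "return while decompressing" behaviour.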

The invention collects metadata information of files written into the primary storage device and capacity information of the primary storage device, calculates file heat values from this information, and divides the written files into hot files and cold files accordingly, so that hot and cold files are separated and stored separately. Reads and writes go preferentially to the primary storage device, which has high-speed access capability, giving high-speed reading and writing of files, logs and other data. Once a hot file turns cold, it is migrated in time, taking the remaining capacity into account, to the larger-capacity secondary storage device. The primary storage device can therefore run for a long time with stable available capacity, balancing file access performance and economy.

As shown in fig. 1, when the multi-level file system of the present invention is applied in a Kubernetes cluster scenario, the process by which a user obtains a requested file is as follows:

1. the user requests to read the target file;

2. forwarding the user request to the POD service within the Kubernetes cluster;

3. the POD service forwards the user request to a file access module of the multi-level file system;

4. searching a target file requested by a user from primary storage equipment through a file access module;

5. if the target file cannot be found on the primary storage device, automatically falling back to the corresponding directory of the secondary storage device to find the target file and returning the target file once found; if the found target file is a compressed file, triggering decompression processing and reading the decompressed file in a file stream mode;

6. the file access module asynchronously informs the metadata information of the file to the file heat service module, updates the metadata information of the file, and the file heat service module continuously calculates a file heat value H by adopting a file heat value calculation formula;

7. the file migration synchronization service module performs file migration according to the file heat values and the remaining capacity of the primary storage device: it migrates cold files on the primary storage device to the secondary storage device in real time and, when the remaining capacity of the primary storage device is smaller than the preset reserved capacity threshold, continues migrating the file with the smallest file heat value on the primary storage device to the secondary storage device until the remaining capacity of the primary storage device is greater than or equal to the preset reserved capacity threshold.
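The asynchronous metadata notification in step 6 can be sketched with a queue and a background worker thread. This is only one possible arrangement, assumed here for illustration; the `HeatService` class, its `notify` and `flush` methods, and the metadata fields are hypothetical and do not describe the actual inter-module protocol.

```python
import queue
import threading
import time


class HeatService:
    """Receives access events asynchronously and updates file metadata."""

    def __init__(self):
        self.meta = {}
        self._events = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def notify(self, path, kind):
        # Called by the file access module; returns without waiting.
        self._events.put((path, kind, time.time()))

    def _run(self):
        while True:
            path, kind, ts = self._events.get()
            entry = self.meta.setdefault(
                path, {"reads": 0, "access_time": None, "write_time": None})
            if kind == "read":
                entry["reads"] += 1
                entry["access_time"] = ts
            else:
                entry["write_time"] = ts
            self._events.task_done()

    def flush(self):
        # Block until every queued event has been processed.
        self._events.join()
```

Decoupling the notification from the request path in this way means a slow heat calculation never delays a user's read or write.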

Likewise, when a user requests to write file data, the user request is forwarded to the POD service within the Kubernetes cluster; the POD service forwards the user request to the file access module, and the content to be written is written into the primary storage device through the NFS service; the file access module then asynchronously notifies the file heat service module of the file's metadata information, and the metadata information of the file is updated.

In fig. 1, NODE denotes a worker node in the Kubernetes cluster; each node has a Kubernetes runtime environment, can host multiple PODs, and is responsible for managing the lifecycle, health and network communication of those PODs. A POD is a group of containers and is the smallest deployable unit in Kubernetes. A Service exposes PODs to applications or users so that they can access these PODs.

The multi-level file system is easy to expand and maintain. New secondary storage devices can be added at any time, making it possible to store massive numbers of data files. After a new secondary storage device is added, the primary storage device is configured to synchronize with it; only the file service configuration needs to be modified by adding one configuration section, which can be generated automatically from the meta-information maintained by the management end and takes effect automatically on loading. The original file data does not need to be moved, which is highly convenient.

On the basis of the multi-level file system, the invention also provides a construction method of the multi-level file system, which comprises the following steps:

s1, dividing a file system into a multi-level storage structure, wherein the multi-level storage structure comprises primary storage equipment and secondary storage equipment;

s2, writing new file data into the primary storage device, collecting metadata information of each file on the primary storage device and capacity information of the primary storage device, calculating a file heat value according to the metadata information of the file and the capacity information of the primary storage device, and dividing the file into a hot file and a cold file according to the file heat value;

s3, transferring the cold files on the primary storage device to the secondary storage device, if the residual capacity on the primary storage device is smaller than a preset reserved capacity threshold after the cold files are transferred, continuously transferring the file with the smallest file heat value to the secondary storage device, and deleting the corresponding file on the primary storage device after the transfer is completed;

s4, in the process of migrating the cold files on the primary storage device to the secondary storage device, when the file size of the cold files to be migrated is larger than a preset file occupation threshold value and the file type is in a preset file type range, asynchronously compressing the cold files to be migrated and then storing the cold files to the secondary storage device.

The above method embodiment is implemented on the basis of the system embodiment and is therefore described only briefly; for details, refer to the system embodiment.

The invention also discloses an electronic device, comprising: at least one processor, at least one memory, a communication interface, and a bus; the processor, the memory, and the communication interface communicate with one another through the bus; the memory stores program instructions executable by the processor, and the processor invokes the program instructions to implement the aforementioned methods of the present invention.

The invention also discloses a computer-readable storage medium storing computer instructions for causing a computer to implement all or part of the steps of the methods of the embodiments of the invention. The storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or any other medium capable of storing program code.

The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network elements. One of ordinary skill in the art may select some or all of the modules according to actual needs to achieve the objectives of the present embodiment, without any inventive effort.

The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A multi-level file system, the system comprising:

file heat service module: used for collecting metadata information of files written into the primary storage device and capacity information of the primary storage device, calculating a file heat value according to the metadata information of the files and the capacity information of the primary storage device, and dividing the files into hot files and cold files according to the file heat value;

primary storage device: for performing new file data writes and hot file reads;

secondary storage device: for storing cold files;

2. The multi-level file system of claim 1, wherein the system further comprises:

3. The multi-level file system of claim 2, wherein the primary storage device employs an SSD device, and the secondary storage device has a capacity greater than that of the primary storage device;

4. The multi-level file system of claim 2, wherein the metadata information of the file comprises: file size, file write time stamp, file modification time stamp, file access time stamp, file read frequency and preset file main path weight;

5. The multi-level file system of claim 4, wherein the calculating the file heat value according to the metadata information of the file and the capacity information of the primary storage device, and the classifying the file into the hot file and the cold file according to the file heat value specifically comprises:

the file heat value H is calculated using the following formula:

；

6. The multi-level file system of claim 2 wherein the predetermined range of file types consists of a predetermined plurality of file types;

7. The multi-level file system of claim 1, wherein the file access module preferentially reads the target file from the primary storage device;

8. The multi-level file system of claim 7, wherein the multi-level file system is applied in a Kubernetes cluster scenario;

when a user requests to read a target file, forwarding the user request to POD service in the Kubernetes cluster;

the POD service forwards the user request to a file access module of a multi-level file system, and searches a target file of the user request through the file access module; the file access module asynchronously informs the metadata information of the file to the file heat service module, and updates the metadata information of the file.

9. A method of constructing a multi-level file system, the method comprising:

dividing a file system into a multi-level storage structure, wherein the multi-level storage structure comprises a primary storage device and a secondary storage device;

writing new file data into the primary storage device, collecting metadata information of each file on the primary storage device and capacity information of the primary storage device, calculating a file heat value according to the metadata information of the file and the capacity information of the primary storage device, and dividing the file into a hot file and a cold file according to the file heat value;