CN114442937B - File caching method and device, computer equipment and storage medium - Google Patents

File caching method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN114442937B
CN114442937B (application CN202111663107.0A)
Authority
CN
China
Prior art keywords
file
directory
hash value
read
cached
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111663107.0A
Other languages
Chinese (zh)
Other versions
CN114442937A (en)
Inventor
高华龙
冯玉朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunkuanzhiye Network Technology Co ltd
Original Assignee
Beijing Yunkuanzhiye Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunkuanzhiye Network Technology Co ltd filed Critical Beijing Yunkuanzhiye Network Technology Co ltd
Priority to CN202111663107.0A priority Critical patent/CN114442937B/en
Publication of CN114442937A publication Critical patent/CN114442937A/en
Application granted granted Critical
Publication of CN114442937B publication Critical patent/CN114442937B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0689Disk arrays, e.g. RAID, JBOD
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0643Management of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a file caching method and apparatus, a computer device, and a storage medium. The method comprises the following steps: in response to a read-write instruction for a target file, calculating a hash value corresponding to the file name of the target file; performing a modulo operation on the hash value with the number of directories mounted under the cache directory to obtain the remainder of the hash value; and determining a unique path for the target file under the cache directory based on the remainder of the hash value. By applying the scheme of the invention, the read-write performance of cached small files can be improved, server resources can be fully utilized, and dependence on the file system can be reduced.

Description

File caching method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of disk data storage technologies, and in particular, to a file caching method and apparatus, a computer device, and a storage medium.
Background
A file cache is generally used to store a large number of temporary, fragmented files quickly. To improve cache performance, the problem is usually approached from the disk-media side: besides using higher-performance devices, RAID technology can be used to reorganize multiple disks, gaining stability and performance by increasing the number of disk media. RAID (Redundant Array of Independent Disks) denotes "an array with redundancy capability made up of independent disks": multiple independent disks are combined in different ways into a disk group of large capacity, providing higher storage performance than a single disk together with a data-backup capability, and improving the efficiency of the whole disk system through the additive effect of the individual disks supplying data in parallel. With this technique, data is divided into multiple segments, each stored on a separate hard disk.
In the prior art, the common RAID schemes are RAID0, RAID1, RAID5, RAID6, and RAID10. RAID0 only improves read-write performance and provides no data redundancy: several hard disks are combined into one large logical disk, and on reads and writes contiguous data is cut into blocks according to a rule and striped across the disks so that they operate simultaneously, increasing the read-write bandwidth roughly N-fold through parallel operation of the disks. However, because it has no redundancy, damage to any one disk affects almost all of the data. RAID1 is the opposite of RAID0: it provides only data redundancy, its read-write performance is bounded by the disk with the smallest bandwidth, and its merit is data safety; the data survives as long as one disk in the group does. RAID5 is currently the most widely used scheme; it is equivalent to adding one disk's worth of parity on top of RAID0, so damage to one disk does not cause data loss, and to spread the parity load RAID5 writes the parity data to all disks in turn (which is also what distinguishes RAID5 from RAID3). RAID6 mainly adds a second redundant block compared with RAID5. Still, for continuous small IO (IO size smaller than the cache block size), RAID5 does not improve sequential read performance over a single disk, and its write performance is actually reduced by the overwrite problem.
As the above shows, conventional RAID generally requires a group of disks of the same specification, whereas a server not only commonly mounts several local disks but may also attach further disks by other means (such as a disk expansion cabinet). Faced with such heterogeneous devices, RAID clearly has limitations if disks from multiple sources are to form one cache disk group. Moreover, RAID exposes a block device and cannot directly provide a file-level read-write interface, so the block device must be managed through an intermediate layer, namely a file system. Existing file systems are numerous and, beyond the basic create, read, write, and delete functions, have varied features: some support larger storage spaces, some support more files, some support very long file names, and some support very large single files; users often choose different file systems for specific business requirements. It is therefore necessary to provide a multi-hard-disk file caching scheme that resolves these technical drawbacks.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: in the prior art, simply applying RAID yields only a limited improvement in small-file read-write performance and can even reduce it; RAID supports disks of differing specifications poorly, which prevents full use of server resources; and, in typical caching business scenarios, most file systems suffer from limits such as bounded capacity, a bounded number of files per directory, or slow read-write speed once the number of files in a directory grows. Under rich and demanding business loads, existing RAID technology has difficulty sustaining performance in large-capacity, high-file-count scenarios.
In order to solve the technical problem, the invention provides a file caching method, which comprises the following steps:
in response to a read-write instruction for a target file, calculating a hash value corresponding to the file name of the target file;
performing a modulo operation on the hash value with the number of directories mounted under the cache directory to obtain the remainder of the hash value;
and determining a unique path for the target file under the cache directory based on the remainder of the hash value.
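A minimal sketch of the three claimed steps follows. The claims fix neither a hash algorithm nor a directory layout, so the use of MD5 and the flat `<cache>/<remainder>/<name>` layout here are purely illustrative assumptions:

```python
import hashlib
import os


def cache_path(filename: str, cache_dir: str, mount_count: int) -> str:
    """Map a file name to a unique path under the cache directory.

    Mirrors the claimed method: hash the file name, take the remainder
    modulo the number of mounted directories, and let that remainder
    select the mounted sub-directory.
    """
    # Step 1: hash the file name (MD5 is an assumption; any stable hash works).
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    # Step 2: remainder modulo the number of directories mounted under the cache.
    remainder = hash_value % mount_count
    # Step 3: the remainder picks the mounted directory; the file keeps its name.
    return os.path.join(cache_dir, str(remainder), filename)


p = cache_path("example.bin", "/cache", 3)
```

Because the path depends only on the file name, every read or write of the same file resolves to the same mounted directory without any lookup table.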
Optionally, the calculating a hash value corresponding to the filename of the target file includes:
and carrying out hash calculation on the file name of the target file based on a hash algorithm to obtain a hash value corresponding to the file name of the target file.
Optionally, the determining, based on the remainder of the hash value, a unique path of the target file under the cache directory includes:
obtaining the total number of cached files by dividing the total cache capacity by the average size of a cached file;
determining the number of leaf directories required for caching by dividing the total number of cached files by the maximum number of files per directory;
and determining the unique path of the target file under the cache directory according to the number of leaf directories and the maximum directory depth of the file system.
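The capacity arithmetic in the optional steps above can be sketched as follows. The example figures (a 1 TiB cache of roughly 64 KiB files, at most 10,000 files per directory) are assumptions for illustration, not values from the patent:

```python
def leaf_directory_count(total_capacity: int, avg_file_size: int,
                         max_files_per_dir: int) -> int:
    """Estimate how many leaf directories the cache needs."""
    # Total number of cached files = total cache capacity / average file size.
    total_files = total_capacity // avg_file_size
    # Leaf directories = total files / max files per directory, rounded up
    # so no leaf directory exceeds its file-count limit.
    return -(-total_files // max_files_per_dir)  # ceiling division


# 1 TiB of ~64 KiB files, capped at 10,000 files per directory.
n = leaf_directory_count(1 << 40, 64 << 10, 10_000)
```

With these figures the cache holds about 16.8 million files and needs 1,678 leaf directories; the directory depth is then chosen so the file system's maximum depth is not exceeded.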
Optionally, the method further comprises:
in response to a read-write instruction for a high-speed, low-reliability service, performing the disk data read-write operation using RAID0 combined with SSD media;
and in response to a read-write instruction for a low-speed, high-reliability service, performing the disk data read-write operation using RAID1 combined with HDD media.
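The two service-to-backend pairings above amount to a small routing policy. The profile names below are hypothetical labels, not terms defined in the patent:

```python
# Hypothetical service profiles mapped to the two claimed RAID/media pairings.
POLICIES = {
    "high_speed_low_reliability": ("RAID0", "SSD"),
    "low_speed_high_reliability": ("RAID1", "HDD"),
}


def select_backend(service_profile: str) -> tuple[str, str]:
    """Pick the RAID level and disk medium for a service's read-write instruction."""
    return POLICIES[service_profile]
```

A real implementation would dispatch the read-write instruction to the cache directory mounted on the chosen RAID group; this sketch only shows the selection step.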
In order to solve the above technical problem, the present invention provides a file caching apparatus, including:
the hash calculation module is used for responding to a read-write instruction of a target file and calculating a hash value corresponding to the file name of the target file;
the hash value processing module is used for performing a modulo operation on the hash value with the number of directories mounted under the cache directory to obtain the remainder of the hash value;
and the caching module is used for determining a unique path of the target file under the caching directory based on the remainder of the hash value.
Optionally, the hash calculation module is configured to:
and carrying out hash calculation on the file name of the target file based on a hash algorithm to obtain a hash value corresponding to the file name of the target file.
Optionally, the cache module is configured to:
obtaining the total number of cached files by dividing the total cache capacity by the average size of a cached file;
determining the number of leaf directories required for caching by dividing the total number of cached files by the maximum number of files per directory;
and determining the unique path of the target file under the cache directory according to the number of the leaf directories and the maximum directory level of a file system.
Optionally, the cache module is further configured to:
in response to a read-write instruction for a high-speed, low-reliability service, performing the disk data read-write operation using RAID0 combined with SSD media;
and in response to a read-write instruction for a low-speed, high-reliability service, performing the disk data read-write operation using RAID1 combined with HDD media.
In order to solve the above technical problem, the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above method when executing the computer program.
To solve the above technical problem, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above method.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
by applying the multi-hard-disk file caching method, the device, the computer equipment and the storage medium based on the hash ring, the hash value corresponding to the file name of the target file is calculated in response to the read-write instruction of the target file; utilizing the number of the mounted catalogues under the cache catalogues to carry out remainder operation on the hash value to obtain the remainder of the hash value; and determining a unique path of the target file under the cache directory based on the remainder of the hash value. Therefore, the read-write performance of the cached small files can be improved, the server resources can be fully utilized, and the dependence on a file system can be reduced.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the prior-art descriptions are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a file caching method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a general cache organization provided by an embodiment of the present invention;
fig. 3 is a structural diagram of a file caching apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram of a computer device provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In view of this, to address the problems that in the prior art simply using RAID improves small-file read-write performance only to a limited extent and can even reduce it; that disks of differing specifications are poorly supported, preventing full use of server resources; that most file systems limit capacity or the number of files in a single directory, or slow down once the file count per directory grows, in typical caching business scenarios; and that existing RAID technology struggles to sustain performance in large-capacity, high-file-count business scenarios, the invention provides a file caching method and apparatus, a computer device, and a storage medium.
The following describes a file caching method provided by the present invention.
As shown in fig. 1, a flowchart of a file caching method provided by the present invention may include the following steps:
step S101: and responding to the read-write instruction of the target file, and calculating the hash value corresponding to the file name of the target file.
In one case, the file name of the target file is hashed based on a hash algorithm, so as to obtain a hash value corresponding to the file name of the target file.
Step S102: performing a modulo operation on the hash value with the number of directories mounted under the cache directory to obtain the remainder of the hash value.
Step S103: and determining a unique path of the target file under the cache directory based on the remainder of the hash value. In one case, the remainder of the hash value is the directory location of the target file in the corresponding directory hierarchy.
In one case, referring to fig. 2, the unique path of the target file under the cache directory may be determined as follows: dividing the total cache capacity by the average size of a cached file to obtain the total number of cached files; dividing the total number of cached files by the maximum number of files per directory to determine the number of leaf directories required for caching; and determining the unique path of the target file under the cache directory according to the number of leaf directories and the maximum directory depth of the file system.
Furthermore, the file caching scheme is applied at the block-device layer of the disk system, with RAID used to organize and manage the disks so as to meet the read-write performance and data-redundancy requirements of different services: in response to a read-write instruction for a high-speed, low-reliability service, the disk data read-write operation is performed using RAID0 combined with SSD media; and in response to a read-write instruction for a low-speed, high-reliability service, the disk data read-write operation is performed using RAID1 combined with HDD media.
On this basis, a user can select different file systems according to the service characteristics, establish a file system on each RAID, and mount it under the cache directory, thereby avoiding the capacity limits of some file systems. Each directory under the cache directory corresponds to an independent file system; the directory names increase sequentially from 0, and the number of directories should preferably be a prime number so as to balance the total number of files across the directories.
In each independent file system, a user can configure the directory hierarchy according to the service volume and the characteristics of the file system, so as to avoid the performance degradation some file systems suffer from limits on, or excessive growth of, the number of files in a single directory.
It should be noted that the numbers of subdirectories at each level of the cache directory need to be prime numbers, and the values at different levels should differ, so as to avoid unbalanced file distribution. When sizing each level, the number of cache leaf directories can therefore be approximated by multiplying different primes taken from a prime table. Typically a few primes below 1000 are sufficient; for example, the prime combination [3, 919, 929, 937, 941] provides 3 × 919 × 929 × 937 × 941 = 2,258,300,311,401 leaf directories; that is, there are 3 directories under the cache directory, 919 directories under each of those 3, and so on.
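The leaf-directory count in the example can be checked directly:

```python
from math import prod

# Prime fan-outs per directory level, from the description's example:
# 3 directories under the cache root, 919 under each of those, and so on.
primes = [3, 919, 929, 937, 941]

# The number of leaf directories is the product of the per-level fan-outs.
leaves = prod(primes)
```

Using distinct primes at each level keeps the tree's fan-outs coprime, which avoids the file-distribution imbalances the description warns about.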
By applying the scheme of the invention, a hash value corresponding to the file name of the target file is calculated in response to a read-write instruction for the target file; a modulo operation is performed on the hash value with the number of directories mounted under the cache directory to obtain the remainder of the hash value; and a unique path for the target file under the cache directory is determined based on that remainder. In this way, the read-write performance of cached small files can be improved, server resources can be fully utilized, and dependence on the file system can be reduced.
The following describes a file caching apparatus according to the present invention.
As shown in fig. 3, a structure diagram of a file caching apparatus provided in an embodiment of the present invention includes:
the hash calculation module 210 is configured to respond to a read-write instruction of a target file, and calculate a hash value corresponding to a file name of the target file;
a hash value processing module 220, configured to perform a modulo operation on the hash value with the number of directories mounted under the cache directory to obtain the remainder of the hash value;
and the caching module 230 is configured to determine a unique path of the target file under the caching directory based on a remainder of the hash value.
In one case, the hash calculation module 210 is configured to perform a hash calculation on the file name of the target file based on a hash algorithm, so as to obtain a hash value corresponding to the file name of the target file.
In another case, the caching module 230 is configured to obtain the total number of cached files by dividing the total cache capacity by the average size of a cached file; determine the number of leaf directories required for caching by dividing the total number of cached files by the maximum number of files per directory; and determine the unique path of the target file under the cache directory according to the number of leaf directories and the maximum directory depth of the file system.
In another case, the caching module 230 is further configured to perform, in response to a read-write instruction for a high-speed, low-reliability service, the disk data read-write operation using RAID0 combined with SSD media; and, in response to a read-write instruction for a low-speed, high-reliability service, perform the disk data read-write operation using RAID1 combined with HDD media.
By applying the scheme of the invention, a hash value corresponding to the file name of the target file is calculated in response to a read-write instruction for the target file; the remainder of the hash value is obtained using the number of directories mounted under the cache directory; and a unique path for the target file under the cache directory is determined based on that remainder. In this way, the read-write performance of cached small files can be improved, server resources can be fully utilized, and dependence on the file system can be reduced.
The following describes a computer device provided in an embodiment of the present invention.
To solve the above technical problem, the present invention provides a computer device, as shown in fig. 4, including a memory 310, a processor 320 and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the method as described above.
The computer device can be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device may include, but is not limited to, a processor 320 and a memory 310. Those skilled in the art will appreciate that fig. 4 is merely an example of a computing device and is not intended to be limiting; the device may include more or fewer components than those shown, may combine some components, or may use different components. For example, the computing device may also include input-output devices, network access devices, buses, and the like.
The Processor 320 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 310 may be an internal storage unit of the computer device, such as a hard disk or memory of the computer device. The memory 310 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) fitted to the computer device. Further, the memory 310 may include both an internal storage unit and an external storage device of the computer device. The memory 310 is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
The embodiment of the present application further provides a computer-readable storage medium, which may be a computer-readable storage medium contained in the memory in the foregoing embodiment; or it may be a computer-readable storage medium that exists separately and is not incorporated into a computer device. The computer-readable storage medium stores one or more computer programs which, when executed by a processor, implement the methods described above.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of those method embodiments. The computer program comprises computer program code, which may be in source-code form, object-code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be adjusted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
For system or apparatus embodiments, since they are substantially similar to the method embodiments, they are described relatively briefly; for related details, reference may be made to the description of the method embodiments.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be distributed among different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; an integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from one another and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described again here.
It should be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises," "comprising," and any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if the described condition or event is detected" may be interpreted to mean "upon determining" or "in response to determining" or "upon detecting the described condition or event" or "in response to detecting the described condition or event", depending on the context.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A file caching method, characterized by comprising the following steps:
calculating, in response to a read-write instruction for a target file, a hash value corresponding to the file name of the target file;
taking the hash value modulo the number of directories mounted under the cache directory to obtain the remainder of the hash value; wherein each directory under the cache directory corresponds to an independent file system, and the directory hierarchy within each independent file system is configured according to the traffic volume and characteristics of that file system; the number of subdirectories at each directory level is a prime number, and these values differ between levels;
obtaining the total number of cached files by dividing the total cache capacity by the average size of the cached files;
determining the number of leaf directories to be cached by dividing the total number of cached files by the maximum number of files allowed in each directory;
calculating, according to the number of leaf directories, the number of directories at each level by multiplying distinct primes from a prime table such that the product approximates the number of leaf directories; and
determining the unique path of the target file under the cache directory according to the number of leaf directories and the maximum directory depth of the file system.
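The steps of claim 1 can be sketched as follows. This is a minimal illustration only: the patent publishes no reference code, so the hash algorithm (MD5 here), the prime table, and all function names are assumptions made for the sketch.

```python
import hashlib

# Hypothetical prime table; the patent only requires distinct primes per level.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23]

def name_hash(filename: str) -> int:
    """Hash the target file's name to an integer (algorithm is an assumption)."""
    return int(hashlib.md5(filename.encode()).hexdigest(), 16)

def pick_mount(filename: str, num_mounts: int) -> int:
    """Choose a mounted directory: hash value modulo the mount count."""
    return name_hash(filename) % num_mounts

def leaf_directory_count(total_capacity: int, avg_file_size: int,
                         max_files_per_dir: int) -> int:
    """Total cached files = capacity / average size; leaves = files / per-dir max."""
    total_files = total_capacity // avg_file_size
    return max(1, total_files // max_files_per_dir)

def per_level_counts(target_leaves: int, max_levels: int) -> list[int]:
    """Multiply distinct primes until their product approximates the leaf count."""
    counts, product = [], 1
    for p in PRIMES[:max_levels]:
        if product >= target_leaves:
            break
        counts.append(p)
        product *= p
    return counts or [1]

def cache_path(filename: str, level_counts: list[int]) -> str:
    """Derive a unique sub-path by reducing the hash level by level."""
    h = name_hash(filename)
    parts = []
    for n in level_counts:
        parts.append(str(h % n))
        h //= n
    return "/".join(parts)
```

For example, a 100 GB cache of files averaging 1 MB, with at most 1000 files per leaf directory, yields about 100 leaf directories, which the prime-product step would approximate with small distinct primes (e.g. 2 × 3 × 5 × 7 = 210 covers 100).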
2. The file caching method according to claim 1, wherein calculating the hash value corresponding to the file name of the target file comprises:
performing a hash calculation on the file name of the target file based on a hash algorithm to obtain the hash value corresponding to the file name of the target file.
3. The file caching method according to claim 1, further comprising:
performing, in response to a read-write instruction for a high-speed, low-reliability service, disk read-write operations using RAID0 combined with SSD media; and
performing, in response to a read-write instruction for a low-speed, high-reliability service, disk read-write operations using RAID1 combined with HDD media.
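The media-selection policy of claim 3 amounts to a simple mapping from service profile to RAID level and media type. A hypothetical sketch (the service labels are invented for illustration and are not defined in the patent):

```python
def select_storage(service: str) -> tuple[str, str]:
    """Map a service profile to a (RAID level, media type) pair, per claim 3."""
    if service == "high-speed-low-reliability":
        return ("RAID0", "SSD")  # striping on SSD favors throughput
    if service == "low-speed-high-reliability":
        return ("RAID1", "HDD")  # mirroring on HDD favors durability
    raise ValueError(f"unknown service profile: {service}")
```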
4. A file caching apparatus, characterized by comprising:
a hash calculation module, configured to calculate, in response to a read-write instruction for a target file, a hash value corresponding to the file name of the target file;
a hash value processing module, configured to take the hash value modulo the number of directories mounted under the cache directory to obtain the remainder of the hash value; wherein each directory under the cache directory corresponds to an independent file system, and the directory hierarchy within each independent file system is configured according to the traffic volume and characteristics of that file system; the number of subdirectories at each directory level is a prime number, and these values differ between levels; and
a caching module, configured to:
obtain the total number of cached files by dividing the total cache capacity by the average size of the cached files;
determine the number of leaf directories to be cached by dividing the total number of cached files by the maximum number of files allowed in each directory;
calculate, according to the number of leaf directories, the number of directories at each level by multiplying distinct primes from a prime table such that the product approximates the number of leaf directories; and
determine the unique path of the target file under the cache directory according to the number of leaf directories and the maximum directory depth of the file system.
5. The file caching apparatus according to claim 4, wherein the hash calculation module is configured to:
perform a hash calculation on the file name of the target file based on a hash algorithm to obtain the hash value corresponding to the file name of the target file.
6. The file caching apparatus according to claim 4, wherein the caching module is further configured to:
perform, in response to a read-write instruction for a high-speed, low-reliability service, disk read-write operations using RAID0 combined with SSD media; and
perform, in response to a read-write instruction for a low-speed, high-reliability service, disk read-write operations using RAID1 combined with HDD media.
7. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 3.
8. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 3.
CN202111663107.0A 2021-12-31 2021-12-31 File caching method and device, computer equipment and storage medium Active CN114442937B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111663107.0A CN114442937B (en) 2021-12-31 2021-12-31 File caching method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111663107.0A CN114442937B (en) 2021-12-31 2021-12-31 File caching method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114442937A CN114442937A (en) 2022-05-06
CN114442937B true CN114442937B (en) 2023-03-28

Family

ID=81366196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111663107.0A Active CN114442937B (en) 2021-12-31 2021-12-31 File caching method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114442937B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116185965B (en) * 2023-05-04 2023-08-04 联想凌拓科技有限公司 Method, apparatus, device and medium for quality of service control
CN117061615B (en) * 2023-10-09 2024-01-16 杭州优云科技有限公司 Cache path acquisition method, device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8266136B1 (en) * 2009-04-13 2012-09-11 Netapp, Inc. Mechanism for performing fast directory lookup in a server system
CN109271361A (en) * 2018-08-13 2019-01-25 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Distributed storage method and system for massive small files
CN109446160A (en) * 2018-11-06 2019-03-08 郑州云海信息技术有限公司 A kind of file reading, system, device and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226888A1 (en) * 2012-02-28 2013-08-29 Netapp, Inc. Systems and methods for caching data files
US20180032540A1 (en) * 2016-07-28 2018-02-01 Dell Products L.P. Method and system for implementing reverse directory lookup using hashed file metadata
CN106775453B (en) * 2016-11-22 2019-07-05 华中科技大学 A kind of construction method mixing storage array
CN107562786A (en) * 2017-07-27 2018-01-09 平安科技(深圳)有限公司 File memory method, terminal and computer-readable recording medium
CN111367861A (en) * 2020-02-29 2020-07-03 苏州浪潮智能科技有限公司 File caching method, system, device and medium
JP2021144381A (en) * 2020-03-11 2021-09-24 Necソリューションイノベータ株式会社 Protocol converter, block storage device, protocol conversion method, program, and recording medium


Also Published As

Publication number Publication date
CN114442937A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
US9529545B1 (en) Managing data deduplication in storage systems based on storage space characteristics
US9384206B1 (en) Managing data deduplication in storage systems
US9460102B1 (en) Managing data deduplication in storage systems based on I/O activities
US10031675B1 (en) Method and system for tiering data
US10169365B2 (en) Multiple deduplication domains in network storage system
US11010300B2 (en) Optimized record lookups
US11586366B2 (en) Managing deduplication characteristics in a storage system
CN104408091B (en) The date storage method and system of distributed file system
EP2502148B1 (en) Selective file system caching based upon a configurable cache map
US9449011B1 (en) Managing data deduplication in storage systems
CN114442937B (en) File caching method and device, computer equipment and storage medium
US8090924B2 (en) Method for the allocation of data on physical media by a file system which optimizes power consumption
US9612758B1 (en) Performing a pre-warm-up procedure via intelligently forecasting as to when a host computer will access certain host data
US10579593B2 (en) Techniques for selectively deactivating storage deduplication
US8793226B1 (en) System and method for estimating duplicate data
US10481820B1 (en) Managing data in storage systems
US8538933B1 (en) Deduplicating range of data blocks
US9383936B1 (en) Percent quotas for deduplication storage appliance
US11199990B2 (en) Data reduction reporting in storage systems
US10387369B1 (en) Managing file deletions of files and versions of files in storage systems
CN112684975B (en) Data storage method and device
US11232043B2 (en) Mapping virtual block addresses to portions of a logical address space that point to the virtual block addresses
US10725944B2 (en) Managing storage system performance
US20190339911A1 (en) Reporting of space savings due to compression in storage systems
US11809379B2 (en) Storage tiering for deduplicated storage environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant