WO2023185770A1

WO2023185770A1 - Cloud data caching method and apparatus, device and storage medium

Info

Publication number: WO2023185770A1
Application number: PCT/CN2023/084183
Authority: WO
Inventors: 余洋; 孙相征; 何万青
Original assignee: 阿里巴巴（中国）有限公司
Priority date: 2022-03-28
Filing date: 2023-03-27
Publication date: 2023-10-05
Also published as: CN114840140A

Abstract

Disclosed in the present application are a cloud data caching method and system, a device and a storage medium. The method comprises: in the embodiments of the present application, a multi-level data caching architecture comprising a cache layer, a cache management layer and a cache client can be constructed on a cloud host, and, when a user file to be cached is obtained, a grading feature corresponding to said user file is determined, and said user file is cached to a storage area of a corresponding level on the basis of the grading feature. In the embodiments of the present application, a complex cloud IO scenario can be flexibly dealt with according to a file distribution hash table recording the storage position of each user file. In addition, in the embodiments of the present application, the cache layer is constructed by using idle resources of the cloud host, so that cloud resources can be fully utilized, and the pressure of upper layer processing IO is effectively relieved.

Description

Cloud data caching method, device, equipment and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on March 28, 2022, with application number 202210313588.0 and the application name "Cloud Data Caching Method, Device, Equipment and Storage Medium", the entire content of which is incorporated by reference. incorporated in this application.

Technical field

This application belongs to the field of computer technology, and specifically relates to a cloud data caching method, system, equipment and storage medium.

Background technique

With the rapid development of cloud computing technology, more and more HPC (High Performance Computing, high-performance computer cluster) industry users are migrating operating data to the cloud. The HPC industry in scenarios such as film and television rendering, biological information, and geological exploration usually requires massive computing resources. The computing process is accompanied by a large number of file reading and writing operations, which places extremely high requirements on cloud file storage performance.

Since the IO (Input/Output, input/output) characteristics of different HPC scenarios vary widely, the required specific storage performance indicators such as throughput rate, IOPS (Input/Output Operations Per Second, reads and writes per second), and latency also vary greatly. , resulting in HPC scenarios often encountering many storage problems when migrating to the cloud, such as storage performance not matching requirements, a single storage mode being difficult to cope with complex IO characteristics, and multiple sets of storage management being highly complex.

Contents of the invention

This application proposes a cloud data caching method, system, equipment and storage medium, which can build a multi-level data caching architecture on the cloud host including a caching layer, a caching management layer and a caching client, and record each user file The file distribution hash table of the storage location can flexibly cope with complex IO scenarios on the cloud. In addition, in the embodiment of this application, the idle resources of the cloud host are used to build the cache layer, so that cloud resources can be fully utilized and the upper layer processing IO pressure can be effectively alleviated.

The first embodiment of the present application proposes a data caching method on the cloud. The method includes:

Cache user files through the cloud host's distributed memory, virtual disk, and file system mounted on the cloud host;

When the user file to be cached is obtained, the hierarchical characteristics corresponding to the user file are determined, and based on the hierarchical characteristics, the user files are cached to the storage area of the corresponding level. The hierarchical characteristics include access frequency and modification frequency. And at least one of the amount of data, the different levels of storage areas include the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host.

The third embodiment of the present application provides a cloud data cache system, including a data source layer, a cache layer on a cloud host, a cache management and control layer, a cache client on the cloud host, and an HPC processing end, where:

The data source layer includes cloud low-frequency file storage, cloud object storage and IDC file storage;

The cache layer includes a file system, distributed memory and virtual disk mounted on the cloud host. The cache layer is used to cache frequently accessed user files;

The cache management and control layer includes a cache configuration center, a file access characteristic statistics table and a file distribution hash table. The cache management and control layer is used to manage cached user files;

Wherein, the cache client is used to provide a data operation interface for the HPC processing layer and process IO requests.

An embodiment of the fourth aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor runs the computer program to Implement the method described in the first or second aspect above.

The embodiment of the fifth aspect of the present application provides a computer-readable storage medium on which a computer program is stored, and the program is executed by a processor to implement the method described in the first or second aspect.

The technical solutions provided in the embodiments of this application have at least the following technical effects or advantages:

In the embodiment of this application, a multi-level data cache architecture including a cache layer, a cache management layer, and a cache client can be constructed on the cloud host, and a flexible response can be implemented based on the file distribution hash table that records the storage locations of each user's files. Complex IO scenarios on the cloud. In addition, in the embodiment of this application, the idle resources of the cloud host are used to build the cache layer, so that cloud resources can be fully utilized and the upper layer processing IO pressure can be effectively alleviated.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Description of drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are for the purpose of illustrating preferred embodiments only and are not to be construed as limiting the application. Also throughout the drawings, the same reference characters are used to designate the same components. In the attached picture:

Figure 1 shows an operation flow chart of a cloud data caching method provided by an embodiment of the present application;

Figure 2 shows an architectural diagram of a cloud data caching system provided by an embodiment of the present application;

Figure 3 shows a schematic structural diagram of a cloud data caching device provided by an embodiment of the present application;

Figure 4 shows a schematic structural diagram of an electronic device provided by an embodiment of the present application;

Figure 5 shows a schematic diagram of a storage medium provided by an embodiment of the present application.

Detailed ways

Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough understanding of the present application, and to fully convey the scope of the present application to those skilled in the art.

It should be noted that, unless otherwise stated, the technical terms or scientific terms used in this application should have the usual meanings understood by those skilled in the art to which this application belongs.

A cloud data caching method, system, device and storage medium proposed according to embodiments of the present application will be described below with reference to the accompanying drawings.

The cloud data caching method provided by the embodiment of this application. Referring to Figure 1, the method specifically includes the following steps:

Step 101: Cache user files through the cloud host's distributed memory, virtual disk, and file system mounted on the cloud host.

In related technologies, HPC processing (such as film and television rendering, biological information, and underground exploration) usually requires massive computing resources. The computing process is accompanied by a large number of file reading and writing operations, so the performance requirements for cloud file storage are extremely high.

However, since the IO characteristics of different HPC processing scenarios vary widely, the specific storage performance indicators (throughput, IOPS, latency, etc.) required also vary greatly. Considering the diversity of IO characteristics, HPC processing cloud migration often encounters the following storage problems: storage performance does not match processing requirements. That is, for users who are migrating to the cloud for the first time, it is difficult to select the file storage specification that best matches the processing IO characteristics at one time. Subsequent changes in storage specifications involve changes to underlying hardware resources and data migration, which is very costly. In addition, due to the diversity of offline HPC processing, users may face the situation that low-end storage performance on the cloud cannot meet processing needs, while high-end storage specifications overflow performance and are expensive. It is difficult for a single storage to cope with complex IO characteristics 1. HPC processing usually involves a huge amount of data, and the access characteristics (access frequency, block size) of different data vary greatly. Storing all data on a single storage is difficult to control costs.

Based on the above existing problems, this application proposes a method of caching user files through the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host to implement data caching. Among them, the specific application includes the cloud data caching system applied to the data source layer, the cache layer on the cloud host, the cache management and control layer, the cache client on the cloud host, and the HPC processing end.

As shown in Figure 2, the data source layer in the cloud data caching system in this application includes cloud low-frequency file storage, cloud object storage and IDC file storage. And the cache layer includes the file system, distributed memory and virtual disk mounted on the cloud host. The cache layer is used to cache frequently accessed user files; and the cache management and control layer includes the cache configuration center, file access characteristic statistics table and file Distributed hash table, cache management and control layer is used to manage cached user files; among them, the cache client is used to provide a data operation interface for the HPC processing layer and process IO requests.

Specifically, for low-frequency file storage on the cloud in the data source layer, the transmission of user file data between various cloud hosts can be supported based on file system mounting. It should be noted that it can support standard file IO interface. And only used for caching user files whose data access frequency is low.

In addition, for cloud object storage, it can support the transmission and access of user file data between cloud hosts through specific API interfaces, which has certain advantages in data distribution. One approach, which is also used only for less frequently accessed data, is the caching of user files.

Furthermore, for IDC file storage, the file storage located in the user's local computer room supports the user's local processing on the one hand, and on the other hand is connected to the cloud network through a dedicated line/VPN.

Furthermore, for the file system mounted on the cloud host in the cache layer, it is a high-frequency file cache area on the cloud, which can be used to support the persistent global cache layer for data sharing between cloud hosts, and has better performance than low-frequency file storage. Stronger. In this embodiment of the present application, it can be used to cache files in the data source that are accessed more frequently, have larger file data volumes, and frequently change data, such as core user files for HPC tasks.

Additionally, for disk caching, it can be used to persist local cache layers. In the embodiment of this application, the free disk space of the cloud host is used to cache files in data sources with high access frequency, small file data volume, and infrequent data changes, such as: processing software, program plug-ins, pre- and post-processing scripts, etc. In one method, the disk cache capacity can be expanded by adding a data disk to the cloud host. Among them, the user files cached in the disk cache used for local caching can be user files exclusive to a certain cloud host.

In addition, for distributed memory, it can be used for non-persistent global cache layer. That is to say, it and the high-frequency file cache area on the cloud can also be used to support the persistent global cache layer that supports data sharing between cloud hosts. In the embodiment of this application, the free memory of multiple cloud hosts builds a memory file system through tmpfs/ramdisk and other methods to form a distributed memory cache layer, and is uniformly managed by the cache management and control layer to cache data sources that are frequently used. The file block being accessed. The more cloud hosts there are during peak user processing periods, the larger the memory cache space will be.

Furthermore, the cache configuration center of the cache management and control layer in the cloud data cache system of this application can be used to maintain the cache configuration information of cached data in the system and provide cache control interfaces to users. In one method, users can enable and close each cache layer by interacting with the cache configuration center, thereby cooperating with the cache client to achieve User files can be retrieved at any time. In addition, the cache cold data cleaning strategy can also be implemented, such as regularly cleaning data with low data access frequency based on the memory/disk occupancy ratio and file popularity. Furthermore, caching strategies can also be customized, such as data prefetching based on specific file names.

In addition, the file access characteristic statistics table can be used to maintain the file access characteristics of the upper-layer HPC Workload, including but not limited to file access popularity, access mode (sequential access, random access), read file block size, etc. Supports periodic dimension statistics (for example, by month/day/hour/minute). File access characteristics statistics tables are also used to provide input for cache cold data cleanup and data flow.

Optionally, the file distribution hash table is used to maintain the storage location of files/file blocks in the cache layer, supporting HPC Workload to efficiently obtain target files/file blocks and ensuring the efficiency of upper-layer processing.

It should be noted that the three components of the cache management and control layer in the data cache system (i.e. cache configuration center, file access characteristic statistics table and file distribution hash table) can be used to store cache core data, and the storage method is not limited to database, redis, files, and ensure the consistency of data access/update under the distributed architecture through mutex locks.

Further explanation, the cache client of this application can provide file system entry and standard POSIX file operation interface for the HPC processing layer upward, and is responsible for real-time processing of IO requests downward.

For the process of interacting with the HPC processing layer, you can first obtain the real-time IO request sent by the upper-layer HPC processing, and obtain the location in the cache of the target file/file block corresponding to the requested user file from the file distribution hash table.

Among them, if the target file/file block is an uncached file, the target file/file block can be read from the data source and passed to the upper layer for processing, and at the same time, the target file/file block can be cached to a high-frequency file on the cloud. Store and synchronously update files and their mapping relationships into the file distribution hash table.

And if the target file/file block is a file that has been cached in the cloud high-frequency file storage or disk cache but has not been cached in the distributed memory cache or the cache is invalid, the file is read from the target cache and passed to the upper layer for processing. At the same time, the read target file/file block is cached in the distributed memory cache, and the file and its mapping relationship are synchronously updated into the file distribution hash table.

If the target file/file block is a file block that has been cached in the distributed memory cache, the target file/file block is directly read from the distributed memory cache and passed to the upper layer for processing.

Among them, the cache client is responsible for the real-time processing of IO requests. It can periodically update the statistical hierarchical feature information to the file access feature table and obtain the relevant configuration information of the cache configuration center. According to the time-series changes in file access heat and read files Block size, combined with the configuration information of the cache configuration center, realizes the flow of hot data between high-frequency file cache and disk cache on the cloud and the cleanup of cold data at each cache layer, and simultaneously updates the file distribution hash table.

Step 102: When the user file to be cached is obtained, the hierarchical characteristics corresponding to the user file are determined, and based on the hierarchical characteristics, the user files are cached to the storage area of the corresponding level. The hierarchical characteristics include access frequency, modification frequency and data volume, which vary. The level storage area includes the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host.

In one way, in the cloud data caching method proposed in this application, each user file to be cached can be cached in a corresponding hierarchical manner according to one of access frequency, modification frequency, and data amount. Among them, the distributed memory, the virtual disk and the file system mounted on the cloud host can jointly form the multi-level cache area.

In one method, the distributed memory cache can correspond to cached user files: user files with high cache access frequency (i.e. hot data) and file block granularity and small capacity. At the same time, it can also provide low latency for upper-layer processing. data access. Shared cache.

In one method, the disk cache can correspond to cached user files: files with high cache access frequency (i.e. hot data), file block granularity, small capacity, and infrequent changes, to avoid single points that may occur in shared caches. IO bottleneck and share the pressure on the distributed memory cache.

Another way is that the file system mounted on the cloud host can correspond to the cached user files: cache files with high access frequency (i.e., warm data) and user files with file block granularity, large capacity, and frequent changes. Understandably, as the backend of distributed memory cache and disk cache, it carries most of the hot/warm data required for upper-layer processing, thus reducing access to data sources.

In addition, the cloud data caching system in this application can support hot swapping, support docking with data sources on and off the cloud, and support horizontal and vertical expansion.

In one method, if an IO request for a user file is received from the sending object (for example, the HPC processing layer). That is, the corresponding cache location can be determined based on the preset file distribution hash table (the file distribution hash table records the mapping relationship between each user file and the cache location). And extract cache data from the corresponding cache location and reply to the sending object.

Specifically, the following steps may be included:

Step 1. First, match the requested user file in the IO request with the file distribution hash table to determine the storage location of the requested user file on the cloud host.

Step 2. If it is determined that the requested user file is not stored on the cloud host, obtain the requested user file from the associated data source on the cloud host. Otherwise skip to step 5.

Step 3. Cache the requested user file to the high-frequency file storage area on the cloud in the cloud host (i.e., one of the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host), and cache the requested user file. User files and high The mapping relationship of the frequency file storage area is updated into the file distribution hash table.

Step 4. Send the requested user file to the sender of the IO request.

Step 5. If it is determined that the requested user file is stored on the cloud host, determine whether the requested user file is cached in distributed memory.

Step 6. If cached in distributed memory, send the requested user file to the sender of the IO request.

Step 7. If it is not cached in distributed memory, cache the requested user file into distributed memory, update the mapping relationship between the requested user file and distributed memory to the file distribution hash table, and send the requested user file to The sender of the IO request.

In addition, for data management in the system, one method can be to periodically determine the access popularity value of each user file based on the cache configuration information set by the user and the file access characteristics of the user files collected online. And clean up user files whose access heat value is lower than the preset heat value.

In another method, the space size change of each user file can also be determined periodically based on the cache configuration information set by the user and the file access characteristics of the user files in online statistics. Therefore, based on subsequent changes in the space size of each user file, the user files are cached from the disk cache to the distributed cache, or the distributed cache is cached to the disk cache.

In the embodiment of this application, user files are cached through the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host; and when the user file to be cached is obtained, the hierarchical characteristics corresponding to the user file are determined. And based on the hierarchical characteristics, the user files are cached to the storage area of the corresponding level. The hierarchical characteristics include at least one of the access frequency, the modification frequency and the amount of data. The different levels of storage areas include the distributed memory of the cloud host, the virtual disk and the cloud host. The file system mounted on.

In addition, in the embodiment of the present application, IO requests for user files can also be responded to based on the file distribution hash table. The file distribution hash table includes the mapping relationship between user files and cache locations; based on the cache configuration information and online statistics set by the user. File access characteristics of user files, and data management of cached user files. A multi-level data cache architecture constructed on a cloud host in the embodiment of the present application including a cache layer, a cache management layer and a cache client can flexibly cope with the cloud based on the file distribution hash table that records the storage locations of each user's files. Complex IO scenarios. In addition, in the embodiment of this application, the idle resources of the cloud host are used to build the cache layer, so that cloud resources can be fully utilized and the upper layer processing IO pressure can be effectively alleviated.

Optionally, in one aspect of the embodiment of the present application, user files can be cached to storage areas of corresponding levels based on hierarchical characteristics, including:

Determine the user file to be a low-frequency access user file, and cache the low-frequency access user file to the low-frequency cloud server in the cloud host. In the file storage area, low-frequency access user files are user files whose access frequency is lower than the first preset frequency;

The user files are determined to be high-frequency access user files, and the high-frequency access user files are cached in a storage area of a corresponding level. The high-frequency access user files are user files whose access frequency is not lower than the first preset frequency.

In one way, this application can divide each user file to be cached into a high-frequency access user file or a low-frequency access user file. And store the high-frequency access user files in the cloud high-frequency file storage area in the cloud host (that is, one of the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host). And store the cached user files of low-frequency access user files to the cloud low-frequency file storage area in the cloud host.

It can be understood that user files that are accessed frequently are user files that users perform relatively frequent retrieval operations (that is, user files whose access frequency is not lower than the first preset frequency). The low-frequency accessed user files are user files that the user performs relatively infrequent retrieval operations (that is, user files whose access frequency is lower than the first preset frequency).

Optionally, on the one hand, in the embodiment of the present application, it is determined that the user file is a high-frequency access user file, and the high-frequency access user file is cached in a corresponding level of storage area, including:

Cache user files that correspond to high-frequency access to user files, and the data volume is not lower than the preset space threshold, and the modification frequency is not lower than the second preset frequency, into the file system mounted on the cloud host;

Cache user files that correspond to frequently accessed user files and whose data volume is lower than the preset space threshold into virtual disks or distributed memory.

Optionally, on the one hand, in the embodiment of the present application, user files corresponding to high-frequency access to user files and the data volume is lower than the preset space threshold are cached into a virtual disk or a file system mounted on the cloud host. ,include:

Determine the modification frequency corresponding to cached user files that correspond to frequently accessed user files and whose data volume is lower than the preset space threshold;

Store cached user files whose modification frequency is lower than the second preset frequency in the virtual disk; or,

Store cached user files in which the modification frequency is not lower than the second preset frequency in the distributed memory.

In one method, the distributed memory cache can correspond to cached user files: user files with high cache access frequency and file block granularity and small capacity (that is, corresponding to user files with high frequency access and the data volume is lower than the preset space). threshold user files), and it can also provide low-latency data access for upper-layer processing. Shared cache.

In one method, the disk cache can correspond to cached user files: files with high cache access frequency, file block granularity, small capacity, and infrequent changes (that is, files corresponding to high-frequency access user files, and the data volume is less than User files with a preset space threshold and a change frequency lower than the second preset frequency) avoid single-point IO bottlenecks that may occur in shared caches and share the pressure on distributed memory caches.

Another way is that the file system mounted on the cloud host can correspond to the cached user files: cache access frequency Files with higher file block granularity, user files with large capacity and frequent changes (that is, corresponding to user files with high frequency access, and the data volume is not lower than the preset space threshold, and the change frequency is not lower than the second preset frequency user file). Understandably, as the backend of distributed memory cache and disk cache, it carries most of the hot/warm data required for upper-layer processing, thereby reducing access to data sources.

Optionally, in one aspect of the embodiment of the present application, after caching user files to a storage area of a corresponding level, it also includes:

Based on the cache configuration information set by the user and the file access characteristics of user files collected online, the access popularity value of each user file is periodically determined;

Clean up user files whose access heat value is lower than the preset heat value.

Based on the cache configuration information set by the user and the file access characteristics of user files collected online, the space size change of each user file is periodically determined;

Based on the change in space size of each user file, the user files are cached in different storage areas. The storage area includes one of disk cache and distributed cache.

In one method, for data management in the system, one method can be to periodically determine the access popularity value of each user file based on the cache configuration information set by the user and the file access characteristics of the user files collected online. And clean up user files whose access heat value is lower than the preset heat value.

In addition, this application can periodically update the statistical hierarchical feature information to the file access feature table and obtain the relevant configuration information of the cache configuration center. According to the change in the read file block space size, combined with the configuration information of the cache configuration center, the data flow of each storage area is realized (i.e. Based on the change in space size of each user file, the user files are cached from the disk cache to the distributed cache, or the distributed cache is cached to the disk cache). And update the file distribution hash table synchronously.

Receive IO requests for user files;

Match the requested user file in the IO request with the file distribution hash table to determine the storage location of the requested user file on the cloud host. The file distribution hash table includes the mapping relationship between the user file and the cache location;

If it is determined that the requested user file is not stored on the cloud host, obtain the requested user file from the associated data source on the cloud host;

Cache the requested user file to the storage area of the corresponding level, and cache the requested user file with the storage area of the corresponding level. The mapping relationship of the domain is updated to the file distribution hash table;

Send the requested user file to the sender of the IO request.

In one way, if an IO request for a user file is received from the sending object (for example, the HPC processing layer). That is, the corresponding cache location can be determined based on the preset file distribution hash table (the file distribution hash table records the mapping relationship between each user file and the cache location). And extract cache data from the corresponding cache location and reply to the sending object.

Specifically, the following steps may be included:

Step 3: Cache the requested user file to the cloud high-frequency file storage area in the cloud host, and update the mapping relationship between the requested user file and the high-frequency file storage area into the file distribution hash table.

Step 4. Send the requested user file to the sender of the IO request.

In addition, after matching the requested user file in the IO request with the file distribution hash table and determining the storage location of the requested user file on the cloud host, the embodiment of this application also includes:

If it is determined that the requested user file is stored on the cloud host, determine whether the requested user file is cached in distributed memory;

If cached in distributed memory, the requested user file will be sent to the sender of the IO request;

If it is not cached in distributed memory, the requested user file will be cached in distributed memory, the mapping relationship between the requested user file and distributed memory will be updated to the file distribution hash table, and the requested user file will be sent to the IO requester. sender.

In the embodiment of this application, a multi-level data cache architecture including a cache layer, a cache management layer, and a cache client can be constructed on the cloud host, and a flexible response can be implemented based on the file distribution hash table that records the storage locations of each user's files. Complex IO scenarios on the cloud. In addition, in the embodiment of this application, since the idle resources of the cloud host are used to build the cache layer, It can make full use of cloud resources and effectively alleviate the IO pressure of upper-layer processing.

Embodiments of the present application also provide a cloud data caching system, which includes a data source layer, a cache layer on a cloud host, a cache management and control layer, a cache client on the cloud host, and an HPC processing end, where:

The data source layer includes low-frequency file storage on the cloud, object storage on the cloud, and IDC file storage;

The cache layer includes the file system, distributed memory and virtual disk mounted on the cloud host. The cache layer is used to cache frequently accessed user files;

The cache management and control layer includes the cache configuration center, file access characteristic statistics table and file distribution hash table. The cache management and control layer is used to manage cached user files;

Among them, the cache client is used to provide a data operation interface for the HPC processing layer and process IO requests.

In one approach, the cloud data caching system proposed in this application does not distinguish data sources, which can be user local file storage, cloud low-frequency file storage, or cloud object storage.

Specifically, for low-frequency file storage on the cloud in the data source layer, the transmission of user file data between various cloud hosts can be supported based on file system mounting. In addition, for cloud object storage, it can support the transmission and access of user file data between cloud hosts through specific API interfaces, which has certain advantages in data distribution. One approach, which is also used only for less frequently accessed data, is the caching of user files. As for IDC file storage, its file storage located in the user's local computer room supports the user's local processing on the one hand, and on the other hand is connected to the cloud network through a dedicated line/VPN.

Furthermore, for the file system mounted on the cloud host in the cache layer, it can be used for persistent global cache to support the transmission of user file data between cloud hosts, and its performance is stronger than low-frequency file storage. In this embodiment of the present application, it can be used to cache files in the data source that are accessed more frequently, have larger file data volumes, and frequently change data, such as core user files for HPC tasks.

Additionally, for disk caching, it can be used to persist local cache layers. In the embodiment of this application, the free disk space of the cloud host is used to cache files in data sources with high access frequency, small file data volume, and infrequent data changes, such as: processing software, program plug-ins, pre- and post-processing scripts, etc. In one method, the disk cache capacity can be expanded by adding a data disk to the cloud host.

In addition, for distributed memory, it can be used in the non-persistent global cache layer and use cloud host memory. In the embodiment of this application, the free memory of multiple cloud hosts builds a memory file system through tmpfs/ramdisk and other methods to form a distributed memory cache layer, and is uniformly managed by the cache management and control layer to cache data sources that are frequently used. The file block being accessed. The more cloud hosts there are during peak user processing periods, the larger the memory cache space will be.

Furthermore, for the cache configuration center of the cache management and control layer in the cloud data cache system of this application, it can be Used to maintain cache configuration information of cached data in the system and provide cache control interfaces to users. In one method, users can turn on and off each cache layer through interaction with the cache configuration center, so as to cooperate with the cache client to achieve the effect of retrieving user files at any time. In addition, the cache cold data cleaning strategy can also be implemented, such as regularly cleaning data with low data access frequency based on the memory/disk occupancy ratio and file popularity. Furthermore, caching strategies can also be customized, such as data prefetching based on specific file names.

Optionally, the file distribution hash table is used to maintain the storage location of user files/user file blocks in the cache layer, supporting HPC Workload to efficiently obtain target files/file blocks, and ensuring the efficiency of upper-layer processing.

Optionally, the cloud data caching system in this application can also include:

Cache low-frequency access user files to low-frequency file storage on the cloud, where low-frequency access user files are user files whose access frequency is lower than the preset frequency.

Optionally, the cloud data caching system in this application can also include:

Among the frequently accessed user files obtained from the data source layer, cached user files whose data volume is not less than the preset space threshold are stored in distributed memory;

Among the frequently accessed user files obtained from the data source layer, the user files whose data volume is lower than the preset space threshold and whose modification frequency is lower than the preset frequency are stored in the virtual disk;

Among the frequently accessed user files obtained from the data source layer, the user files whose data volume is lower than the preset space threshold and whose modification frequency is not lower than the preset frequency are stored in the file system mounted on the cloud host.

Optionally, the cloud data caching system in this application can also include:

The cache configuration center is used to maintain cache configuration information and provide cache control interfaces to users;

The file access characteristics statistics table is used to collect file access characteristics of the HPC processing layer, including file access characteristics including file access popularity, access mode, and data volume of user files.

The cloud data caching system provided by the above embodiments of the present application and the cloud data caching method provided by the embodiments of the present application are based on the same inventive concept, and have the same advantages as the methods adopted, run or implemented by the applications stored therein. beneficial effect.

An embodiment of the present application also provides a cloud data caching device, which is configured to perform operations performed by the cloud data caching method provided in any of the above embodiments. As shown in Figure 3, the device includes:

Deployment module 201 is used to cache user files through the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host;

The response module 202 is configured to, when obtaining a user file to be cached, determine the hierarchical characteristics corresponding to the user file, and cache the user file to a storage area of the corresponding level based on the hierarchical characteristics. The hierarchical characteristics Including at least one of access frequency, modification frequency and data volume, the different levels of storage areas include distributed memory of the cloud host, virtual disks and file systems mounted on the cloud host.

The response module 202 is specifically configured to determine that the user file is a low-frequency access user file, and cache the low-frequency access user file to the cloud low-frequency file storage area in the cloud host. The low-frequency access user file is a low-frequency access user file. User file at the first preset frequency;

The response module 202 is specifically configured to determine that the user file is a user file with high frequency access, and cache the user file with high frequency access to the storage area of the corresponding level. The user file with high frequency access is an access frequency no less than The user file of the first preset frequency.

The response module 202 is specifically configured to cache the user files corresponding to high-frequency access to user files, the data volume is not lower than the preset space threshold, and the modification frequency is not lower than the second preset frequency to the cloud host. In the mounted file system;

The response module 202 is specifically configured to cache the user files corresponding to high-frequency access user files and whose data volume is lower than the preset space threshold into the virtual disk or the distributed memory.

The response module 202 is specifically used to determine the modification frequency corresponding to the cached user files that correspond to high-frequency access user files and whose data volume is lower than the preset space threshold;

The response module 202 is specifically configured to store cached user files whose modification frequency is lower than the second preset frequency into the virtual disk; or,

The response module 202 is specifically configured to store cached user files whose modification frequency is not lower than the second preset frequency into the distributed memory.

The deployment module 201 is specifically configured to periodically determine the access popularity value of each user file based on the cache configuration information set by the user and the file access characteristics of the user file in online statistics;

The deployment module 201 is specifically used to clean up user files whose access popularity value is lower than the preset popularity value.

The deployment module 201 is specifically configured to periodically determine the space size change of each user file based on the cache configuration information set by the user and the file access characteristics of the user file according to online statistics;

The deployment module 201 is specifically configured to cache each user file to a different storage area based on the change in space size of each user file. The storage area includes one of the disk cache and the distributed cache.

Deployment module 201 is specifically used to receive IO requests for user files;

Deployment module 201 is specifically used to match the requested user file in the IO request with a file distribution hash table, and determine the storage location of the requested user file on the cloud host. The file distribution hash table includes user Mapping relationship between files and cache locations;

The deployment module 201 is specifically configured to obtain the requested user file through the associated data source on the cloud host if it is determined that the requested user file is not stored on the cloud host;

The deployment module 201 is specifically configured to cache the requested user file to the storage area of the corresponding level, and update the mapping relationship between the requested user file and the storage area of the corresponding level into the file distribution hash table;

The deployment module 201 is specifically configured to send the requested user file to the sender of the IO request.

The cloud data caching device provided by the above embodiments of the present application and the cloud data caching method provided by the embodiments of the present application are based on the same inventive concept, and have the same beneficial effects as the methods adopted, run or implemented by the applications stored therein. .

An embodiment of the present application also provides an electronic device to execute the above cloud data caching method. Please refer to FIG. 4 , which shows a schematic diagram of an electronic device provided by some embodiments of the present application. As shown in Figure 4, the electronic device 3 includes: a processor 300, a memory 301, a bus 302 and a communication interface 303. The processor 300, the communication interface 303 and the memory 301 are connected through the bus 302; the memory 301 stores available data. A computer program runs on the processor 300. When the processor 300 runs the computer program, it executes the cloud data caching method provided in any of the previous embodiments of this application.

Among them, the memory 301 may include high-speed random access memory (RAM: Random Access Memory), and may also include non-volatile memory (non-volatile memory), such as at least one disk memory. The communication connection between the device network element and at least one other network element is realized through at least one communication interface 303 (which can be wired or wireless), and the Internet, wide area network, local network, metropolitan area network, etc. can be used.

The bus 302 may be an ISA bus, a PCI bus, an EISA bus, etc. The bus can be divided into address bus, data bus, control bus, etc. The memory 301 is used to store a program. After receiving the execution instruction, the processor 300 executes the program. The cloud data caching method disclosed in any of the embodiments of the present application can be applied to the processor 300 , or implemented by the processor 300 .

The processor 300 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 300 . above The processor 300 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), application specific integrated circuit (ASIC), etc. ), off-the-shelf programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components. Each method, step and logical block diagram disclosed in the embodiment of this application can be implemented or executed. A general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc. The steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field. The storage medium is located in the memory 301. The processor 300 reads the information in the memory 301 and completes the steps of the above method in combination with its hardware.

The electronic device provided by the embodiments of this application and the cloud data caching method provided by the embodiments of this application are based on the same inventive concept, and have the same beneficial effects as the methods adopted, run or implemented.

The embodiment of the present application also provides a computer-readable storage medium corresponding to the cloud data caching method provided in the previous embodiment. Please refer to Figure 5. The computer-readable storage medium shown is an optical disk 30, on which is stored A computer program (i.e., a program product). When the computer program is run by a processor, the computer program will execute the cloud data caching method provided by any of the foregoing embodiments.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), and other types of random access memory. Access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other optical and magnetic storage media will not be described in detail here.

The computer-readable storage medium provided by the above embodiments of the present application is based on the same inventive concept as the cloud data caching method provided by the embodiments of the present application, and has the same beneficial effects as the methods used, run or implemented by the applications stored therein. .

It should be noted:

In the instructions provided here, a number of specific details are described. However, it is understood that embodiments of the present application may be practiced without these specific details. In some instances, well-known structures and techniques are not shown in detail so as not to obscure the understanding of this description.

Similarly, it will be understood that in the above description of exemplary embodiments of the present application, various features of the present application are sometimes grouped together into a single embodiment, in order to streamline the application and assist in understanding one or more of the various inventive aspects. figure, or its description. However, the disclosed methods should not be interpreted to reflect the following schematic diagram: i.e. the claimed The application for protection requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this application.

Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the present application. within and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The above are only preferred specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the technical field can easily think of changes or modifications within the technical scope disclosed in the present application. Replacements shall be covered by the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

A data caching method on the cloud, characterized in that the method includes:

Cache user files through the cloud host's distributed memory, virtual disk, and file system mounted on the cloud host;

When the user file to be cached is obtained, the hierarchical characteristics corresponding to the user file are determined, and based on the hierarchical characteristics, the user files are cached to the storage area of the corresponding level. The hierarchical characteristics include access frequency and modification frequency. And at least one of the amount of data, the different levels of storage areas include the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host.
The method according to claim 1, characterized in that, based on the hierarchical characteristics, caching the user files to a storage area of a corresponding level includes:

Determine that the user file is a low-frequency access user file, and cache the low-frequency access user file to the cloud low-frequency file storage area in the cloud host. The low-frequency access user file is an access frequency lower than the first preset frequency. user files;

Determine that the user file is a high-frequency access user file, cache the high-frequency access user file to the storage area of the corresponding level, and the high-frequency access user file is an access frequency not lower than the first preset frequency user files.
The method according to claim 2, wherein determining that the user file is a high-frequency access user file and caching the high-frequency access user file to the storage area of the corresponding level includes:

Cache the user files corresponding to high-frequency access to user files, the data volume is not lower than the preset space threshold, and the modification frequency is not lower than the second preset frequency into the file system mounted on the cloud host;

The user files corresponding to frequently accessed user files and whose data volume is lower than the preset space threshold are cached in the virtual disk or the distributed memory.
The method according to claim 3, characterized in that, the user files corresponding to high-frequency access user files and the data volume is lower than the preset space threshold are cached to the virtual disk or the distributed in memory, including:

Determine the modification frequency corresponding to the cached user files that correspond to high-frequency access user files and whose data volume is lower than the preset space threshold;

Store cached user files whose modification frequency is lower than the second preset frequency into the virtual disk; or,

Store cached user files in which the modification frequency is not lower than the second preset frequency into the distributed memory.
The method according to claim 1, characterized in that, after caching the user file to a storage area of a corresponding level, it further includes:

Periodically determine the access popularity value of each user file according to the cache configuration information set by the user and the file access characteristics of the user file according to online statistics;

Clean up user files whose access heat value is lower than the preset heat value.
The method according to claim 1 or 5, characterized in that, after caching the user file to a storage area of a corresponding level, it further includes:

Periodically determine changes in the space size of each user file based on the cache configuration information set by the user and the file access characteristics of the user files collected online;

Based on the change in space size of each user file, the user files are cached in different storage areas, and the storage area includes one of the disk cache and the distributed cache.
The method according to claim 1, characterized in that, after caching the user file to a storage area of a corresponding level, it further includes:

Receive IO requests for user files;

Match the requested user file in the IO request with a file distribution hash table to determine the storage location of the requested user file on the cloud host. The file distribution hash table includes a mapping relationship between user files and cache locations. ;

If it is determined that the requested user file is not stored on the cloud host, obtain the requested user file through the associated data source on the cloud host;

Cache the requested user file to the storage area of the corresponding level, and update the mapping relationship between the requested user file and the storage area of the corresponding level into the file distribution hash table;

Send the requested user file to the sender of the IO request.
A data caching system on the cloud, characterized in that the system includes a data source layer, a cache layer on a cloud host, a cache management and control layer, a cache client on the cloud host and an HPC processing end, wherein:

The data source layer includes cloud low-frequency file storage, cloud object storage and IDC file storage;

The cache layer includes a file system, distributed memory and virtual disk mounted on the cloud host. The cache layer is used to cache frequently accessed user files;

The cache management and control layer includes a cache configuration center, a file access characteristic statistics table and a file distribution hash table. The cache management and control layer is used to manage cached user files;

Wherein, the cache client is used to provide a data operation interface for the HPC processing layer and process IO requests.
The system according to claim 8, characterized in that the data source layer includes:

Cache low-frequency access user files into the low-frequency file storage on the cloud, where the low-frequency access user files are user files whose access frequency is lower than a preset frequency.
The system according to claim 8, characterized in that the caching layer includes:

Among the frequently accessed user files obtained from the data source layer, cached user files whose data volume is not less than the preset space threshold are stored in the distributed memory;

Among the frequently accessed user files obtained from the data source layer, the user files whose data volume is lower than the preset space threshold and whose modification frequency is lower than the preset frequency are stored in the virtual disk;

Among the frequently accessed user files obtained from the data source layer, the user files whose data volume is lower than the preset space threshold and whose modification frequency is not lower than the preset frequency are stored in the file mounted on the cloud host. in the file system.
The system according to claim 8, characterized in that the cache management and control layer includes:

The cache configuration center is used to maintain cache configuration information and provide cache control interfaces to users;

The file access characteristic statistics table is used to collect file access characteristics of the HPC processing layer, including file access characteristics including file access popularity, access mode, and data volume of user files.
A data caching device on the cloud, characterized in that the device includes:

Deployment module, used to cache user files through the distributed memory of the cloud host, the virtual disk, and the file system mounted on the cloud host;

A response module configured to, when obtaining a user file to be cached, determine the hierarchical characteristics corresponding to the user file, and cache the user file to a storage area of a corresponding level based on the hierarchical characteristics, where the hierarchical characteristics include At least one of access frequency, modification frequency and data volume, the different levels of storage areas include distributed memory of the cloud host, virtual disks and file systems mounted on the cloud host.
An electronic device, including a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor runs the computer program to implement any one of 1-11 method described in the item.
A computer-readable storage medium on which a computer program is stored, characterized in that the program is executed by a processor to implement the method described in any one of 1-11.