WO2023029610A1 - Data access method and device, and storage medium - Google Patents


Info

Publication number
WO2023029610A1
Authority
WO
WIPO (PCT)
Prior art keywords
application
cache
data
task
directory
Prior art date
Application number
PCT/CN2022/095010
Other languages
French (fr)
Chinese (zh)
Inventor
李秀桥
孙宏伟
丁肇辉
高帅
江喆
陈强
Original Assignee
超聚变数字技术有限公司
Application filed by 超聚变数字技术有限公司
Publication of WO2023029610A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0871 Allocation or management of cache space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • the present application relates to the technical field of data processing, in particular to a data access method, device and storage medium.
  • a large-scale cluster system is usually used to provide a shared application execution environment for multiple users.
  • the cluster system usually includes a management node and multiple computing nodes. For any application to be run, the management node may assign a corresponding computing node to the application, and the computing node then runs the application. In this model, the user cannot control how the application uses the computing node's resources, so application performance may fail to meet user requirements.
  • the present application provides a data access method, device, and storage medium, which can control application resource usage according to user requirements, so that application performance meets user requirements. The technical solution is as follows:
  • in a first aspect, a data access method is provided. The method includes: receiving a cache configuration request for a first application submitted by a user, where the cache configuration request includes a cache policy and mapping directory information, the cache policy is used to indicate the caching requirements of the first application, and the mapping directory information is information about the first directory, in the storage system, where the application data of the first application is stored; scheduling cache resources for the first application according to the cache policy; prefetching the application data of the first application into the cache resources of the first application according to the mapping directory information; and, during the running of the first application, accessing the cache resources of the first application according to the cache policy.
  • the cache resource is scheduled for the first application according to the cache policy submitted by the user for the first application, and the data is prefetched into the cache resource of the first application according to the mapping directory information submitted by the user. Subsequently, during the running of the first application, the cache resource of the first application is accessed according to the cache policy. It can be seen that the data access method provided by the present application can perceive the user's needs, and then control the resource usage of the application according to the user's needs, thereby improving the application performance.
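The steps above (receive the request, schedule cache resources, prefetch the application data) can be sketched as a small request object plus a handler that chains the two later steps. All class, field, and function names here are illustrative assumptions, not anything defined by the application.

```python
from dataclasses import dataclass


@dataclass
class CacheConfigRequest:
    """User-submitted cache configuration for a first application (illustrative fields)."""
    app_id: str
    cache_policy: dict     # caching requirements, e.g. per-task sizes and media types
    mapping_directory: str # path of the first directory in the storage system


def handle_cache_config(request, schedule, prefetch):
    """Receive the request, schedule cache resources, then prefetch the app data."""
    resources = schedule(request.app_id, request.cache_policy)
    prefetch(request.app_id, request.mapping_directory, resources)
    return resources


# Minimal stand-ins for the scheduling and prefetch steps.
def demo_schedule(app_id, policy):
    return {task: info["size"] for task, info in policy["tasks"].items()}


def demo_prefetch(app_id, directory, resources):
    pass  # a real implementation would read `directory` from the storage system


req = CacheConfigRequest(
    app_id="app-1",
    cache_policy={"tasks": {"t0": {"size": 64, "medium": "DRAM"}}},
    mapping_directory="/storage/app-1",
)
print(handle_cache_config(req, demo_schedule, demo_prefetch))
```

The handler takes the scheduling and prefetch steps as parameters, mirroring how the method's later paragraphs refine each step separately.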
  • the implementation process of scheduling cache resources for the first application according to the cache policy includes: determining, according to the cache policy, the resource requirement information of each of the multiple tasks of the first application, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included; and allocating cache space for each task according to the resource requirement information of that task.
  • the first application can be divided into multiple tasks.
  • the user can specify the resource requirement information of each task of the first application, so as to control the cluster system to allocate cache space for the corresponding task according to that information.
  • the resource requirement information of different tasks may be the same or different.
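A minimal sketch of the per-task allocation just described, assuming a simple dict-based shape for the resource requirement information (task id to required size and medium type) and for a node's free capacity per medium; the names and shapes are illustrative.

```python
def allocate_cache_space(tasks, node_free):
    """Allocate cache space per task from a node's free capacity per medium.

    `tasks` maps task id -> {"size": bytes, "medium": type}; `node_free`
    maps medium type -> free bytes (mutated as space is handed out).
    """
    allocations = {}
    for task_id, req in tasks.items():
        medium, size = req["medium"], req["size"]
        if node_free.get(medium, 0) < size:
            raise RuntimeError(f"insufficient {medium} for task {task_id}")
        node_free[medium] -= size          # reserve the space on this node
        allocations[task_id] = (medium, size)
    return allocations


free = {"DRAM": 128, "SCM": 512}
tasks = {"t0": {"size": 64, "medium": "DRAM"},
         "t1": {"size": 256, "medium": "SCM"}}
print(allocate_cache_space(tasks, free))  # per-task (medium, size) grants
print(free)                               # remaining capacity per medium
```

Note that tasks with identical resource requirement information simply receive equal grants, matching the point above that requirements may be the same or different across tasks.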
  • the mapping directory information includes a directory path of the first directory, and the implementation process of prefetching the application data of the first application into the cache resources of the first application according to the mapping directory information includes: determining the directory identifier of the subdirectory corresponding to each of the multiple tasks of the first application; obtaining, from the storage system, the data in the subdirectory corresponding to each task stored under the first directory, according to the directory path of the first directory and the directory identifier of the subdirectory corresponding to each task; and storing the data in the subdirectory corresponding to each task in the cache resources of the first application.
  • the data of the first application is prefetched into the cache resource allocated for the first application through the mapping directory information specified by the user. In this way, the user does not need to perform explicit data copying, which reduces the complexity of the user's use.
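The prefetch step (directory path plus per-task subdirectory identifiers, fetched from the storage system into the application's cache resources) might look like the following sketch, where `read_dir` stands in for a storage-system read; all names are assumptions.

```python
import posixpath


def prefetch_app_data(first_dir, task_subdirs, read_dir, cache):
    """Prefetch each task's subdirectory from the storage system into the cache.

    `task_subdirs` maps task id -> subdirectory identifier; `read_dir(path)`
    stands in for a storage-system read returning {filename: data}; `cache`
    is the application's cache resource keyed by task.
    """
    for task_id, subdir in task_subdirs.items():
        path = posixpath.join(first_dir, subdir)  # first directory + subdirectory id
        cache[task_id] = read_dir(path)           # store under the task's cache space
    return cache


# A toy storage system: directory path -> {file: contents}.
storage = {"/data/app1/t0": {"a.bin": b"\x00"},
           "/data/app1/t1": {"b.bin": b"\x01"}}
cache = prefetch_app_data("/data/app1", {"t0": "t0", "t1": "t1"},
                          lambda p: dict(storage[p]), {})
```

Because the prefetch is driven entirely by the mapping directory information, the user never issues an explicit copy, which is the point made above.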
  • the implementation process of accessing the cache resources of the first application according to the cache policy includes: when the cache policy includes a hierarchical cache policy, caching the different types of task data of each task of the first application in the corresponding type of storage medium in the cache resources of the first application according to the hierarchical cache policy; and when the cache policy includes a data consistency policy, performing a lock operation on the accessed task data whenever task data in any cache space is accessed.
  • data can be cached and accessed in the cache resources of the first application according to the data caching and access policies in the cache policy specified by the user; for example, caching data according to the hierarchical cache policy improves data access performance and reduces the resource consumption of the cache space.
  • the data is accessed according to the data consistency policy to ensure the accuracy of the data during the data access process.
  • users can flexibly customize other policies to achieve flexible settings of data caching and access methods.
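A toy illustration of the two policies just described: a tiered per-task cache that places data in a storage medium by data type (hierarchical cache policy) and takes a lock around every access (data consistency policy). The tier mapping and class shape are illustrative assumptions, not the claimed design.

```python
import threading


class TaskCache:
    """Tiered per-task cache with a lock honoring the consistency policy."""

    # Hierarchical cache policy (assumed mapping): data type -> storage medium.
    TIER_BY_TYPE = {"hot": "DRAM", "warm": "SCM", "cold": "SSD"}

    def __init__(self):
        self.tiers = {"DRAM": {}, "SCM": {}, "SSD": {}}
        self._lock = threading.Lock()

    def put(self, key, value, data_type):
        # Place the data in the medium matching its type.
        with self._lock:
            self.tiers[self.TIER_BY_TYPE[data_type]][key] = value

    def get(self, key):
        # Data consistency policy: hold the lock while task data is accessed.
        with self._lock:
            for tier in self.tiers.values():
                if key in tier:
                    return tier[key]
            return None


c = TaskCache()
c.put("k1", b"v1", "hot")
print(c.get("k1"))
```

A coarse single lock is used here for brevity; a real cache would likely lock per cache space or per entry, which the method leaves open.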
  • before the cache resources of the first application are accessed according to the cache policy, the method further includes: obtaining an input/output (IO) request; and if the data accessed by the IO request is data in the first directory indicated by the mapping directory information, performing the step of accessing the cache resources of the first application according to the cache policy.
  • by setting the mapping directory information, accesses to the first directory indicated by the mapping directory information can be intercepted directly and served from the cache resources allocated for the first application. This improves access efficiency, and the whole process is transparent to the user.
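The interception step can be sketched as a simple path-prefix check that routes IO requests under the mapped first directory to the cache and everything else to the storage system; the function names are assumptions.

```python
def route_io(io_path, first_dir, cache_read, storage_read):
    """Route an IO request: serve it from the application's cache resources
    when it targets the mapped first directory, otherwise fall through to
    the storage system."""
    mapped = io_path == first_dir or io_path.startswith(first_dir.rstrip("/") + "/")
    return cache_read(io_path) if mapped else storage_read(io_path)


source, path = route_io("/data/app1/t0/a.bin", "/data/app1",
                        cache_read=lambda p: ("cache", p),
                        storage_read=lambda p: ("storage", p))
print(source)  # the mapped request is served from the cache
```

The prefix check guards against false matches such as `/data/app1suffix`, which shares a string prefix with the mapped directory but lies outside it.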
  • the method further includes: acquiring the bandwidth requirements of the data to be migrated to the storage system in the cache resources of each of the multiple applications, the multiple applications including the first application; according to the bandwidth requirement, allocate IO bandwidth for the data to be migrated in the cache resource of the first application; according to the IO bandwidth, store the data to be migrated in the cache resource of the first application to the in the storage system.
  • when any computing node detects that the amount of data in the cache resources it has allocated for an application reaches the second threshold, it may request that the management node allocate bandwidth for the data to be migrated for the application it is running.
  • the management node can allocate IO bandwidth for migrating the data to be migrated for the applications running on each computing node, so as to control the amount of data each computing node migrates to the storage system. This prevents the application data access volume of different computing nodes from exceeding the available bandwidth of the storage system, which would cause IO bandwidth competition.
  • This enables applications in the global view to access the storage system in an orderly manner, thereby reducing application performance problems caused by IO competition.
  • the cluster system can complete the data copy automatically, without the user having to perform the copy themselves, which reduces the complexity of the user's operation.
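One way the management node could allocate IO bandwidth against the storage system's available bandwidth is to scale the collected demands proportionally; proportional sharing is an illustrative choice here, since the method only requires that the grants stay within the available bandwidth.

```python
def allocate_io_bandwidth(demands, available):
    """Split the storage system's available IO bandwidth across applications.

    `demands` maps app id -> requested bandwidth. If total demand fits,
    grant it in full; otherwise scale every grant proportionally so the
    sum never exceeds `available`.
    """
    total = sum(demands.values())
    if total <= available:
        return dict(demands)
    scale = available / total
    return {app: bw * scale for app, bw in demands.items()}


grants = allocate_io_bandwidth({"app1": 600, "app2": 600}, available=900)
print(grants)  # each app scaled down so the grants sum to 900
```

Because the sum of grants is capped at the available bandwidth, no combination of per-node migrations can oversubscribe the storage system, which is exactly the IO-competition problem described above.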
  • in a second aspect, a data access method is provided. The method includes: a management node receives a cache configuration request for a first application submitted by a user, where the cache configuration request includes a cache policy and mapping directory information, the cache policy is used to indicate the caching requirements of the first application, and the mapping directory information is information about the first directory, in the storage system, where the application data of the first application is stored; the management node schedules cache resources for the first application from a target computing node according to the cache policy; and the management node controls the target computing node to prefetch the application data of the first application into the cache resources of the first application according to the mapping directory information, and, through the cache policy, controls the target computing node to access the cache resources of the first application during the running of the first application.
  • the management node schedules cache resources for the first application according to the cache policy submitted by the user for the first application, and controls the computing node to prefetch data into the cache resources of the first application according to the mapping directory information submitted by the user. Subsequently, during the running of the first application, the management node controls the computing node to access the cache resources of the first application according to the cache policy. It can be seen that the embodiment of the present application can sense the user's requirements and then control the application's resource usage accordingly, thereby improving application performance.
  • the management node schedules cache resources for the first application from the target computing node according to the cache policy as follows: the management node obtains, from the cache policy, the resource requirement information of each of the multiple tasks of the first application, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included; the management node allocates, according to the resource requirement information, a target computing node for executing each task of the first application; and the management node sends the cache policy to the target computing node to instruct it to allocate, from its own cache space, cache space for the corresponding tasks according to the resource requirement information in the cache policy.
  • the management node can control the first application's use of computing node resources according to the cache policy specified by the user, thereby giving the user control over resource usage on the computing nodes, so that the application performance of the first application can better meet the user's requirements.
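The management node's task-to-node assignment described above can be sketched as a first-fit search over the computing nodes' free cache capacity per medium; first-fit is an illustrative stand-in for the scheduler, and the data shapes are assumptions.

```python
def assign_compute_nodes(tasks, nodes_free):
    """Pick a target computing node for each task of the first application.

    `tasks` maps task id -> {"size": bytes, "medium": type}; `nodes_free`
    maps node id -> {medium: free bytes}. First node with enough free
    capacity of the right medium wins.
    """
    placement = {}
    for task_id, req in tasks.items():
        for node_id, free in nodes_free.items():
            if free.get(req["medium"], 0) >= req["size"]:
                free[req["medium"]] -= req["size"]  # reserve on the chosen node
                placement[task_id] = node_id
                break
        else:
            raise RuntimeError(f"no node can host task {task_id}")
    return placement


nodes = {"n1": {"DRAM": 64}, "n2": {"DRAM": 128, "SCM": 512}}
tasks = {"t0": {"size": 64, "medium": "DRAM"},
         "t1": {"size": 256, "medium": "SCM"}}
print(assign_compute_nodes(tasks, nodes))
```

After placement, the management node would send the cache policy to each chosen node, which then carves the task's cache space out of its own media, as the paragraph above describes.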
  • the mapping directory information includes a directory path of the first directory. The process by which the management node controls the target computing node to prefetch the application data of the first application into the cache resources of the first application according to the mapping directory information may be: the management node sends the directory path of the first directory to the target computing node, instructing the target computing node to prefetch from the storage system, according to the directory path, the data in the subdirectory of each task stored under the first directory, and to store the obtained data in the cache resources of the first application.
  • the management node controls the computing node to prefetch the data of the first application into the cache resources allocated for the first application through the mapping directory information specified by the user. In this way, the user does not need to perform explicit data copying, which reduces the complexity of use.
  • the method further includes: the management node receives the bandwidth requirements, sent by multiple computing nodes, of the data to be migrated to the storage system from the cache resources of each application, where the multiple computing nodes include the target computing node; allocates IO bandwidth for the data to be migrated in the cache resources of the first application according to the bandwidth requirements; and sends the IO bandwidth allocated for the first application to the target computing node, instructing the target computing node to store the data to be migrated in the cache resources of the first application into the storage system according to that IO bandwidth.
  • the management node can allocate, according to the collected bandwidth requirements of each computing node, the IO bandwidth used to migrate the data to be migrated for the applications running on each computing node, so as to control the amount of data each computing node migrates to the storage system. This prevents the application data access volume of different computing nodes from exceeding the available bandwidth of the storage system, which would cause IO bandwidth competition, and enables applications, in the global view, to access the storage system in an orderly manner, thereby reducing application performance problems caused by IO competition.
  • in a third aspect, a data access method is provided. The method includes: a computing node receives a cache policy and mapping directory information of a first application specified by a user, where the cache policy is used to indicate the caching requirements of the first application, and the mapping directory information is information about the first directory, in the storage system, where the application data of the first application is stored; the computing node allocates cache resources for the first application according to the cache policy, and prefetches the application data of the first application into the cache resources of the first application according to the mapping directory information; and the computing node accesses the cache resources of the first application according to the cache policy during the running of the first application.
  • the computing node can allocate corresponding cache resources for the first application according to the cache policy specified by the user, so that the first application's use of the computing node's resources meets the user's requirements, and the application performance of the first application can then meet the user's needs.
  • data access can be performed directly in the cache resources allocated by the computing node for the first application, which reduces accesses to the storage system and thereby reduces competition between computing nodes.
  • the computing node can prefetch, according to the mapping directory information, the application data of the first application under the first directory in the storage system into the first cache space, without manual data copying by the user, which reduces operational complexity.
  • when allocating cache resources for the first application according to the cache policy, the computing node obtains, from the cache policy, the resource requirement information of each of the multiple tasks of the first application, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included, and allocates cache space to a first task, where the first task is any one of the multiple tasks running on the computing node.
  • the implementation process of prefetching the application data of the first application into the cache resources of the first application according to the mapping directory information includes: determining the directory identifier of the subdirectory corresponding to the first task; obtaining, from the storage system, the data in the subdirectory corresponding to the first task stored under the first directory, according to the directory path of the first directory and the directory identifier of that subdirectory; and storing the data in the subdirectory corresponding to the first task in the cache space of the first task.
  • the implementation process of storing the data in the subdirectory corresponding to the first task in the cache space of the first task includes: storing different data in different types of storage media according to the data type of the data in the subdirectory corresponding to the first task.
  • data is cached according to a hierarchical cache policy, so that different types of data can be stored in appropriate storage media, thereby improving data access performance and saving resource consumption of cache space.
  • the implementation process of accessing the cache resources of the first application may include: obtaining an IO request, and if the data accessed by the IO request is data in the first directory indicated by the mapping directory information, accessing the cache resources of the first application according to the IO request and the cache policy.
  • by setting the mapping directory information, accesses to the first directory indicated by the mapping directory information can be intercepted directly and served from the cache resources allocated for the first application. This improves access efficiency, and the whole process is transparent to the user.
  • when the cache policy includes a data consistency policy and the computing node accesses data in the cache resources of the first application, a lock operation is performed on the accessed data, so as to ensure the accuracy of the data during the data access process.
  • when the computing node detects that the amount of data in the cache resources it has allocated for the first application reaches a reference threshold, it sends the bandwidth requirement of the first application to the management node, where the bandwidth requirement is used to indicate the bandwidth required for migrating the data to be migrated in the cache resources of the first application on the computing node to the storage system; then, according to the IO bandwidth allocated for the data to be migrated, the computing node migrates the data to be migrated in the cache resources of the first application to the storage system.
  • the computing node can request the management node to allocate IO bandwidth for the first application by sending the first application's bandwidth requirement to the management node. Since the management node collects the bandwidth requirements of all computing nodes at the same time, migrating data according to the IO bandwidth it allocates allows the computing nodes to avoid IO bandwidth competition, and enables applications, in the global view, to access the storage system in an orderly manner, thereby reducing application performance problems caused by IO competition.
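The threshold-triggered request on the computing-node side can be sketched as follows; the ratio-based threshold and the function names are illustrative assumptions.

```python
def maybe_request_migration(cache_used, cache_capacity, threshold_ratio,
                            dirty_bytes, send_demand):
    """When cached data reaches the reference threshold, ask the management
    node for IO bandwidth to migrate the dirty data to the storage system.

    `send_demand(n_bytes)` stands in for the message to the management node.
    Returns True when a request was sent.
    """
    if cache_used >= threshold_ratio * cache_capacity:
        send_demand(dirty_bytes)  # report the bandwidth requirement upward
        return True
    return False


sent = []
fired = maybe_request_migration(cache_used=90, cache_capacity=100,
                                threshold_ratio=0.8, dirty_bytes=40,
                                send_demand=sent.append)
print(fired, sent)
```

After the management node replies with an IO bandwidth grant, the node would throttle its writes to the storage system to that rate, which is the step the paragraph above describes.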
  • in a fourth aspect, a data access device is provided, which has the function of implementing the behavior of the data access method in the first aspect above. The data access device includes at least one module, and the at least one module is used to implement the data access method provided in the first aspect above.
  • in a fifth aspect, a data access device is provided, which has the function of implementing the behavior of the data access method in the second aspect above. The data access device includes at least one module, and the at least one module is used to implement the data access method provided in the second aspect above.
  • in a sixth aspect, a data access device is provided, which has the function of implementing the behavior of the data access method in the third aspect above. The data access device includes at least one module, and the at least one module is used to implement the data access method provided in the third aspect above.
  • in a seventh aspect, a cluster system is provided, which includes a management node and a computing node. The management node and the computing node each include a processor and a memory, the memory is used to store a program that supports the cluster system in executing the data access methods above, and the processor is configured to execute the program stored in the memory.
  • in an eighth aspect, a management node is provided. The management node includes a processor and a memory, the memory is used to store a program that supports the management node in executing the data access method provided in the second aspect above and to store the data involved in implementing that method, and the processor is configured to execute the program stored in the memory.
  • in a ninth aspect, a computing node is provided. The computing node includes a processor and a memory, the memory is used to store a program that supports the computing node in executing the data access method provided in the third aspect above and to store the data involved in implementing that method, and the processor is configured to execute the program stored in the memory.
  • a computer-readable storage medium is provided, in which instructions are stored; when the instructions are run on a computer, the computer is caused to execute the data access method described in the first aspect, the second aspect, or the third aspect above.
  • a computer program product containing instructions is provided; when the instructions are run on a computer, the computer is caused to execute the data access method described in the first aspect, the second aspect, or the third aspect above.
  • FIG. 1 is a system architecture diagram of a data center provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application;
  • FIG. 3 is a flowchart of a data access method provided by an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of a data access device provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of another data access device provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of another data access device provided by an embodiment of the present application.
  • the data access method provided by the present application can be applied to a data center, and the data center can provide a shared application execution environment for multiple users.
  • the applications running in the data center may be data-intensive applications such as high-performance computing applications and big data applications.
  • the data center includes a cluster system 10 and a storage system 11 , and a communication connection is established between the cluster system 10 and the storage system 11 .
  • the cluster system 10 is used to provide an execution environment for multiple applications
  • the storage system 11 is used to store application data of the multiple applications.
  • the cluster system 10 may include a management node 101 and multiple computing nodes 102. The management node 101 and each computing node 102 may communicate through a wired or wireless network, and the computing nodes 102 may also communicate with each other through a wired or wireless network.
  • the management node 101 is used to assign to the application, according to the cache policy specified by the user, a computing node 102 for executing the application, and to send the cache policy and the mapping directory information specified by the user to that computing node 102.
  • after receiving the user-specified cache policy and mapping directory information from the management node 101, the computing node 102 allocates cache resources for the application from its own cache resources according to the cache policy, and, according to the mapping directory information, prefetches the application's data stored in the storage system 11 into the application's cache resources. After that, the computing node runs the application, and accesses the application's cache resources according to the cache policy during the running of the application.
  • the cache resource of the computing node 102 itself refers to a storage medium included in the computing node 102 .
  • the cache resources of the computing node 102 itself may include its own dynamic random access memory (DRAM), large-capacity storage class memory (SCM), solid state disk (SSD), and other types of storage media, which is not limited in this embodiment of the present application.
  • the number of computing nodes 102 that the management node 101 assigns to the application for executing it may be more than one, so that each computing node 102 can run one or more tasks of the application.
  • the management node 101 can schedule cache resources in the computing nodes 102 for various applications to be run by different users according to user requirements, thereby controlling the corresponding computing nodes 102 to run corresponding applications.
  • the computing node 102 may send to the management node 101 the bandwidth requirement of the application data to be migrated from the application's cache resources to the storage system 11, so as to request the management node 101 to allocate IO bandwidth for the application data to be migrated.
  • after receiving the bandwidth requirements of each application sent by one or more computing nodes 102, the management node 101 can allocate IO bandwidth for the data to be migrated of each application according to those requirements, and deliver the allocated IO bandwidth to the corresponding computing node 102.
  • the computing node 102 may send the application data to be migrated to the storage system 11 for storage according to the IO bandwidth.
  • the storage system 11 includes multiple storage nodes 111 . Wherein, wired or wireless communication may be performed between each storage node 111 and each computing node 102 .
  • Each storage node 111 is used to receive the IO request of the computing node 102, wherein, when the IO request is a read request sent by the computing node 102 according to the mapping directory information, the storage node 111 obtains application data according to the read request and returns it to the computing node 102 , so that the computing node 102 caches the data of the application in the cache resource allocated for the application.
  • the storage node 111 may persistently store the data to be migrated according to the write request.
  • the storage node 111 may include a control unit, a network card, and multiple storage devices.
  • the control unit is used for communicating with the computing node 102 through the network card, and accessing multiple storage devices according to the IO request of the computing node 102 .
  • the multiple storage devices may include large-capacity storage class memory (SCM), solid state disk (SSD), and other types of storage devices, which is not limited in this embodiment of the present application.
  • the data center may also provide users with a login node for submitting cache policies and mapping directory information.
  • the user submits the cache policy and mapping directory information of the application to be run to the management node 101 through the login node, so that the management node 101 schedules resources for the application according to the application cache policy and mapping directory information.
  • Each of the above-mentioned management node 101 , computing node 102 , storage node 111 and login node may be a separate computer device.
  • the login node may be a terminal device, such as a laptop computer, a desktop computer, a tablet computer, a smart phone, and the like.
  • the management node 101 and the computing node 102 may be terminal devices or servers.
  • the storage node 111 may be a server.
  • FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application. Both the management node and the computing node in the system architecture shown in FIG. 1 can be implemented by the computer device.
  • the computer device may include one or more processors 201 , a communication bus 202 , a main memory 203 and one or more communication interfaces 204 .
  • the processor 201 may be a general-purpose central processing unit (central processing unit, CPU), a network processor (network processor, NP), a microprocessor, or may be one or more integrated circuits for realizing the scheme of the present application, such as , application-specific integrated circuit (ASIC), programmable logic device (programmable logic device, PLD) or a combination thereof.
  • the aforementioned PLD may be a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the communication bus 202 is used to transfer information between the aforementioned components.
  • the communication bus 202 can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the main memory 203 may be a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. When the main memory 203 is a RAM, it may be a dynamic random access memory (dynamic random access memory, DRAM), an SCM, or the like.
  • the main memory 203 may exist independently, and is connected to the processor 201 through the communication bus 202 .
  • the main memory 203 can also be integrated with the processor 201 .
  • the communication interface 204 uses any transceiver-like device for communicating with other devices or a communication network.
  • the communication interface 204 includes a wired communication interface, and may also include a wireless communication interface.
  • the wired communication interface may be an Ethernet interface, for example.
  • the Ethernet interface can be an optical interface, an electrical interface or a combination thereof.
  • the wireless communication interface may be a wireless local area network (wireless local area networks, WLAN) interface, a cellular network communication interface or a combination thereof.
  • the computer device may further include other storage media 205, for example, the other storage media 205 may include a mechanical hard disk, a solid state hard disk, and the like.
  • a computer device may include multiple processors, such as processor 201 and processor 206 as shown in FIG. 2 . Each of these processors can be a single-core processor or a multi-core processor.
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data such as computer program instructions.
  • the computer device may further include an output device 207 and an input device 208 .
  • Output device 207 is in communication with processor 201 and can display information in a variety of ways.
  • the output device 207 may be a liquid crystal display (liquid crystal display, LCD), a light emitting diode (light emitting diode, LED) display device, a cathode ray tube (cathode ray tube, CRT) display device, or a projector (projector), etc.
  • the input device 208 communicates with the processor 201 and can receive user input in various ways.
  • the input device 208 may be a mouse, a keyboard, a touch screen device, or a sensory device, among others.
  • the main memory 203 is used to store a kernel, program codes for executing the solution of the present application, and other instructions and data, and the processor 201 can execute the program codes stored in the main memory 203 .
  • the program code may include one or more software modules, and the computer device may implement the data access method provided in the embodiment of FIG. 3 below through the program code in the processor 201 and the main memory 203 .
  • the user can flexibly customize the cache policy for an application, and the cluster system can schedule cache resources for the application according to the cache policy submitted by the user; that is, the cluster system can sense the user's needs and control the application's resource usage accordingly, thereby improving application performance.
  • the cluster system can prefetch data into the cache resource of the application according to the mapping directory information submitted by the user, so as to improve the data access speed and reduce the complexity of the user's use.
  • the management node in the cluster system can allocate IO bandwidth for applications running on the computing nodes by collecting the bandwidth requirements of each computing node, reducing application performance problems caused by IO competition.
  • Fig. 3 is a flow chart of a data access method provided by an embodiment of the present application. The method can be applied to the cluster system in the data center shown in Fig. 1. Referring to Fig. 3, the method includes the following steps:
  • Step 301 The management node receives a cache configuration request for the first application submitted by the user.
  • the cache configuration request includes a cache policy and mapping directory information.
  • the cache policy is used to indicate the cache requirements of the first application, and the mapping directory information is information about the first directory, in the storage system, where the application data of the first application is located.
  • the user inputs the caching policy and mapping directory information for the first application on the login node.
  • the login node generates a cache configuration request for the first application according to the cache policy and mapping directory information input by the user, and sends the cache configuration request to the management node.
  • the cache configuration request carries the cache policy and mapping directory information
  • the first application refers to the application to be run by the user.
  • the management node receives the cache configuration request of the first application sent by the login node.
  • a command line tool is deployed on the login node, and the user may input the caching policy and mapping directory information of the first application in the command line interface of the command line tool displayed on the login node.
  • the login node acquires the cache policy and mapping directory information of the first application input by the user in the command line interface.
  • a service configuration client may also be deployed on the login node, and the user may input the cache policy and mapping directory information of the first application in the interface of the service configuration client displayed on the login node.
  • the login node can obtain the cache policy and mapping directory information of the first application through the service configuration client, and then generate a cache configuration request of the first application.
  • the caching policy of the first application may include resource requirement information and data caching and access policies of the first application.
  • the first application may be divided into multiple tasks to be executed by multiple computing nodes.
  • the resource requirement information of the first application specified by the user may include resource requirement information of each task of the first application.
  • the resource requirement information of each task of the first application may be the same or different.
  • the resource requirement information may include computing resource requirement information and cache resource requirement information.
  • the computing resource requirement information is used to indicate the computing resources required by each task of the first application, for example, the number of processor cores and the clock frequency required to run each task of the first application.
  • the cache resource requirement information includes the size of the cache space required by each task of the first application. In addition, it may also include the type of storage medium included in the cache space required by each task of the first application.
  • for example, the cache space required by each task of the first application may include two different storage media, DRAM and SCM.
  • the cache resource requirement information may also include a topology structure of the cache space required by each task of the first application, that is, a topology structure of storage media at various levels constituting the cache space on corresponding computing nodes.
  • the resource requirement information of the first application may also directly indicate the resource requirements of the first application as a whole; that is, instead of the above-mentioned task-granularity resource requirement information, it may be resource requirement information at the application granularity.
  • the data cache and access policy may be used to indicate the cache mode and access policy for the application data of the first application.
  • the data cache and access policy may include a hierarchical cache policy for instructing to cache different types of application data of the first application in different types of storage media.
  • the data cache and access policy may include a data consistency policy, which is used to indicate that when any data in the cache resource of the first application is accessed, a lock operation is performed on the accessed data to ensure data consistency.
  • the data caching and access policy may also include a security level policy, which is used to indicate the access rights of the data in the cache resources of the application.
  • mapping directory information refers to information of the first directory where the application data of the first application stored in the storage system is located.
  • the mapped directory information may be a directory path of the first directory in the storage system.
  • the mapping directory information may also be other information that can be used to indicate the storage location of the application data of the first application in the storage system, which is not limited in this embodiment of the present application.
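As an illustrative sketch only, the cache configuration request described in step 301 could be represented as a structured object combining the cache policy (task-granularity resource requirements plus the data cache and access policy) and the mapping directory information. All field names, values, and the directory path below are hypothetical, not defined by this application:

```python
# Hypothetical sketch of a cache configuration request for the first
# application; every field name and value here is illustrative only.
cache_config_request = {
    "app_id": "app-first",
    "cache_policy": {
        # resource requirement information, here at task granularity
        "tasks": [
            {
                "task_id": 1,
                "compute": {"cores": 4, "freq_ghz": 2.5},   # computing resource requirement
                "cache": {"size_mb": 2048, "media": ["DRAM", "SCM"]},  # cache resource requirement
            },
            {
                "task_id": 2,
                "compute": {"cores": 2, "freq_ghz": 2.5},
                "cache": {"size_mb": 1024, "media": ["DRAM"]},
            },
        ],
        # data cache and access policy
        "tiered_cache": True,        # hierarchical cache policy
        "consistency": "lock",       # data consistency policy
        "security_level": "private", # security level policy
    },
    # mapping directory information: directory path of the first
    # directory in the storage system (hypothetical path)
    "mapping_directory": "/storage/app-first/input",
}
```

The login node would forward such a request to the management node, which reads the task-level requirements out of `cache_policy` in step 302.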
  • Step 302 The management node allocates a target computing node for executing the first application according to the caching policy.
  • after receiving the cache configuration request of the first application, the management node allocates a target computing node for executing the first application from the multiple computing nodes according to the resource requirement information included in the cache policy in the cache configuration request.
  • the management node may acquire the resource requirement information of each of the multiple tasks of the first application from the cache policy, and allocate a target computing node for executing each task of the first application.
  • the resource requirement information of each task may include computing resource requirement information and cache resource requirement information of each task.
  • the management node can collect and update the usage of the computing resources and cache resources of each computing node in real time. According to the computing resource requirement information of each task and the most recently updated usage of the computing resources of each computing node, it determines, from the computing nodes, candidate computing nodes that can meet the computing resource requirements of the tasks of the first application. Then, according to the cache resource requirement information of each task and the most recently updated usage of the cache resources of each candidate computing node, it further determines, from the candidate computing nodes, the computing nodes that can meet the cache resource requirements of the tasks of the first application, and takes the finally determined computing nodes as the target computing nodes.
  • for example, the management node can determine the remaining computing resources on each computing node according to the most recently updated resources occupied by the applications running on that node, and then determine, from the multiple computing nodes, candidate computing nodes whose remaining computing resources meet the computing resource requirements of the tasks of the first application. Afterwards, according to the most recently updated size of the remaining cache space of each candidate computing node and the types of storage media forming that remaining cache space, it determines, from the candidate computing nodes, the computing nodes whose remaining cache space is larger than the cache space required by the tasks of the first application and whose remaining cache space includes the storage media required by those tasks, thereby obtaining the target computing nodes.
  • the management node may also first determine candidate computing nodes from the multiple computing nodes according to the cache resource requirement information of each task, and then determine the target computing nodes from the candidate computing nodes according to the computing resource requirement information of each task of the first application; this will not be described again in this embodiment of the present application.
  • in the foregoing manner, the management node can determine the computing node that runs each task of the first application.
  • the computing nodes running each task can be different, so there will be multiple target computing nodes.
  • the computing nodes running each task may also be the same target computing node, so there will be one target computing node.
  • some tasks can be executed by one target computing node, and some tasks can be executed by another target computing node, so there will be multiple target computing nodes.
  • the management node may directly assign a target computing node to the first application according to the resource requirement information of the first application.
  • for this implementation, reference may be made to the foregoing implementation manner of allocating target computing nodes for each task, and details are not described in this embodiment of the present application.
  • the management node can also determine the resource requirement information of each of the multiple tasks of the first application according to the resource requirement information of the first application and the task division principle of the first application, and then allocate a target computing node for each task in the foregoing manner.
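The two-stage selection in step 302 (first filter by computing resources, then by cache space size and media types) can be sketched as a greedy assignment; the patent does not prescribe a specific algorithm, and all field names below are hypothetical:

```python
def allocate_target_nodes(tasks, nodes):
    """Greedy sketch of step 302: for each task, first find candidate
    nodes whose remaining computing resources suffice, then pick one
    whose remaining cache space is large enough and contains the
    required storage media. Field names are illustrative only."""
    assignment = {}
    for task in tasks:
        # stage 1: candidates meeting the computing resource requirement
        candidates = [n for n in nodes if n["free_cores"] >= task["cores"]]
        # stage 2: among candidates, require enough cache space of the
        # right media types; deduct the granted resources on success
        for node in candidates:
            if (node["free_cache_mb"] >= task["cache_mb"]
                    and set(task["media"]) <= set(node["media"])):
                node["free_cores"] -= task["cores"]
                node["free_cache_mb"] -= task["cache_mb"]
                assignment[task["task_id"]] = node["node_id"]
                break
        else:
            raise RuntimeError(f"no node can host task {task['task_id']}")
    return assignment
```

A task set may map to several distinct target computing nodes, to a single node, or partly to each, matching the three cases the description enumerates.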
  • Step 303 the management node sends the cache policy and mapping directory information to the target computing node.
  • the management node can deliver the cache policy and mapping directory information of each task of the first application to the target computing nodes, so as to control the target computing nodes to allocate cache resources for each task of the first application according to the cache policy, and to access the cache resources of the first application according to the mapping directory information and the cache policy.
  • the management node may deliver the cache policy and mapping directory information to each target computing node.
  • the management node can take the data cache and access policy contained in the cache policy, together with the resource requirement information of each task, as the cache policy of the corresponding task, and then deliver the mapping directory information and the cache policy of each task to the target computing node corresponding to that task, where the target computing node corresponding to a task is the target computing node that runs the task.
  • the management node can also deliver the identifier of the task to be run to each target computing node, so as to indicate which task of the first application the target computing node is to run.
  • the identifier of the task can uniquely identify the task.
  • after each target computing node receives the cache policy and mapping directory information sent by the management node, it can run the first application through the following steps 304 to 306.
  • Step 304 The target computing node allocates cache resources for the first application according to the cache policy.
  • after receiving the cache policy and mapping directory information issued by the management node, the target computing node first allocates cache resources for the first application according to the cache policy.
  • the target computing node may obtain the cache resource requirement information from the received cache policy, and then allocate cache resources for the task of the first application that it is to execute according to the cache resource requirement information.
  • in the following, one target computing node is used as an example for description; for convenience, this target computing node is referred to as the first target computing node.
  • the first target computing node obtains the cache resource requirement information of the first task from the received cache policy, and then, according to the cache resource requirement information of the first task, allocates, in its own cache resources, a cache space that meets the cache resource requirement of the first task.
  • the first target computing node may allocate, according to the cache resource requirement information, a cache space of the size indicated by the cache resource requirement information as the cache space of the first application.
  • the allocated cache space of the first application may be the cache space of the first task of the first application running on the first target computing node, that is, for storing task data of the first task; it may also be the cache space of other tasks of the first application running on other target computing nodes, that is, for storing task data of those other tasks.
  • each target computing node allocates cache space for each task of the first application according to the cache policy issued by the management node, so that the cache spaces of the tasks of the first application located on the target computing nodes together constitute the cache resource of the first application.
  • the target computing node may allocate cache space for the first application from its own cache resources according to the cache resource requirement information. In this way, the cache resource of the first application will be located on one computing node.
  • Step 305 The target computing node prefetches the application data of the first application into the cache resource of the first application according to the mapping directory information.
  • the target computing node may obtain the application data of the first application from the storage system according to the mapping directory information, and then cache the application data in the cache resource of the first application.
  • the first target computing node is still taken as an example for illustration.
  • the first target computing node may determine the directory identifier of the subdirectory corresponding to the first task, obtain, from the storage system, the data under the subdirectory corresponding to the first task stored in the first directory according to the directory path of the first directory and the directory identifier of that subdirectory, and then store the acquired data in the cache space allocated for the first task.
  • the first target computing node may obtain the directory identifier of the subdirectory corresponding to the first task from a preset correspondence between task identifiers and directory identifiers of subdirectories, according to the task identifier of the first task.
  • the first target computing node may also generate the directory identifier of the subdirectory corresponding to the first task by using a preset rule according to the task identifier of the first task. For example, if the task number of the first task is 1 and the preset rule for generating the directory identifier of the subdirectory corresponding to a task is processor + task number, then according to the preset rule, the directory identifier of the subdirectory corresponding to the first task is processor1.
  • in this way, the first target computing node can obtain, from the first directory stored in the storage system according to the directory path of the first directory, the data under the subdirectory whose directory identifier matches that of the subdirectory corresponding to the first task, that is, the task data of the first task, and then store the task data of the first task in the cache space of the first task.
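The preset-rule example above (directory identifier = prefix + task number) can be sketched as follows; the prefix string and the example directory path are hypothetical:

```python
def subdirectory_id(task_number, prefix="processor"):
    """Generate the directory identifier of a task's subdirectory using
    the preset rule described above: a fixed prefix concatenated with
    the task number (prefix value is an assumed example)."""
    return f"{prefix}{task_number}"

# Combining it with the directory path of the first directory (a
# hypothetical path) yields the location of the task's data in the
# storage system:
first_directory = "/storage/app-first/input"
task_path = first_directory + "/" + subdirectory_id(1)
```

The target computing node would then fetch everything under `task_path` into the cache space of the first task.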
  • when storing the task data, the first target computing node may also store different data in different types of storage media according to the data type of the task data of the first task.
  • for example, hotspot data whose access frequency is higher than a first threshold is stored in a storage medium with higher performance, and data with a lower access frequency is stored in a storage medium with relatively lower performance. As another example, metadata and data other than metadata may be stored in different types of storage media.
  • the first threshold may be set according to business requirements, may be set according to task data processing efficiency, may also be an experience value, or may be set according to system processing capability.
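The tiered placement described above can be sketched as a simple policy function. The medium names and the special-casing of metadata follow the examples in the text; the threshold value is deployment-specific, and the function signature is hypothetical:

```python
def choose_medium(access_frequency, first_threshold, is_metadata=False):
    """Hierarchical cache placement sketch: hotspot data whose access
    frequency exceeds the first threshold goes to the higher-performance
    medium (e.g. DRAM); colder data goes to the lower-performance medium
    (e.g. SCM). Metadata is placed on the faster tier here as one
    possible reading of the metadata example in the text."""
    if is_metadata or access_frequency > first_threshold:
        return "DRAM"
    return "SCM"
```

The first threshold itself, as the text notes, may come from business requirements, processing-efficiency measurements, experience, or system capacity.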
  • through the above implementation, for each task of the first application executed by the first target computing node, the data under the subdirectory corresponding to the task is prefetched into the cache space of the corresponding task.
  • the first target computing node may obtain the data under the first directory indicated by the mapping directory information from the storage system, and then perform a hash operation on the directory path of the obtained data to obtain a hash value corresponding to the data.
  • then, a target computing node whose node identifier matches the hash value is determined from among the multiple target computing nodes. If the target computing node whose node identifier matches the hash value is the first target computing node itself, it stores the data in the cache space it allocated for the first application.
  • if the determined target computing node is another node, for example a second target computing node, the first target computing node may send the data to the second target computing node, and after receiving the data, the second target computing node stores the data in the cache space it allocated for the first application.
  • alternatively, the target computing node can directly obtain, from the storage system, the data under the first directory indicated by the mapping directory information, and store it in the cache space allocated for the first application.
  • the target computing node may also use a multi-copy mechanism to prefetch task data of each task of the first application, or use other implementation manners to prefetch data of the first application, which is not limited in this embodiment of the present application.
  • the data prefetched from the first directory of the storage system may be all data in the first directory, or may be part of the data in the first directory.
  • what is prefetched may be all data of the task, or part of the data of the task, which is not limited in this embodiment of the present application.
  • more important data may be prefetched according to the access frequency of the data or other information that can indicate the importance of the data.
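The hash-based prefetch placement described in step 305 can be sketched as follows. The patent does not fix a hash function (SHA-1 is an illustrative choice here), and the node identifiers and data structures are hypothetical:

```python
import hashlib

def owner_node(directory_path, target_nodes):
    """Map a data item to one of the application's target computing
    nodes by hashing its directory path, so that the node whose
    identifier 'matches' the hash value owns the item."""
    h = int(hashlib.sha1(directory_path.encode()).hexdigest(), 16)
    return target_nodes[h % len(target_nodes)]

def prefetch(items, target_nodes, caches):
    """Place each item of the first directory into the cache space of
    its owning node. In the real system a node stores items it owns
    locally and sends the rest to their owners; here `caches` is a
    dict of per-node cache spaces standing in for both cases."""
    for path, data in items.items():
        caches.setdefault(owner_node(path, target_nodes), {})[path] = data
```

Because the mapping depends only on the directory path, every target computing node computes the same owner for a given item, which is what later lets IO requests be routed without a central lookup.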
  • Step 306 During the running of the first application, the target computing node accesses the cache resource of the first application according to the cache policy.
  • after allocating cache resources for the first application through step 304 and prefetching the application data of the first application into those cache resources through step 305, the target computing node starts the running script of the first application to begin running the first application.
  • the first target computing node is still taken as an example for illustration.
  • the first target computing node starts the running script of the first application, and executes the first task of the first application assigned to itself.
  • during execution, the first target computing node may need to read the application data of the first application, or write the data generated during task execution into the cache resource of the first application. Based on this, the first target computing node may generate an IO request according to the operation to be performed, where the IO request may be a read request or a write request, and may include the directory path of the directory where the accessed target data is located.
  • after obtaining the IO request, the first target computing node can first compare the directory path of the accessed target data with the mapping directory information; if the directory path of the accessed target data contains the mapping directory information, it can determine that the target data to be accessed is data under the first directory. In this case, since the data in the first directory was prefetched into the cache resource of the first application in step 305, the first target computing node can directly access the cache resource of the first application.
  • if the data was prefetched by the method described in the first possible situation in step 305, then after determining that the data to be accessed by the IO request is data under the first directory, the first target computing node accesses the cache resource of the first application according to the IO request as follows.
  • the first target computing node can first search for the target data in the cache space of the first task; if the target data is hit in the cache space of the first task, the target data is read from it. If the target data is not hit in the cache space of the first task, the IO request is sent to the other target computing nodes. After receiving the IO request, each of the other target computing nodes searches for the target data in the cache space it allocated for the tasks of the first application; if the target data is hit, it returns the target data to the first target computing node, and if not, it returns a notification message to the first target computing node to notify it that the data acquisition failed.
  • the first target computing node may acquire the target data from the storage system.
  • the first target computing node may write the target data into the cache space of the first task.
  • the IO request may be sent to other target computing nodes.
  • other target computing nodes may also send the IO request to the first target computing node.
  • the first target computing node may also receive IO requests sent by other target computing nodes, and access the cache space of the first task according to the IO requests.
  • computing nodes can send IO requests to access each other's cache space through remote direct memory access (RDMA) technology.
  • the first target computing node may perform a hash operation on the directory path of the target data to be accessed by the IO request to obtain the hash value corresponding to the target data, and determine the target computing node whose node identifier matches that hash value; if the determined target computing node is itself, the first target computing node accesses the cache space it allocated for the first application, so as to read or write the target data.
  • if the determined target computing node is not itself, the first target computing node sends the IO request to the determined target computing node, and the determined target computing node reads or writes the target data by accessing the cache space it allocated for the first application.
  • the corresponding target computing node can also obtain the target data from the storage system.
  • after obtaining the target data, the target computing node can access the cache space it allocated for the first application according to the IO request; for the access method, refer to the aforementioned implementation, which will not be repeated in this embodiment of the present application.
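The hash-routed access path in step 306 can be sketched end to end: hash the directory path to find the owning node, serve the request from that node's cache space, and fall back to the storage system (filling the cache) on a miss. This is a sketch of the flow only, not the exact protocol; hash choice and data structures are assumptions:

```python
import hashlib

def handle_io(path, target_nodes, caches, storage):
    """Route a read request by the same directory-path hash used at
    prefetch time. `caches` maps node id -> that node's cache space;
    `storage` stands in for the storage system. In the real system
    the inter-node lookup would go over the network (e.g. via RDMA)."""
    h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
    owner = target_nodes[h % len(target_nodes)]   # node owning this path
    cache = caches.setdefault(owner, {})
    if path in cache:                              # hit in the owner's cache space
        return cache[path]
    data = storage[path]                           # miss: read from storage system
    cache[path] = data                             # fill the owner's cache space
    return data
```

Because prefetch and access use the same hash mapping, a request for prefetched data always lands on the node that cached it.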
  • if the data cache and access policy in the cache policy also includes a data consistency policy, then when a certain target computing node modifies, deletes, or writes data in the cache resource of the first application according to an IO request, that target computing node can also perform a lock operation on the target data, so as to prevent other target computing nodes from accessing the target data and to ensure data consistency.
  • Step 307 The multiple computing nodes send to the management node the bandwidth requirements of the data to be migrated to the storage system in the cache resources of the running applications.
  • the multiple computing nodes include the target computing node, and the multiple applications include the first application.
  • when a computing node detects that the amount of data cached in the cache space it allocated for a certain application reaches a second threshold, it may send the bandwidth requirement of the application to the management node.
  • the management node can receive the bandwidth requirements of each application sent by each computing node in real time.
  • the bandwidth requirement is used to indicate the bandwidth required for migrating the data to be migrated in the cache resource of the corresponding application on the corresponding computing node to the storage system.
  • the bandwidth requirement may include the amount of data to be migrated in the cache resource of the corresponding application on the computing node.
  • other information such as an application identifier may also be included, which is not limited in this embodiment of the present application.
  • the second threshold may be preset according to the size of the cache space allocated for the application on the computing node; for example, the second threshold may be a preset proportion of the total capacity of the cache space allocated for the application, such as 80% of that cache space, or another value, which is not limited in this embodiment of the present application.
  • the multiple computing nodes include the target computing node; that is, when the target computing node detects that the amount of data cached in the cache space it allocated for the first application reaches the second threshold, it can send the bandwidth requirement of the first application to the management node.
  • the bandwidth requirement of the first application is used to indicate the bandwidth required to migrate the data to be migrated in the cache space of the first application on the target computing node to the storage system.
  • Step 308 The management node allocates IO bandwidth for the data to be migrated in the cache resource of the first application according to the bandwidth requirement.
  • after receiving the bandwidth requirements of the applications sent by the multiple computing nodes including the target computing node, the management node can allocate the corresponding IO bandwidth for the data to be migrated of each application according to the bandwidth required by the data to be migrated of each application, as indicated by its bandwidth requirement, and the current remaining bandwidth of the storage system.
  • for example, the management node may calculate the ratio of the bandwidth required by each application, and then allocate the IO bandwidth for each application according to that ratio and the current remaining bandwidth of the storage system. If the current remaining bandwidth of the storage system is not greater than the total bandwidth required by the applications, the IO bandwidth allocated to each application will be smaller than its required bandwidth; if the current remaining bandwidth of the storage system is greater than the total bandwidth required by the applications, the IO bandwidth allocated to each application may be equal to its required bandwidth.
  • the management node may also adopt other principles to allocate the IO bandwidth to the data to be migrated of each application, which is not limited in this embodiment of the present application.
  • the IO bandwidth allocated for each application's data to be migrated indicates the maximum amount of data that the application is allowed to migrate per unit time. For example, when the IO bandwidth allocated for the data to be migrated of the first application is 30 MB/s, the target computing node is allowed to migrate at most 30 MB of the first application's cached data to the storage system per second.
  • the management node can allocate the IO bandwidth for the data to be migrated in the cache resource of the first application through the above method.
  • Step 309: The management node sends the IO bandwidth allocated for the first application to the target computing node.
  • after the management node allocates the IO bandwidth for the data to be migrated in the cache resources of each application, it can send the IO bandwidth allocated for the corresponding application to the corresponding computing node.
  • the management node may send the IO bandwidth allocated for the data to be migrated in the cache resource of the first application to the target computing node.
  • Step 310: The target computing node stores the data to be migrated in the cache resource of the first application into the storage system according to the IO bandwidth allocated for the first application.
  • the IO bandwidth allocated for the first application indicates the amount of the first application's cached data that the target computing node is allowed to migrate this time. Based on this, according to the IO bandwidth allocated by the management node for the first application, the target computing node obtains from the cache space allocated for the first application an amount of data no greater than that IO bandwidth as the data to be migrated, and then migrates the data to be migrated to the storage system for persistent storage according to the mapping directory information specified by the user.
  • the operation of migrating data to the storage system according to the mapping directory information is the reverse of prefetching data from the storage system according to the mapping directory information.
  • after the target computing node starts running the first application, whenever it detects that the amount of data in the cache space it allocated for the first application reaches the second threshold, it can apply to the management node for IO bandwidth through steps 307-310 above and migrate the data in the first application's cache space to the storage system according to that IO bandwidth. Once the first application finishes running and all the data in its cache space has been migrated to the storage system, the target computing node can release the cache space allocated for the first application.
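The node-side migration step can be sketched as a simple rate-limited loop: write at most the allocated IO bandwidth worth of bytes within any one-second window. This is an assumed, simplified pacing scheme for illustration; the patent does not specify the rate-limiting mechanism, and `write_to_storage` is a hypothetical callback.

```python
import time

def migrate_with_bandwidth(chunks, io_bandwidth_bytes_per_s, write_to_storage):
    """Write cached chunks to the storage system without exceeding the
    allocated IO bandwidth: at most io_bandwidth_bytes_per_s bytes are
    written within any one-second window (e.g. 30 MB/s -> at most 30 MB/s)."""
    budget = io_bandwidth_bytes_per_s
    window_start = time.monotonic()
    for chunk in chunks:
        if budget < len(chunk):
            # budget for this window is spent: wait for the next window
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            budget = io_bandwidth_bytes_per_s
            window_start = time.monotonic()
        write_to_storage(chunk)
        budget -= len(chunk)
```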
  • cache resources are scheduled for the first application according to the cache policy submitted by the user for the first application, and data is prefetched into the cache resource of the first application according to the mapping directory information submitted by the user. Subsequently, while the first application runs, its cache resource is accessed according to the cache policy. It can be seen that this embodiment of the present application can sense the user's requirements and control the application's resource usage accordingly, thereby improving application performance.
  • the data of the first application stored in the storage system can be prefetched into the cache resource allocated for the first application according to the mapping directory information specified by the user, so that during the subsequent running of the first application, if the data to be accessed is data under the directory indicated by the mapping directory information, the cache resource allocated for the first application can be accessed directly, improving data access speed. Moreover, because the data in the storage system is prefetched via the user-specified mapping directory information, the complexity of use for the user is reduced.
  • data can be cached and accessed in the cache resource of the first application according to the data caching and access policies in the cache policy specified by the user; for example, data can be cached according to a hierarchical cache policy, thereby improving data access performance and saving cache-space resources.
  • the data is accessed according to the data consistency policy to ensure the accuracy of the data during the data access process.
  • users can also customize other policies to flexibly configure how data is cached and accessed.
  • when a computing node detects that the amount of data in the cache resource it allocated for an application reaches the second threshold, it can report to the management node the bandwidth requirement for migrating that application's data. According to the bandwidth requirements collected from the computing nodes, the management node can allocate IO bandwidth for migrating the data to be migrated of the application running on each computing node, thereby controlling how much data each computing node migrates to the storage system and preventing the applications' combined data access volume from exceeding the storage system's available bandwidth and causing IO bandwidth contention.
  • the cluster system can complete the data copy automatically, without requiring the user to do so manually, which reduces the complexity of the user's operations.
  • the steps related to the management node can be independently implemented as a data access method on the management node side.
  • the steps related to the computing node can be independently implemented as a data access method on the computing node side.
  • the embodiment of the present application provides a data access device 400, which can be applied in a cluster system, and the device 400 includes:
  • a receiving module 401 configured to execute step 301 in the above embodiment
  • the access module 404 is configured to execute step 306 in the above embodiment.
  • the data access device 400 in this embodiment of the present application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the data access method shown in FIG. 3 can also be realized by software
  • the data access device 400 and its modules can also be software modules.
  • the scheduling module 402 is mainly used for:
  • determining resource requirement information for each of the multiple tasks of the first application according to the cache policy, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included;
  • allocating cache space for each task according to its resource requirement information.
  • mapping directory information includes the directory path of the first directory
  • prefetching module 403 is mainly used for:
  • the data under the subdirectory corresponding to each task stored in the first directory is obtained from the storage system;
  • the data under the subdirectory corresponding to each task is stored in the cache resource of the first application.
  • the access module 404 is mainly used for:
  • the caching strategy includes a hierarchical caching strategy
  • different types of task data of each task of the first application are cached in a storage medium of a corresponding type in the caching resource of the first application
  • the cache policy includes a data consistency policy
  • a lock operation is performed on the accessed task data.
  • the device 400 is also used for:
  • when the data accessed by the IO request is data under the first directory indicated by the mapping directory information, performing the step of accessing the cache resource of the first application according to the cache policy.
  • the device 400 is also used for:
  • the data to be migrated in the cache resource of the first application is stored in the storage system.
  • the data access device 400 may correspond to the methods described in the embodiments of the present application, and the above and other operations and/or functions of the units in the data access device 400 are respectively intended to implement the corresponding procedures executed by the corresponding nodes in the method of FIG. 3; details are not repeated here.
  • cache resources are scheduled for the first application according to the cache policy submitted by the user for the first application, and data is prefetched into the cache resource of the first application according to the mapping directory information submitted by the user. Subsequently, while the first application runs, its cache resource is accessed according to the cache policy. It can be seen that this embodiment of the present application can sense the user's requirements and control the application's resource usage accordingly, thereby improving application performance.
  • the embodiment of the present application provides a data access device 500, which can be applied to a management node, and the device 500 includes:
  • a receiving module 501 configured to execute step 301 in the above embodiment
  • the scheduling module 502 is configured to execute the operation of sending the caching policy to the target computing node in step 302 and step 303 in the above embodiment, so as to control the target computing node to execute steps 304 to 306.
  • the device 500 in this embodiment of the present application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the device 500 and its modules can also be software modules.
  • the scheduling module 502 is mainly used for:
  • the cache policy is sent to the target computing node to instruct the target computing node to allocate cache space for the corresponding task from its own cache space according to the resource requirement information in the cache policy.
  • mapping directory information includes the directory path of the first directory
  • scheduling module 502 is mainly used for:
  • the device 500 is also used for:
  • the device 500 in this embodiment of the present application may correspond to the methods described in the embodiments of the present application, and the above and other operations and/or functions of the units in the device 500 are respectively intended to implement the corresponding procedures executed by the corresponding nodes in the method of FIG. 3; details are not repeated here.
  • the management node schedules cache resources for the first application according to the cache policy submitted by the user for the first application, and controls the computing node to prefetch data into the cache resource of the first application according to the mapping directory information submitted by the user.
  • the management node controls the computing node to access the cache resource of the first application according to the cache policy. It can be seen that this embodiment of the present application can sense the user's requirements and control the application's resource usage accordingly, thereby improving application performance.
  • the present application also provides a data access device 600.
  • the data access device 600 can be applied to computing nodes, and the data access device 600 includes:
  • the receiving module 601 is configured to receive a user-specified caching policy and mapping directory information of the first application, where the caching policy is used to indicate the caching requirements of the first application, and the mapping directory information is information of the first directory in which the application data of the first application is stored in the storage system;
  • the access module 604 is configured to execute step 306 in the foregoing embodiment.
  • the data access device 600 in this embodiment of the present application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the data access method shown in FIG. 3 can also be realized by software
  • the data access device 600 and its modules can also be software modules.
  • the allocation module 602 is mainly used for:
  • a cache space is allocated in its own cache resource for the first task run by itself, where the first task is any one of multiple tasks running on the computing node.
  • the prefetching module 603 is mainly used for:
  • the data under the subdirectory corresponding to the first task stored in the first directory is obtained from the storage system;
  • the data under the subdirectory corresponding to the first task is stored in the cache space of the first task.
  • the prefetch module is further configured to: store different data in different types of storage media according to the data type of the data in the subdirectory corresponding to the first task.
  • the access module 604 is mainly used for:
  • the cache resource of the first application is accessed according to the IO request and the cache policy.
  • the access module is mainly used for:
  • the apparatus 600 is further configured to: when detecting that the amount of data in the cache resource it allocated for the first application reaches a reference threshold, send the bandwidth requirement of the first application to the management node, where the bandwidth requirement indicates the bandwidth required for migrating the data to be migrated in the first application's cache resource on the computing node to the storage system; receive the IO bandwidth allocated by the management node for the data to be migrated in the first application's cache resource; and migrate that data to the storage system according to the allocated IO bandwidth.
  • the device 600 in this embodiment of the present application may correspond to the methods described in the embodiments of the present application, and the above and other operations and/or functions of the units in the device 600 are respectively intended to implement the corresponding procedures executed by the corresponding nodes in the method of FIG. 3; details are not repeated here.
  • the computing node can allocate corresponding cache resources for the first application according to the user-specified caching policy of the first application, so that the first application's use of the computing node's resources meets the user's requirements, and in turn the application performance of the first application can meet the user's needs.
  • data access can be performed directly in the cache resources allocated by the computing node for the first application, reducing accesses to the storage system and thereby reducing contention among computing nodes.
  • the computing node can prefetch the first application's data under the first directory of the storage system into the first cache space according to the mapping directory information, without manual data copying by the user, which reduces operational complexity.
  • the present application also provides a data access system that includes a management node and a computing node, where the connection between the management node and the computing node may be as shown for the system in FIG. 1, and the structures of the management node and the computing node may be as shown for the computer device in FIG. 2.
  • the management node is used to implement the functions of the management node in the data access method shown in FIG. 3, and the computing node is used to implement the functions of the computing node in that method; details are not repeated here.
  • when the data access device provided in the above embodiments reads and writes data, the division into the above functional modules is merely used as an example for description. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the data access device provided in the above embodiments and the data access method embodiments belong to the same concept; for its specific implementation process, refer to the method embodiments, and details are not repeated here.
  • all or part of the foregoing may be implemented by software, hardware, firmware, or any combination thereof.
  • when implemented using software, it may be implemented wholly or partly in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), or a semiconductor medium (for example, a solid-state disk (SSD)), among others.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.


Abstract

A data access method, comprising: scheduling a cache resource for a first application according to a cache policy submitted by a user for the first application; pre-fetching data into the cache resource for the first application according to mapping directory information submitted by the user; and subsequently, during running of the first application, accessing the cache resource for the first application according to the cache policy. Thus, a requirement of a user is sensed, and resource usage of the application is controlled according to the requirement of the user, thereby improving the application performance.

Description

Data access method, device and storage medium
This application claims priority to Chinese Patent Application No. 202111010014.8, entitled "Data access method, device and storage medium" and filed with the China Patent Office on August 31, 2021, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of data processing technologies, and in particular to a data access method, device, and storage medium.
Background
In a data center, a large cluster system is usually used to provide a shared application execution environment for multiple users. Such a cluster system usually includes a management node and multiple computing nodes. For any application to be run, the management node assigns a corresponding computing node to the application, and that computing node then runs the application. As a result, the user cannot control how the application uses the computing node's resources, and the application's performance may fail to meet the user's requirements.
Summary
The present application provides a data access method, device, and storage medium that can control an application's resource usage according to the user's requirements, so that the application's performance meets those requirements. The technical solutions are as follows:
According to a first aspect, a data access method is provided. The method includes: receiving a cache configuration request for a first application submitted by a user, where the cache configuration request includes a cache policy and mapping directory information, the cache policy is used to indicate the caching requirements of the first application, and the mapping directory information is information of a first directory in which the application data of the first application is stored in a storage system; scheduling cache resources for the first application according to the cache policy; prefetching the application data of the first application into the cache resources of the first application according to the mapping directory information; and, while running the first application, accessing the cache resources of the first application according to the cache policy.
It can be seen from the above description that cache resources are scheduled for the first application according to the cache policy submitted by the user for the first application, and data is prefetched into the cache resources of the first application according to the mapping directory information submitted by the user. Subsequently, while the first application runs, its cache resources are accessed according to the cache policy. Thus, the data access method provided by the present application can sense the user's requirements and control the application's resource usage accordingly, thereby improving application performance.
In a possible implementation, scheduling cache resources for the first application according to the cache policy includes: determining resource requirement information for each of multiple tasks of the first application according to the cache policy, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included; and allocating cache space for each task according to its resource requirement information.
In this application, the first application can be divided into multiple tasks, so the user can specify resource requirement information for each task of the first application to make the cluster system allocate cache space for the corresponding task accordingly. The resource requirement information of different tasks may be the same or different.
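The per-task allocation described here can be sketched as follows. This is a hypothetical illustration: the patent does not define these data structures, and the medium names ("DRAM", "SSD") and the first-fit debiting scheme are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ResourceRequirement:
    task_id: str
    size_bytes: int    # size of cache space the task needs
    medium: str        # type of storage medium, e.g. "DRAM" or "SSD"

def allocate_for_tasks(requirements, free_space):
    """free_space: medium -> bytes still available on the computing node.
    Returns task_id -> (medium, size) and debits free_space as it goes."""
    allocation = {}
    for req in requirements:
        if free_space.get(req.medium, 0) < req.size_bytes:
            raise MemoryError(f"not enough {req.medium} for task {req.task_id}")
        free_space[req.medium] -= req.size_bytes
        allocation[req.task_id] = (req.medium, req.size_bytes)
    return allocation
```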
In a possible implementation, the mapping directory information includes the directory path of the first directory, and prefetching the application data of the first application into the cache resources of the first application according to the mapping directory information includes: determining the directory identifier of the subdirectory corresponding to each of the multiple tasks of the first application; obtaining, from the storage system, the data under the subdirectory corresponding to each task stored in the first directory according to the directory path of the first directory and the directory identifier of the subdirectory corresponding to each task; and storing the data under the subdirectory corresponding to each task into the cache resources of the first application.
The data of the first application is prefetched into the cache resources allocated for the first application through the user-specified mapping directory information, so the user does not need to perform an explicit data copy, which reduces the complexity of use.
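The prefetch steps above (join the first directory's path with each task's subdirectory identifier, fetch the data, place it in the cache) can be sketched like this. `read_from_storage` and `cache_store` are hypothetical callbacks standing in for the storage system and the cache resource, which the patent does not specify at this level.

```python
import posixpath

def prefetch_task_data(first_directory, task_subdirs, read_from_storage,
                       cache_store):
    """task_subdirs: task_id -> directory identifier of the task's subdirectory.
    For each task, build the subdirectory path from the first directory's path
    and the task's identifier, fetch the data under it from the storage system,
    and place each item into the application's cache resource."""
    for task_id, subdir in task_subdirs.items():
        path = posixpath.join(first_directory, subdir)
        for name, data in read_from_storage(path):
            cache_store(task_id, name, data)
```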
In a possible implementation, accessing the cache resources of the first application according to the cache policy includes: when the cache policy includes a hierarchical cache policy, caching the different types of task data of each task of the first application in the storage medium of the corresponding type in the cache resources of the first application according to the hierarchical cache policy; and, when the cache policy includes a data consistency policy, performing a lock operation on the accessed task data when accessing task data in any cache space.
In this application, data can be cached and accessed in the cache resources of the first application according to the data caching and access policies in the user-specified cache policy; for example, data can be cached according to the hierarchical cache policy to improve data access performance and save cache-space resources. Data is accessed according to the data consistency policy to ensure the accuracy of the data during access. In addition, the user can flexibly customize other policies to configure how data is cached and accessed.
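The two policies just described (type-to-tier placement and lock-protected access) can be combined in a small sketch. The data-type names and the single-lock design are illustrative assumptions, not the patent's specified mechanism; a real implementation would likely use finer-grained locking.

```python
import threading

TIER_FOR_TYPE = {        # hypothetical mapping from data type to medium tier
    "metadata": "DRAM",
    "intermediate": "SCM",
    "bulk": "SSD",
}

class TieredCache:
    """Caches each type of task data in the medium tier named by the
    hierarchical cache policy, and takes a lock around every access so that
    concurrent accesses stay consistent (data consistency policy)."""
    def __init__(self):
        self._tiers = {medium: {} for medium in set(TIER_FOR_TYPE.values())}
        self._lock = threading.Lock()

    def put(self, data_type, key, value):
        with self._lock:                       # lock operation on access
            self._tiers[TIER_FOR_TYPE[data_type]][key] = value

    def get(self, data_type, key):
        with self._lock:
            return self._tiers[TIER_FOR_TYPE[data_type]].get(key)
```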
In a possible implementation, before accessing the cache resources of the first application according to the cache policy, the method further includes: obtaining an input/output (IO) request; and, if the data accessed by the IO request is data under the first directory indicated by the mapping directory information, performing the step of accessing the cache resources of the first application according to the cache policy.
In this application, by setting the mapping directory information, accesses to the first directory indicated by the mapping directory information can be intercepted directly, and data access can then be served from the cache resources allocated for the first application, improving access efficiency; the entire process is transparent to the user.
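The interception decision reduces to a path-prefix check against the first directory. A minimal sketch, assuming POSIX-style paths (the patent does not state the path format):

```python
import posixpath

def route_io_request(request_path, first_directory):
    """Return 'cache' if the request targets data under the first directory
    named by the mapping directory information, otherwise 'storage'."""
    root = first_directory.rstrip("/")
    norm = posixpath.normpath(request_path)
    return "cache" if norm == root or norm.startswith(root + "/") else "storage"
```

Requests under the mapped directory are redirected to the application's cache resource; all others pass through to the storage system.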
In a possible implementation, the method further includes: obtaining the bandwidth requirement of the data to be migrated to the storage system in the cache resources of each of multiple applications, where the multiple applications include the first application; allocating IO bandwidth for the data to be migrated in the cache resources of the first application according to the bandwidth requirement; and storing the data to be migrated in the cache resources of the first application into the storage system according to the IO bandwidth.
In this application, when a computing node detects that the amount of data in the cache resources it allocated for an application reaches the second threshold, it can report to the management node the bandwidth requirement for migrating that application's data. According to the bandwidth requirements collected from the computing nodes, the management node can allocate IO bandwidth for migrating the data to be migrated of the application running on each computing node, thereby controlling how much data each computing node migrates to the storage system. This prevents the applications' combined data access volume from exceeding the storage system's available bandwidth and causing IO bandwidth contention, so that the applications access the storage system in an orderly manner from a global view, reducing application performance problems caused by IO contention. Moreover, according to the user-specified mapping directory information, the cluster system can complete the data copy automatically, without requiring the user to do so manually, which reduces the complexity of the user's operations.
In a second aspect, a data access method is provided. The method includes: a management node receives a cache configuration request for a first application submitted by a user, where the cache configuration request includes a cache policy and mapping directory information, the cache policy is used to indicate the caching requirements of the first application, and the mapping directory information is information about a first directory in the storage system in which the application data of the first application is stored; the management node schedules cache resources for the first application from target computing nodes according to the cache policy; and the management node controls, according to the mapping directory information, the target computing nodes to prefetch the application data of the first application into the cache resources of the first application, and controls, through the cache policy, the target computing nodes to access the cache resources of the first application while the first application is running.
In this application, the management node schedules cache resources for the first application according to the cache policy the user submitted for it, and controls the computing nodes to prefetch data into the cache resources of the first application according to the mapping directory information the user submitted. Subsequently, while the first application is running, the computing nodes are controlled to access the cache resources of the first application according to that cache policy. The embodiments of this application can therefore perceive the user's requirements and control the application's resource usage accordingly, improving application performance.
In a possible implementation, the management node scheduling cache resources for the first application from target computing nodes according to the cache policy includes: the management node obtains, from the cache policy, resource requirement information for each of multiple tasks of the first application, where the resource requirement information includes the size of the cache space required by each task and the types of storage media it comprises; the management node assigns, according to the resource requirement information, the target computing nodes that will execute the tasks of the first application; and the management node sends the cache policy to the target computing nodes to instruct each target computing node to allocate, from its own cache space and according to the resource requirement information in the cache policy, cache space for the corresponding task.
In this application, the management node can control the first application's use of computing-node resources according to the cache policy specified by the user. This gives the user control over resource usage on the computing nodes, so that the application performance of the first application can better meet the user's needs.
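The task-to-node assignment described above can be illustrated with a minimal first-fit placement sketch. The first-fit policy, the function name `schedule_tasks`, and the dictionary shapes are assumptions chosen for illustration; the application does not mandate a particular placement algorithm.

```python
def schedule_tasks(tasks, nodes):
    """Assign each task to a node that satisfies its cache requirement.

    tasks: list of dicts like {"task": "t1", "size": 8, "medium": "DRAM"},
           where size is the required cache space and medium its type.
    nodes: dict node -> {medium: free_capacity}; mutated as space is
           reserved. Returns {task_name: node_name}.
    """
    placement = {}
    for task in tasks:
        for node, free in nodes.items():
            if free.get(task["medium"], 0) >= task["size"]:
                free[task["medium"]] -= task["size"]  # reserve the space
                placement[task["task"]] = node
                break
        else:
            raise RuntimeError(f"no node can host task {task['task']}")
    return placement
```

After placement, sending each target node the cache policy (as the method describes) lets the node carve the reserved space out of its local DRAM/SCM/SSD media.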
In a possible implementation, the mapping directory information includes the directory path of the first directory, and controlling the target computing nodes, according to the mapping directory information, to prefetch the application data of the first application into the cache resources of the first application may be implemented as follows: the management node sends the directory path of the first directory to the target computing nodes to instruct each target computing node to prefetch, from the storage system according to that directory path, the data stored in each task's subdirectory under the first directory, and to store the obtained data in the cache resources of the first application.
In this application, the management node uses the user-specified mapping directory information to control the computing nodes to prefetch the first application's data into the cache resources allocated for it. The user therefore does not need to perform an explicit data copy, which reduces usage complexity.
In a possible implementation, the method further includes: the management node receives, from multiple computing nodes including the target computing node, the bandwidth requirements of the data to be migrated to the storage system from the cache resources of the respective applications; allocates, according to those bandwidth requirements, IO bandwidth for the data to be migrated in the cache resources of the first application; and sends the IO bandwidth allocated for the first application to the target computing node, to instruct the target computing node to store the data to be migrated in the cache resources of the first application into the storage system according to the allocated IO bandwidth.
In this application, the management node can allocate, according to the bandwidth requirements collected from the computing nodes, the IO bandwidth used to migrate the data to be migrated for the applications running on each computing node, thereby controlling the volume of data each computing node migrates to the storage system. This prevents the application data traffic of different computing nodes from exceeding the available bandwidth of the storage system and causing I/O bandwidth contention, and enables the applications to access the storage system in an orderly manner from a global view, reducing application performance problems caused by IO contention.
In a third aspect, a data access method is provided. The method includes: a computing node receives a cache policy and mapping directory information for a first application specified by a user, where the cache policy is used to indicate the caching requirements of the first application and the mapping directory information is information about a first directory in the storage system in which the application data of the first application is stored; the computing node allocates cache resources for the first application according to the cache policy and prefetches the application data of the first application into those cache resources according to the mapping directory information; and, while running the first application, the computing node accesses the cache resources of the first application according to the cache policy.
In this application, the computing node can allocate cache resources for the first application according to the user-specified cache policy, so that the first application's use of the computing node's resources meets the user's requirements and its application performance can satisfy the user's needs. On this basis, while the first application is running, data can be accessed directly in the cache resources the computing node allocated for it, reducing accesses to the storage system and thus the contention between computing nodes. Moreover, after allocating cache resources for the first application, the computing node can prefetch the first application's application data under the first directory in the storage system into the first cache space according to the mapping directory information, so no manual data copy by the user is required, which reduces operational complexity.
In a possible implementation, when allocating cache resources for the first application according to the cache policy, the computing node obtains, from the cache policy, resource requirement information for each of multiple tasks of the first application, where the resource requirement information includes the size of the cache space required by each task and the types of storage media it comprises; according to the resource requirement information of each task, the computing node allocates, from its own cache resources, cache space for a first task running on it, where the first task is any one of the multiple tasks running on the computing node.
In a possible implementation, prefetching the application data of the first application into the cache resources of the first application according to the mapping directory information includes: determining the directory identifier of the subdirectory corresponding to the first task; obtaining, from the storage system, the data stored under that subdirectory of the first directory according to the directory path of the first directory and the directory identifier of the subdirectory corresponding to the first task; and storing the data of the subdirectory corresponding to the first task into the cache space of the first task.
In a possible implementation, if the cache policy includes a tiered caching policy, storing the data of the subdirectory corresponding to the first task into the cache space of the first task includes: storing different data in different types of storage media according to the data types of the data in that subdirectory.
In this application, caching data according to a tiered caching policy allows different types of data to be stored in suitable storage media, which improves data access performance and saves cache-space resources.
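The tiered placement can be sketched as a simple lookup from data type to storage medium. The particular type-to-tier mapping below (`metadata` to DRAM, `hot` to SCM, `bulk` to SSD) is a hypothetical example; the actual tiers and data types would come from the user's cache policy.

```python
# Hypothetical mapping from data type to storage tier; the real tiers
# and types are configured by the cache policy, not fixed by this method.
TIER_BY_TYPE = {"metadata": "DRAM", "hot": "SCM", "bulk": "SSD"}


def place(items, default_tier="SSD"):
    """Group cached items into per-medium buckets by their data type.

    items: iterable of (name, data_type) pairs. Unknown types fall
    back to default_tier.
    """
    tiers = {}
    for name, data_type in items:
        tier = TIER_BY_TYPE.get(data_type, default_tier)
        tiers.setdefault(tier, []).append(name)
    return tiers
```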
In a possible implementation, accessing the cache resources of the first application according to the cache policy while running the first application may include: obtaining an IO request, and, if the data accessed by the IO request is data under the first directory indicated by the mapping directory information, accessing the cache resources of the first application according to the IO request and the cache policy.
In this application, by setting the mapping directory information, accesses to the first directory indicated by that information can be intercepted directly, and the data access is then served from the cache resources allocated for the first application. This improves access efficiency, and the whole process is transparent to the user.
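The interception decision reduces to a path-prefix test against the mapped first directory. A minimal sketch, assuming POSIX-style paths and a hypothetical `route_io` helper:

```python
import os.path


def route_io(request_path, mapped_dir):
    """Decide whether an IO request is served from the application cache.

    Returns "cache" when request_path falls under the mapped first
    directory, otherwise "storage".
    """
    rel = os.path.relpath(os.path.normpath(request_path),
                          os.path.normpath(mapped_dir))
    # A relative path escaping upward means the request is outside
    # the mapped directory and must go to the storage system.
    return "storage" if rel.startswith("..") else "cache"
```

Using `relpath` rather than a raw string prefix avoids false matches such as `/data/app10` being treated as inside `/data/app1`.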
In a possible implementation, if the cache policy includes a data consistency policy, the computing node performs a lock operation on the accessed data whenever it accesses data in the cache resources of the first application, thereby guaranteeing the accuracy of the data during data access.
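The locking behavior can be sketched with a cache wrapper that takes a lock around every access. This is a deliberately coarse illustration with a single process-wide lock; a real implementation could lock at per-file or per-block granularity, and the class name `ConsistentCache` is hypothetical.

```python
import threading


class ConsistentCache:
    """Cache wrapper that locks the cached data while it is accessed,
    a minimal stand-in for the data consistency policy described above."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:  # lock before touching the cached data
            self._data[key] = value

    def read(self, key):
        with self._lock:
            return self._data.get(key)
```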
In a possible implementation, when the computing node detects that the amount of data in the cache resources it has allocated for the first application reaches a reference threshold, it sends the bandwidth requirement of the first application to the management node, where the bandwidth requirement indicates the bandwidth needed to migrate the data to be migrated in the first application's cache resources on the computing node to the storage system; the computing node then receives the IO bandwidth allocated by the management node for that data and migrates it to the storage system according to the allocated IO bandwidth.
In this application, the computing node can request the management node to allocate IO bandwidth for the first application by sending it the first application's bandwidth requirement. Because the management node collects the bandwidth requirements of all computing nodes at the same time, migrating data according to the IO bandwidth it allocates avoids I/O bandwidth contention among the computing nodes, and enables the applications to access the storage system in an orderly manner from a global view, reducing application performance problems caused by IO contention.
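The threshold check that triggers the bandwidth request can be sketched as below. The 80% threshold ratio, the per-period link bandwidth, and the function name `check_migration` are illustrative assumptions; the real values would come from the cache policy and node configuration.

```python
def check_migration(cache_used, cache_capacity, threshold_ratio=0.8,
                    link_bandwidth=100):
    """Return the bandwidth demand to report to the management node when
    the cache fill level reaches the reference threshold, else None.

    cache_used / cache_capacity and link_bandwidth share arbitrary units.
    """
    if cache_used / cache_capacity < threshold_ratio:
        return None  # below the reference threshold: nothing to migrate
    to_migrate = cache_used - threshold_ratio * cache_capacity
    # Ask for enough bandwidth to drain the excess within one period,
    # capped by what the node's link can carry.
    return min(link_bandwidth, to_migrate)
```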
In a fourth aspect, a data access device is provided. The data access device has the function of implementing the behavior of the data access method of the first aspect above, and includes at least one module used to implement the data access method provided in the first aspect.
In a fifth aspect, a data access device is provided. The data access device has the function of implementing the behavior of the data access method of the second aspect above, and includes at least one module used to implement the data access method provided in the second aspect.
In a sixth aspect, a data access device is provided. The data access device has the function of implementing the behavior of the data access method of the third aspect above, and includes at least one module used to implement the data access method provided in the third aspect.
In a seventh aspect, a cluster system is provided. The cluster system includes a management node and computing nodes, each of which includes a processor and a memory. The memory is used to store a program that supports the cluster system in executing the data access method provided in the first aspect, and to store the data involved in implementing that method. The processor is configured to execute the program stored in the memory.
In an eighth aspect, a management node is provided. The management node includes a processor and a memory. The memory is used to store a program that supports the management node in executing the data access method provided in the second aspect, and to store the data involved in implementing that method. The processor is configured to execute the program stored in the memory.
In a ninth aspect, a computing node is provided. The computing node includes a processor and a memory. The memory is used to store a program that supports the computing node in executing the data access method provided in the third aspect, and to store the data involved in implementing that method. The processor is configured to execute the program stored in the memory.
In a tenth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the data access method described in the first, second, or third aspect.
In an eleventh aspect, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to execute the data access method described in the first, second, or third aspect.
Brief Description of the Drawings
Fig. 1 is a system architecture diagram of a data center provided by an embodiment of this application;
Fig. 2 is a schematic structural diagram of a computer device provided by an embodiment of this application;
Fig. 3 is a flowchart of a data access method provided by an embodiment of this application;
Fig. 4 is a schematic structural diagram of a data access device provided by an embodiment of this application;
Fig. 5 is a schematic structural diagram of another data access device provided by an embodiment of this application;
Fig. 6 is a schematic structural diagram of yet another data access device provided by an embodiment of this application.
Detailed Description of Embodiments
For ease of understanding, the system architecture involved in the embodiments of this application is introduced first.
The data access method provided by this application can be applied to a data center, which can provide a shared application execution environment for multiple users. The applications running in the data center may be data-intensive applications such as high-performance computing applications and big data applications.
For example, referring to Fig. 1, the data center includes a cluster system 10 and a storage system 11, between which a communication connection is established. The cluster system 10 is used to provide an execution environment for multiple applications, and the storage system 11 is used to store the application data of those applications.
Referring to Fig. 1, the cluster system 10 may include a management node 101 and multiple computing nodes 102. The management node 101 and each computing node 102 may communicate over a wired or wireless network, and the computing nodes 102 may likewise communicate with each other over a wired or wireless network. In the embodiments of this application, the management node 101 is used to assign, according to a user-specified cache policy for an application, the computing nodes 102 that will execute that application, and to deliver the cache policy and the user-specified mapping directory information to those computing nodes 102.
After receiving the user-specified cache policy and mapping directory information delivered by the management node 101, a computing node 102 allocates cache resources for the application from its own cache resources according to the cache policy, and prefetches the application's data stored in the storage system 11 into the application's cache resources according to the mapping directory information. It then runs the application and, while the application is running, accesses the application's cache resources according to the cache policy. Here, a computing node 102's own cache resources refer to the storage media included in that computing node 102. For example, they may include its dynamic random access memory (DRAM), large-capacity storage class memory (SCM), solid state disk (SSD), and other types of storage media, which is not limited in the embodiments of this application.
It should be noted that the management node 101 may assign multiple computing nodes 102 to execute a given application, so that each of those computing nodes 102 can run one or more tasks of the application.
Through the above method, the management node 101 can schedule cache resources in the computing nodes 102 for the applications to be run by different users according to their requirements, thereby controlling the corresponding computing nodes 102 to run the corresponding applications.
While a computing node 102 is running applications, when it detects that the amount of data cached in an application's cache resources has reached the second threshold, the computing node 102 may send to the management node 101 the bandwidth requirement of the data in that application's cache resources that is to be migrated to the storage system 11, to request that the management node 101 allocate IO bandwidth for the application's data to be migrated.
After receiving the bandwidth requirements of the applications from one or more computing nodes 102, the management node 101 may allocate IO bandwidth for each application's data to be migrated according to those bandwidth requirements and deliver the allocated IO bandwidth to the corresponding computing nodes 102. Accordingly, after receiving the IO bandwidth allocated by the management node 101 for an application's data to be migrated, a computing node 102 may send that data to the storage system 11 for storage according to the allocated IO bandwidth.
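On the computing-node side, honoring the granted IO bandwidth amounts to pacing the flush of cached blocks. A minimal planning sketch, assuming the grant is expressed as a per-period byte budget and that blocks are flushed in order (the function name `migrate` and the batch representation are illustrative):

```python
def migrate(data_blocks, granted_bandwidth):
    """Plan which cached blocks to flush in each period under the
    granted IO bandwidth (bytes per period).

    data_blocks: list of (block_id, size) pairs to migrate in order.
    Returns a list of per-period batches of block ids.
    """
    batches, batch, used = [], [], 0
    for block, size in data_blocks:
        if used + size > granted_bandwidth and batch:
            batches.append(batch)  # budget exhausted: defer to next period
            batch, used = [], 0
        batch.append(block)
        used += size
    if batch:
        batches.append(batch)
    return batches
```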
The storage system 11 includes multiple storage nodes 111, and each storage node 111 may communicate with each computing node 102 over a wired or wireless connection. Each storage node 111 is used to receive IO requests from the computing nodes 102. When an IO request is a read request sent by a computing node 102 according to the mapping directory information, the storage node 111 obtains the application's data according to the read request and returns it to the computing node 102, so that the computing node 102 can cache the application's data in the cache resources allocated for the application. When the IO request is a write request carrying an application's data to be migrated, the storage node 111 may persistently store that data according to the write request.
It should be noted that, in a possible implementation, a storage node 111 may include a control unit, a network card, and multiple storage devices. The control unit is used to communicate with the computing nodes 102 through the network card and to access the storage devices according to the IO requests of the computing nodes 102. The storage devices may include large-capacity storage class memory (SCM), solid state disk (SSD), and other types of storage devices, which is not limited in the embodiments of this application.
Optionally, in the embodiments of this application, the data center may also provide users with a login node for submitting cache policies and mapping directory information. Through the login node, a user submits the cache policy and mapping directory information of the application to be run to the management node 101, so that the management node 101 can schedule resources for the application accordingly.
Each of the management node 101, the computing nodes 102, the storage nodes 111, and the login node described above may be a separate computer device. The login node may be a terminal device such as a laptop, desktop computer, tablet computer, or smartphone. The management node 101 and the computing nodes 102 may be terminal devices or servers. A storage node 111 may be a server.
Fig. 2 is a schematic structural diagram of a computer device provided by an embodiment of this application. Both the management node and the computing nodes in the system architecture shown in Fig. 1 can be implemented by this computer device. Referring to Fig. 2, the computer device may include one or more processors 201, a communication bus 202, a main memory 203, and one or more communication interfaces 204.
The processor 201 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits for implementing the solutions of this application, for example an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The communication bus 202 is used to transfer information between the above components. The communication bus 202 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.
The main memory 203 may be read-only memory (ROM), random access memory (RAM), or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto. When the main memory 203 is RAM, it may be dynamic random access memory (DRAM), SCM, or the like. The main memory 203 may exist independently and be connected to the processor 201 through the communication bus 202, or it may be integrated with the processor 201.
The communication interface 204 uses any transceiver-like device to communicate with other devices or communication networks. The communication interface 204 includes a wired communication interface and may also include a wireless communication interface. The wired communication interface may be, for example, an Ethernet interface, which may be an optical interface, an electrical interface, or a combination thereof. The wireless communication interface may be a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
In some embodiments, the computer device may further include other storage media 205, for example a mechanical hard disk or a solid state disk.
In some embodiments, the computer device may include multiple processors, such as the processor 201 and the processor 206 shown in Fig. 2. Each of these processors may be a single-core or multi-core processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
In a specific implementation, as an embodiment, the computer device may further include an output device 207 and an input device 208. The output device 207 communicates with the processor 201 and can display information in multiple ways; for example, it may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 208 communicates with the processor 201 and can receive user input in multiple ways; for example, it may be a mouse, a keyboard, a touchscreen device, or a sensing device.
In some embodiments, the main memory 203 is configured to store a kernel, program code for executing the solutions of this application, and other instructions and data, and the processor 201 can execute the program code stored in the main memory 203. The program code may include one or more software modules, and the computer device may implement, by using the processor 201 and the program code in the main memory 203, the data access method provided in the embodiment of FIG. 3 below.
In the data access method provided in this application, a user can flexibly customize a cache policy for an application, and the cluster system can schedule cache resources for the application according to the cache policy submitted by the user. In other words, the cluster system can perceive the user's requirements and control the application's resource usage accordingly, thereby improving application performance. Moreover, the cluster system can prefetch data into the cache resources of the application according to mapping directory information submitted by the user, which increases data access speed and reduces usage complexity for the user. In addition, a management node in the cluster system can allocate IO bandwidth to the applications running on the computing nodes by collecting the bandwidth requirements of the computing nodes, reducing application performance problems caused by IO contention. The embodiments of this application are further described in detail below with reference to the accompanying drawings.
FIG. 3 is a flowchart of a data access method according to an embodiment of this application. The method can be applied to the cluster system in the data center shown in FIG. 1. Referring to FIG. 3, the method includes the following steps:
Step 301: A management node receives a cache configuration request for a first application submitted by a user, where the cache configuration request includes a cache policy and mapping directory information, the cache policy is used to indicate the cache requirements of the first application, and the mapping directory information is information about a first directory, in a storage system, in which application data of the first application is stored.
In this embodiment of this application, the user inputs the cache policy and the mapping directory information for the first application on a login node. The login node generates the cache configuration request for the first application according to the cache policy and the mapping directory information input by the user, and sends the cache configuration request to the management node. The cache configuration request carries the cache policy and the mapping directory information, and the first application is the application that the user wants to run. Correspondingly, the management node receives the cache configuration request for the first application sent by the login node.
For example, a command line tool is deployed on the login node, and the user may input the cache policy and the mapping directory information of the first application in a command line interface of the command line tool displayed on the login node. The login node obtains the cache policy and the mapping directory information of the first application input by the user in the command line interface.
Optionally, a service configuration client may also be deployed on the login node, and the user may input the cache policy and the mapping directory information of the first application in an interface of the service configuration client displayed on the login node. Correspondingly, the login node can obtain the cache policy and the mapping directory information of the first application through the service configuration client, and then generate the cache configuration request for the first application.
It should be noted that the cache policy of the first application may include resource requirement information of the first application and a data caching and access policy.
In this embodiment of this application, the first application may be divided into a plurality of tasks to be executed by a plurality of computing nodes. In this case, the resource requirement information of the first application specified by the user may include resource requirement information of each task of the first application, and the resource requirement information of the tasks may be the same or different. The resource requirement information may include computing resource requirement information and cache resource requirement information. The computing resource requirement information indicates the computing resources required by each task of the first application, for example, the number of processor cores and the clock frequency required to run each task of the first application. The cache resource requirement information includes the size of the cache space required by each task of the first application, and may further include the types of storage media included in that cache space; for example, the cache space required by each task of the first application may include two different storage media, DRAM and SCM. Optionally, the cache resource requirement information may further include the topology of the cache space required by each task of the first application, that is, the topology, on the corresponding computing node, of the storage media at each tier that constitute the cache space.
Optionally, the resource requirement information of the first application may instead directly indicate the resource requirements of the first application as a whole; that is, the resource requirement information is not at the task granularity described above, but at the application granularity.
The data caching and access policy may be used to indicate how the application data of the first application is cached and accessed. For example, the data caching and access policy may include a tiered caching policy that indicates that different types of application data of the first application are to be cached in different types of storage media. For another example, the data caching and access policy may include a data consistency policy that indicates that, when any data in the cache resources of the first application is accessed, a lock operation is performed on the accessed data to ensure data consistency. For still another example, the data caching and access policy may further include a security level policy that indicates the access permissions for the data in the application's cache resources. The above are merely some policies that may be included in the data caching and access policy in the embodiments of this application; the data caching and access policy may further include other policies flexibly customized by the user, to better satisfy user requirements and improve application performance.
In addition, the mapping directory information is information about the first directory, in the storage system, in which the application data of the first application is stored. For example, the mapping directory information may be the directory path of the first directory in the storage system. Alternatively, the mapping directory information may be other information that can indicate the storage location of the application data of the first application in the storage system, which is not limited in this embodiment of this application.
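As a rough sketch, the information carried by such a cache configuration request (cache policy plus mapping directory information) could be modelled as follows. All field names, defaults, and the flat structure are assumptions made for illustration only; the disclosure does not fix any concrete format:

```python
from dataclasses import dataclass, field

@dataclass
class TaskResourceRequirement:
    cpu_cores: int     # computing resource requirement of the task
    cache_bytes: int   # required cache space size
    cache_media: list  # storage media types, e.g. ["DRAM", "SCM"]

@dataclass
class CacheConfigRequest:
    app_name: str
    mapped_directory: str  # directory path of the first directory in the storage system
    per_task: dict = field(default_factory=dict)  # task id -> TaskResourceRequirement
    # Illustrative stand-ins for the data caching and access policy:
    tiered_caching: bool = True
    data_consistency_locking: bool = True
    security_level: str = "private"

req = CacheConfigRequest(
    app_name="first-app",
    mapped_directory="/apps/app1",
    per_task={1: TaskResourceRequirement(4, 32 << 30, ["DRAM", "SCM"])},
)
print(req.per_task[1].cache_bytes >> 30)  # required cache size in GiB
```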
Step 302: The management node allocates, according to the cache policy, a target computing node for executing the first application.
After receiving the cache configuration request for the first application, the management node allocates, from a plurality of computing nodes according to the resource requirement information included in the cache policy in the cache configuration request, a target computing node for executing the first application.
For example, if the cache policy includes resource requirement information at the task granularity, the management node may obtain, from the cache policy, the resource requirement information of each of the plurality of tasks of the first application, and allocate, according to that resource requirement information, a target computing node for executing each task of the first application.
As described in step 301, the resource requirement information of each task may include computing resource requirement information and cache resource requirement information of the task. Based on this, the management node may collect and update in real time the usage of the computing resources and the cache resources of each computing node, and then, according to the computing resource requirement information of each task and the most recently updated computing resource usage of each computing node, determine, from the plurality of computing nodes, candidate computing nodes that can satisfy the computing resource requirements of the tasks of the first application. Then, according to the cache resource requirement information of each task and the most recently updated cache resource usage of each candidate computing node, the management node further determines, from the candidate computing nodes, computing nodes that can satisfy the cache resource requirements of the tasks of the first application, and uses the finally determined computing nodes as the target computing nodes.
For example, the management node may determine the remaining computing resources on each computing node according to the most recently updated resources occupied by the applications running on that computing node, and then determine, from the plurality of computing nodes, candidate computing nodes whose remaining computing resources satisfy the computing resource requirements of the tasks of the first application. Then, according to the most recently updated size of the remaining cache space of each candidate computing node and the types of storage media that constitute the remaining cache space, the management node determines, from the candidate computing nodes, computing nodes whose remaining cache space is larger than the cache space required by the tasks of the first application and whose remaining cache space includes the storage media required by those tasks, so as to obtain the target computing nodes.
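The two-pass selection described above, filtering first on remaining computing resources and then on remaining cache space and media types, can be sketched as follows. This is a minimal illustration; the field names, the in-memory node list, and the concrete thresholds are assumptions, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    """Most recently updated resource usage of a computing node (illustrative fields)."""
    node_id: str
    free_cores: int
    free_cache_bytes: int
    cache_media: set  # e.g. {"DRAM", "SCM"}

def select_target_nodes(nodes, need_cores, need_cache_bytes, need_media):
    # First pass: keep nodes whose remaining computing resources
    # satisfy the task's computing resource requirement.
    candidates = [n for n in nodes if n.free_cores >= need_cores]
    # Second pass: among the candidates, keep nodes whose remaining cache
    # space is large enough and contains the required storage media.
    return [n for n in candidates
            if n.free_cache_bytes >= need_cache_bytes
            and need_media <= n.cache_media]

nodes = [
    NodeState("node-a", free_cores=8,  free_cache_bytes=64 << 30,  cache_media={"DRAM", "SCM"}),
    NodeState("node-b", free_cores=2,  free_cache_bytes=128 << 30, cache_media={"DRAM", "SCM"}),
    NodeState("node-c", free_cores=16, free_cache_bytes=8 << 30,   cache_media={"DRAM"}),
]
targets = select_target_nodes(nodes, need_cores=4,
                              need_cache_bytes=32 << 30,
                              need_media={"DRAM", "SCM"})
print([n.node_id for n in targets])  # only node-a passes both filters
```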
Optionally, the management node may instead first determine the candidate computing nodes from the plurality of computing nodes according to the cache resource requirement information of each task, and then determine the target computing nodes from the candidate computing nodes according to the computing resource requirement information of each task of the first application. Details are not described again in this embodiment of this application.
It should be noted that, through the above method, the management node can determine the computing node that runs each task of the first application. The computing nodes running the tasks may all be different, in which case there are a plurality of target computing nodes. Optionally, all the tasks may run on the same target computing node, in which case there is one target computing node. Alternatively, some tasks may be executed by one target computing node and other tasks by another target computing node, in which case there are also a plurality of target computing nodes.
In another implementation, if the cache policy includes resource requirement information at the application granularity, the management node may directly allocate a target computing node to the first application according to the resource requirement information of the first application. For the implementation, refer to the foregoing manner of allocating a target computing node to each task; details are not described again in this embodiment of this application.
Optionally, the management node may also determine the resource requirement information of each of the plurality of tasks of the first application according to the resource requirement information of the first application and the task division principle of the first application, and allocate a target computing node to each task by using the method of allocating target computing nodes described above.
Step 303: The management node sends the cache policy and the mapping directory information to the target computing node.
After determining the target computing nodes that execute the tasks of the first application, the management node may deliver the cache policy and the mapping directory information of the tasks of the first application to the target computing nodes, so as to control the target computing nodes to allocate cache resources to the tasks of the first application according to the cache policy, and to access the cache resources of the first application according to the mapping directory information and the cache policy.
Optionally, when there are a plurality of target computing nodes and the cache policy of each task is the same, the management node may deliver the cache policy and the mapping directory information to each target computing node. When there are a plurality of target computing nodes and the resource requirement information of the tasks included in the cache policy differs, the management node may use the data caching and access policy included in the cache policy together with the resource requirement information of each task as the cache policy of the corresponding task, and then deliver the mapping directory information and the cache policy of each task to the target computing node corresponding to the task, where the target computing node corresponding to a task is the target computing node that runs the task.
Optionally, while delivering the cache policy and the mapping directory information to the target computing nodes, the management node may further deliver, to each target computing node, the identifier of the task to be run, so as to indicate which task of the first application the target computing node is to run. The identifier of a task uniquely identifies the task.
After each target computing node receives the cache policy and the mapping directory information sent by the management node, it can run the first application through the following steps 304 to 306.
Step 304: The target computing node allocates cache resources to the first application according to the cache policy.
After receiving the cache policy and the mapping directory information delivered by the management node, the target computing node first allocates cache resources to the first application according to the cache policy.
The target computing node may obtain the cache resource requirement information from the received cache policy, and then allocate, according to that information, cache resources to the task of the first application that it is to execute. The following uses one target computing node as an example for description; for ease of description, this target computing node is referred to as the first target computing node.
For example, if a first task runs on the first target computing node, the first target computing node obtains the cache resource requirement information of the first task from the received cache policy, and then, according to that information, allocates, from its own cache resources, a cache space that satisfies the cache resource requirements of the first task.
Optionally, when the cache resource requirement information of the tasks is the same and one task of the first application runs on each target computing node, the first target computing node may allocate, from its own cache resources according to the cache resource requirement information, a cache space of the size indicated by that information as cache space of the first application. In this case, the allocated cache space of the first application may serve the first task of the first application running on the first target computing node, that is, store task data of the first task, or may serve other tasks of the first application running on other target computing nodes, that is, store task data of those other tasks.
Each target computing node allocates cache space to the tasks of the first application according to the cache policy delivered by the management node. In this way, the cache spaces of the tasks of the first application on the target computing nodes together constitute the cache resources of the first application.
Optionally, if the resource requirement information in the cache policy is at the application granularity, and the management node directly delivers the resource requirement information of the first application to the target computing node, there is one target computing node. In this case, after receiving the cache resource requirement information of the first application, the target computing node may allocate, from its own cache resources according to that information, cache space to the first application. In this way, the cache resources of the first application are located on one computing node.
Step 305: The target computing node prefetches the application data of the first application into the cache resources of the first application according to the mapping directory information.
After allocating the corresponding cache resources to the first application, the target computing node may obtain the application data of the first application from the storage system according to the mapping directory information, and then cache the application data into the cache resources of the first application. The following description still uses the first target computing node as an example.
In a first possible case, if the first target computing node has allocated cache space to the first task of the first application that it executes, and the mapping directory information is the directory path of the first directory in the storage system, the first target computing node may determine the directory identifier of the subdirectory corresponding to the first task, obtain, from the storage system according to the directory path of the first directory and the directory identifier of the subdirectory corresponding to the first task, the data stored under that subdirectory of the first directory, and then store the obtained data in the cache space allocated to the first task.
The first target computing node may obtain, according to the task identifier of the first task, the directory identifier of the subdirectory corresponding to the first task from preset mappings between task identifiers and directory identifiers of subdirectories. Alternatively, the first target computing node may generate the directory identifier of the subdirectory corresponding to the first task from the task identifier of the first task by using a preset rule. For example, if the task number of the first task is 1, and the preset rule for generating the directory identifier of the subdirectory corresponding to a task is "processor" + task number, the directory identifier of the subdirectory corresponding to the first task obtained according to the preset rule is processor 1.
Then, the first target computing node may obtain, according to the directory path of the first directory, from the first directory stored in the storage system, the data under the subdirectory whose directory identifier matches that of the subdirectory corresponding to the first task, that is, the task data of the first task, and then store the task data of the first task in the cache space of the first task.
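Under the assumptions stated in the example above (the illustrative "processor" + task-number rule, with a dictionary standing in for the storage system), the per-task prefetch path could be derived as follows:

```python
def subdir_for_task(task_number, prefix="processor"):
    # Preset rule from the example above: "processor" + task number.
    return f"{prefix} {task_number}"

def prefetch_task_data(storage, first_dir_path, task_number):
    """Fetch the data under the task's subdirectory of the first directory.
    `storage` is a toy stand-in for the storage system, mapping paths to data."""
    subdir = subdir_for_task(task_number)
    path = f"{first_dir_path}/{subdir}"
    return path, storage.get(path)

# Toy storage-system contents; the path layout is an assumption.
storage = {"/apps/app1/processor 1": b"task-1 input data"}
path, data = prefetch_task_data(storage, "/apps/app1", 1)
print(path)  # /apps/app1/processor 1
print(data)
```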
It should be noted that, when the task data of the first task is stored in the cache space of the first task, if the cache policy further includes a data caching and access policy, and the data caching and access policy includes a tiered caching policy, the first target computing node may further store different data in different types of storage media according to the data types of the task data of the first task.
For example, among the task data of the first task, hotspot data whose access frequency is higher than a first threshold is stored in higher-performance memory (which may also be referred to as a storage medium), while data with a lower access frequency is stored in a relatively lower-performance storage medium. For example, metadata and data other than metadata may be stored in different types of storage media. The first threshold may be set according to service requirements, may be set according to the processing efficiency of the task data, may be an empirical value, or may be set according to the processing capability of the system.
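A minimal sketch of such frequency-based tiering, assuming DRAM as the higher-performance medium and SCM as the lower-performance one (the tier names, record format, and threshold value are illustrative):

```python
def place_in_tier(records, first_threshold):
    """Assign each (name, access_count) record to a storage tier:
    data accessed more often than the threshold goes to the faster tier."""
    placement = {"DRAM": [], "SCM": []}
    for name, access_count in records:
        tier = "DRAM" if access_count > first_threshold else "SCM"
        placement[tier].append(name)
    return placement

records = [("metadata", 120), ("block-0", 80), ("block-1", 3)]
print(place_in_tier(records, first_threshold=50))
# {'DRAM': ['metadata', 'block-0'], 'SCM': ['block-1']}
```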
In addition, the first target computing node may execute one or more tasks of the first application. When the first target computing node executes a plurality of tasks of the first application, the data under the subdirectory corresponding to each task to be executed may be prefetched, in the above manner, into the cache space corresponding to that task.
In a second possible case, if the cache resource requirement information of the tasks is the same, and each target computing node has allocated, from its own cache resources, a cache space of the size indicated by the cache resource requirement information, the first target computing node may obtain, from the storage system, the data under the first directory indicated by the mapping directory information, and then perform a hash operation on the directory path of the obtained data to obtain a hash value corresponding to the data. The first target computing node determines, from the plurality of target computing nodes, the target computing node whose node identifier matches the hash value. If the target computing node whose node identifier matches the hash value is the first target computing node itself, the first target computing node stores the data in the cache space that it has allocated to the first application. If the target computing node whose node identifier matches the hash value is another target computing node, for example, a second target computing node, the first target computing node may send the data to the second target computing node, and after receiving the data, the second target computing node stores the data in the cache space that it has allocated to the first application.
When the data is stored in the cache space allocated to the first application, the method described above may likewise be used: the data is stored in the corresponding type of storage medium according to the tiered caching policy included in the data caching and access policy.
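The hash-based placement in the second case can be sketched as follows. Note that the disclosure only says the node identifier "matches" the hash value; mapping the hash onto the node list by modulo, and modelling "sending" data as a write into the owning node's cache dictionary, are assumptions of this sketch:

```python
import hashlib

def owner_node(directory_path, node_ids):
    # Hash the directory path and map the hash value onto one of the
    # target computing nodes (modulo mapping is an assumption).
    digest = hashlib.sha256(directory_path.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(node_ids)
    return node_ids[index]

def store_or_forward(local_id, directory_path, data, node_ids, caches):
    owner = owner_node(directory_path, node_ids)
    if owner == local_id:
        # This node's identifier matches the hash: keep the data locally.
        caches[local_id][directory_path] = data
    else:
        # Otherwise "send" the data to the owning node (modelled here
        # by writing directly into that node's cache dictionary).
        caches[owner][directory_path] = data
    return owner

node_ids = ["node-a", "node-b", "node-c"]
caches = {n: {} for n in node_ids}
owner = store_or_forward("node-a", "/apps/app1/processor 1", b"data", node_ids, caches)
print(owner, caches[owner])
```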
In a third possible case, if the target computing node has allocated cache space to the first application according to cache resource information of the first application at the application granularity, the target computing node may directly obtain, from the storage system, the data under the first directory indicated by the mapping directory information, and store the data in the cache space allocated to the first application.
The above are several possible implementations of prefetching the application data of the first application provided in the embodiments of this application. Optionally, the target computing node may also use a multi-replica mechanism to prefetch the task data of the tasks of the first application, or prefetch the data of the first application in other implementations, which is not limited in the embodiments of this application.
In addition, it should be noted that the data prefetched from the first directory of the storage system may be all of the data in the first directory, or may be part of the data in the first directory. For each task, all or part of the task's data may be prefetched, which is also not limited in the embodiments of this application. When part of the data is prefetched, the more important data may be prefetched according to the access frequency of the data or other information that can indicate the importance of the data.
步骤306:目标计算节点在运行第一应用的过程中,根据缓存策略,访问第一应用的缓存资源。Step 306: During the running of the first application, the target computing node accesses the cache resource of the first application according to the cache policy.
通过步骤304和步骤305为第一应用分配缓存资源,并将第一应用的应用数据预取至第一应用的缓存资源中之后,目标计算节点启动第一应用的运行脚本,从而开始运行第一应用。After allocating cache resources for the first application through steps 304 and 305, and prefetching the application data of the first application into the cache resources of the first application, the target computing node starts the running script of the first application to start running the first application. application.
The first target computing node is again taken as an example. The first target computing node launches the run script of the first application and executes the first task of the first application that was assigned to it.
While executing the first task, the first target computing node may need to read the application data of the first application, or write data generated during task execution into the first application's cache resources. Accordingly, the first target computing node may generate an IO request for the operation to be performed, where the IO request may be a read request or a write request and may include the directory path of the directory in which the accessed target data resides.
After obtaining the IO request, the first target computing node may first compare the directory path of the target data with the mapping directory information. If the directory path of the target data contains the mapping directory information, it can be determined that the target data to be accessed is data under the first directory. In this case, because the data under the first directory was prefetched into the first application's cache resources in step 305, the first target computing node can access the first application's cache resources directly.
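The prefix comparison between the target data's directory path and the mapping directory information can be sketched as follows (a minimal illustration; the function name and the path normalisation are assumptions):

```python
import posixpath

def is_under_mapped_directory(target_path, mapped_dir):
    """Return True when the IO request's target path falls under the
    user-specified mapping directory, so the request can be served from
    the application's cache resource instead of the storage system."""
    # Normalise both paths so "/data/app1/" and "/data/app1" compare equal.
    target = posixpath.normpath(target_path)
    mapped = posixpath.normpath(mapped_dir)
    return target == mapped or target.startswith(mapped + "/")
```

The trailing-slash check prevents a false match such as `/data/app10` being treated as a subpath of `/data/app1`.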
If the target computing nodes prefetched the data using the method described in the first possible case of step 305, then after determining that the data to be accessed by the IO request is data under the first directory, the first target computing node accesses the first application's cache resources according to the IO request.
It should be noted that if the IO request is a read request, the first target computing node first looks up the target data in the cache space of the first task; on a hit, it obtains the target data. On a miss, it sends the IO request to the other target computing nodes. After receiving the IO request, each of the other target computing nodes looks up the target data in the cache space it allocated for its tasks of the first application; on a hit it returns the target data to the first target computing node, and on a miss it returns a notification message informing the first target computing node that the data could not be obtained. If none of the other target computing nodes hits the target data, the first target computing node may obtain the target data from the storage system. Optionally, if the IO request is a write request, the first target computing node may write the target data into the cache space of the first task.
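The read path just described — first the first task's cache space, then the cache spaces of the other target computing nodes, finally the storage system — can be sketched as follows (plain dicts stand in for the cache spaces and the storage system; all names are illustrative):

```python
def read_target_data(key, local_cache, peer_caches, storage):
    """Resolve a read request through three tiers: the local task's
    cache space, the cache spaces of the other target computing nodes,
    and, as a last resort, the storage system."""
    if key in local_cache:            # hit in the first task's cache space
        return local_cache[key]
    for peer in peer_caches:          # forward the IO request to each peer
        if key in peer:
            return peer[key]
    return storage[key]               # every peer missed: go to storage
```

In the real system the peer lookup would be a network round trip rather than a dict probe, but the hit/miss ordering is the same.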
As described above, when the first target computing node fails to hit the target data in the cache space of the first task, it may send the IO request to the other target computing nodes. Likewise, after generating an IO request, another target computing node that fails to hit the requested data in the cache space it allocated for the first application may send the IO request to the first target computing node. In that case, the first target computing node receives the IO request sent by the other target computing node and accesses the cache space of the first task according to it. Optionally, the computing nodes may send IO requests to access each other's cache space using remote direct memory access (RDMA).
Optionally, if the target computing nodes prefetched the data using the method described in the second possible case of step 305, the first target computing node may hash the directory path of the target data to be accessed by the IO request to obtain a hash value for the target data, and determine the target computing node whose node identifier matches that hash value. If the determined target computing node is itself, the first target computing node accesses the cache space it allocated for the first application to read or write the target data.
Optionally, if the determined target computing node is another target computing node, the first target computing node sends the IO request to that node, which then reads or writes the target data by accessing the cache space it allocated for the first application.
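One possible way to match a directory path's hash value to a node identifier — an assumption for illustration, since the disclosure does not fix a hash function — is to hash the path and reduce it modulo the node list:

```python
import hashlib

def owner_node(directory_path, node_ids):
    """Map a directory path to the target computing node responsible
    for it, so every node routes an IO request for the same path to
    the same owner."""
    digest = hashlib.sha256(directory_path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(node_ids)
    return node_ids[index]
```

Because the mapping is deterministic, the node that prefetched a subdirectory in step 305 and the node chosen at access time always agree, with no coordination needed.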
When the target data is not hit in the first application's cache space, the corresponding target computing node may likewise obtain the target data from the storage system.
Optionally, if the target computing node prefetched the data using the method described in the third possible case of step 305, then because there is only one target computing node, that node can access the cache space it allocated for the first application according to the IO request. The access procedure follows the implementations described above and is not repeated here.
Optionally, if the data caching and access policy in the cache policy further includes a data consistency policy, then in this step, when a target computing node modifies, deletes, or writes target data in the first application's cache resources according to an IO request, it may also lock the target data, thereby preventing other target computing nodes from accessing it and ensuring data consistency.
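A minimal sketch of the data consistency policy's locking, assuming a single in-process lock for simplicity (a deployed system would use a distributed or per-key lock across nodes):

```python
import threading

class ConsistentCache:
    """Cache space whose entries are locked for the duration of a
    modify/delete/write, so concurrent accessors observe consistent data."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:          # lock before mutating the entry
            self._data[key] = value

    def read(self, key):
        with self._lock:          # readers also serialise behind the lock
            return self._data.get(key)
```

The `with self._lock:` blocks correspond to the lock operation the text describes: while one accessor holds the lock, others block instead of seeing a half-written value.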
Step 307: multiple computing nodes send to the management node the bandwidth requirements of the data, held in the cache resources of the applications they run, that is to be migrated to the storage system. The multiple computing nodes include the target computing node, and the multiple applications include the first application.
In the embodiments of this application, when a computing node detects that the amount of data cached in the cache space it allocated for an application reaches a second threshold, it may send that application's bandwidth requirement to the management node. Correspondingly, the management node can receive in real time the bandwidth requirements of the applications sent by the computing nodes. A bandwidth requirement indicates the bandwidth needed to migrate the to-be-migrated data in the corresponding application's cache resources on the corresponding computing node to the storage system. For example, the bandwidth requirement may include the amount of to-be-migrated data in the application's cache resources on that computing node; optionally, it may also include other information such as an application identifier, which the embodiments of this application do not limit. The second threshold may be preset according to the size of the cache space allocated for the application on that computing node; for example, it may be a preset proportion of the total capacity of that cache space, such as 80%, or some other value, which the embodiments of this application likewise do not limit.
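The second-threshold check can be sketched as follows (the 80% ratio follows the example in the text; the function name is illustrative):

```python
def needs_migration(cached_bytes, cache_capacity, ratio=0.8):
    """Second-threshold check: when the data cached for an application
    reaches a preset proportion (80% here) of its allocated cache space,
    the computing node reports a bandwidth requirement to the management
    node."""
    return cached_bytes >= cache_capacity * ratio
```

The returned flag is what would trigger sending the bandwidth requirement (data amount, application identifier) in step 307.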
The multiple computing nodes include the target computing node. That is, when the target computing node detects that the amount of data cached in the cache space it allocated for the first application reaches the second threshold, it may send the first application's bandwidth requirement to the management node. In this case, the first application's bandwidth requirement indicates the bandwidth needed to migrate the to-be-migrated data in the first application's cache space on the target computing node to the storage system.
Step 308: the management node allocates IO bandwidth for the to-be-migrated data in the first application's cache resources according to the bandwidth requirement.
After receiving the bandwidth requirements of the applications sent by the multiple computing nodes, including the target computing node, the management node may allocate corresponding IO bandwidth to each application's to-be-migrated data according to the bandwidth each application's requirement indicates and the storage system's current remaining bandwidth.
For example, the management node may compute the ratio of the bandwidths required by the applications and then allocate IO bandwidth to each application according to that ratio and the storage system's current remaining bandwidth. If the storage system's current remaining bandwidth is not greater than the total bandwidth required by the applications, the IO bandwidth allocated to each application will be less than what it requires; if the remaining bandwidth is greater than the total required, the IO bandwidth allocated to each application may equal what it requires.
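The proportional allocation described in this example can be sketched as follows (the function name and the dict representation of demands are assumptions):

```python
def allocate_io_bandwidth(demands, remaining_bw):
    """Split the storage system's remaining bandwidth among applications
    in proportion to their demands; no application receives more than
    it asked for.

    `demands` maps application id -> required bandwidth; `remaining_bw`
    is the storage system's current remaining bandwidth.
    """
    total = sum(demands.values())
    if total <= remaining_bw:
        return dict(demands)              # enough headroom: grant in full
    scale = remaining_bw / total          # otherwise scale proportionally
    return {app: bw * scale for app, bw in demands.items()}
```

When demand exceeds the remaining bandwidth, every application is scaled by the same factor, so the ratios between applications are preserved.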
The management node may also adopt other principles to allocate IO bandwidth to each application's to-be-migrated data, which the embodiments of this application do not limit.
In addition, the IO bandwidth allocated for each application's to-be-migrated data indicates the maximum amount of data the application is allowed to migrate per unit time. For example, when the IO bandwidth allocated for the first application's to-be-migrated data is 30 MB/s, the target computing node is allowed to migrate at most 30 MB of the first application's cached data to the storage system per second.
Because the multiple applications include the first application, the management node can allocate IO bandwidth for the to-be-migrated data in the first application's cache resources through the method above.
Step 309: the management node sends the IO bandwidth allocated for the first application to the target computing node.
After allocating IO bandwidth for the to-be-migrated data in each application's cache resources, the management node may send the IO bandwidth allocated for each application to the corresponding computing node.
For example, the management node may send to the target computing node the IO bandwidth allocated for the to-be-migrated data in the first application's cache resources.
Step 310: the target computing node stores the to-be-migrated data in the first application's cache resources into the storage system according to the IO bandwidth allocated for the first application.
The IO bandwidth allocated for the first application indicates the amount of the first application's cached data the target computing node is allowed to migrate this time. Accordingly, based on the IO bandwidth the management node allocated for the first application, the target computing node takes from the cache space it allocated for the first application an amount of data no greater than that IO bandwidth as the to-be-migrated data, and then migrates it to the storage system for persistent storage according to the user-specified mapping directory information. Migrating data to the storage system according to the mapping directory information is the reverse of prefetching data from the storage system according to that information; for the specific implementation, refer to the description above, which is not repeated here.
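Selecting one interval's batch of to-be-migrated data so that it does not exceed the allocated IO bandwidth might look like the following (illustrative only; `pending` is assumed to be (name, size) pairs):

```python
def migrate_batch(pending, io_bandwidth):
    """Pick one interval's worth of cached data to flush to the storage
    system: the total size of the batch never exceeds the IO bandwidth
    granted by the management node (e.g. 30 MB for a 30 MB/s grant
    over one second)."""
    batch, used = [], 0
    for name, size in pending:
        if used + size <= io_bandwidth:
            batch.append(name)
            used += size
    return batch
```

Entries left out of the batch simply wait for the next grant, so migration throughput tracks the bandwidth the management node allocated rather than the node's raw IO capability.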
After the target computing node starts running the first application, whenever it detects that the amount of data in the cache space it allocated for the first application reaches the second threshold, it may request IO bandwidth from the management node through steps 307-310 above, so as to migrate the data in the first application's cache space to the storage system according to that IO bandwidth. Once the first application finishes running and all the data in its cache space has been migrated to the storage system, the target computing node may release the cache space allocated for the first application.
In the embodiments of this application, cache resources are scheduled for the first application according to the user-submitted cache policy for the first application, and data is prefetched into the first application's cache resources according to the user-submitted mapping directory information. Subsequently, while the first application runs, its cache resources are accessed according to that cache policy. The embodiments of this application can thus perceive the user's requirements and control the application's resource usage accordingly, thereby improving application performance.
Second, in the embodiments of this application, the first application's data stored in the storage system can be prefetched into the cache resources allocated for the first application according to the user-specified mapping directory information. While the first application subsequently runs, if the data to be accessed is under the directory indicated by the mapping directory information, the cache resources allocated for the first application can be accessed directly, increasing data access speed. Moreover, prefetching data from the storage system via user-specified mapping directory information reduces the complexity of use for the user.
Third, in the embodiments of this application, data can be cached in and accessed from the first application's cache resources according to the data caching and access policy in the user-specified cache policy. For example, caching data according to a tiered caching policy improves data access performance and saves cache-space resources, while accessing data according to a data consistency policy ensures the accuracy of the data during access. In addition, the user can flexibly customize other policies to configure data caching and access flexibly.
Finally, in the embodiments of this application, when a computing node detects that the amount of data in the cache resources it allocated for an application reaches the second threshold, it may report to the management node the bandwidth requirement of the to-be-migrated data of the application it runs. Based on the bandwidth requirements collected from the computing nodes, the management node can allocate to the applications running on those nodes the IO bandwidth used to migrate the to-be-migrated data, thereby controlling how much data each computing node migrates to the storage system. This prevents the application data traffic of different computing nodes from exceeding the storage system's available bandwidth and causing IO bandwidth contention, enables the applications in the global view to access the storage system in an orderly manner, and thus reduces application performance problems caused by IO contention. Furthermore, based on the user-specified mapping directory information, the cluster system can complete data copying automatically, without manual user operation, reducing the operational complexity for the user.
It should be noted that, in the foregoing embodiments, the steps relating to the management node may be implemented separately as a data access method on the management-node side, and the steps relating to the computing node may be implemented separately as a data access method on the computing-node side.
The data access method provided by the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 3. The data access apparatuses provided by the embodiments of this application are described below with reference to FIG. 4 to FIG. 6.
Referring to FIG. 4, an embodiment of this application provides a data access apparatus 400, which may be applied in a cluster system. The apparatus 400 includes:
a receiving module 401, configured to perform step 301 in the foregoing embodiments;
a scheduling module 402, configured to perform steps 302-304 in the foregoing embodiments;
a prefetching module 403, configured to perform step 305 in the foregoing embodiments; and
an access module 404, configured to perform step 306 in the foregoing embodiments.
It should be understood that the data access apparatus 400 of the embodiments of this application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data access method shown in FIG. 3 is implemented in software, the data access apparatus 400 and its modules may also be software modules.
Optionally, the scheduling module 402 is mainly configured to:
determine resource requirement information for each of the multiple tasks of the first application according to the cache policy, the resource requirement information including the size of the cache space required by each task and the types of storage media it includes; and
allocate cache space for each task according to that task's resource requirement information.
Optionally, the mapping directory information includes the directory path of the first directory, and the prefetching module 403 is mainly configured to:
determine the directory identifier of the subdirectory corresponding to each of the multiple tasks of the first application;
obtain, from the storage system, the data under the subdirectory corresponding to each task stored under the first directory, according to the directory path of the first directory and the directory identifier of each task's subdirectory; and
store the data under each task's subdirectory into the cache resources of the first application.
Optionally, the access module 404 is mainly configured to:
when the cache policy includes a tiered caching policy, cache the different types of task data of each task of the first application into the storage media of the corresponding types in the first application's cache resources according to the tiered caching policy; and
when the cache policy includes a data consistency policy, lock the accessed task data whenever task data in any cache space is accessed.
Optionally, the apparatus 400 is further configured to:
obtain an input/output (IO) request; and
if the data accessed by the IO request is data under the first directory indicated by the mapping directory information, perform the step of accessing the first application's cache resources according to the cache policy.
Optionally, the apparatus 400 is further configured to:
obtain the bandwidth requirement of the data to be migrated to the storage system in the cache resources of each of multiple applications, the multiple applications including the first application;
allocate IO bandwidth for the to-be-migrated data in the first application's cache resources according to the bandwidth requirement; and
store the to-be-migrated data in the first application's cache resources into the storage system according to the IO bandwidth.
The data access apparatus 400 of the embodiments of this application may correspondingly perform the methods described in the embodiments of this application, and the foregoing and other operations and/or functions of the units in the data access apparatus 400 respectively implement the corresponding procedures performed by the corresponding nodes in the methods of FIG. 3; for brevity, they are not repeated here.
In summary, in the embodiments of this application, cache resources are scheduled for the first application according to the user-submitted cache policy for the first application, and data is prefetched into the first application's cache resources according to the user-submitted mapping directory information. Subsequently, while the first application runs, its cache resources are accessed according to that cache policy. The embodiments of this application can thus perceive the user's requirements and control the application's resource usage accordingly, thereby improving application performance.
Referring to FIG. 5, an embodiment of this application provides a data access apparatus 500, which may be applied in a management node. The apparatus 500 includes:
a receiving module 501, configured to perform step 301 in the foregoing embodiments; and
a scheduling module 502, configured to perform step 302 and the operation of sending the cache policy to the target computing node in step 303 of the foregoing embodiments, so as to control the target computing node to perform steps 304 to 306.
It should be understood that the apparatus 500 of the embodiments of this application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data access method shown in FIG. 3 is implemented in software, the apparatus 500 and its modules may also be software modules.
Optionally, the scheduling module 502 is mainly configured to:
obtain, from the cache policy, resource requirement information for each of the multiple tasks of the first application, the resource requirement information including the size of the cache space required by each task and the types of storage media it includes;
allocate target computing nodes for executing the tasks of the first application according to the resource requirement information; and
send the cache policy to the target computing nodes, so as to instruct each target computing node to allocate cache space for its corresponding task from its own cache space according to the resource requirement information in the cache policy.
Optionally, the mapping directory information includes the directory path of the first directory, and the scheduling module 502 is mainly configured to:
send the directory path of the first directory to the target computing node, so as to instruct the target computing node to prefetch, from the storage system according to that directory path, the data under the subdirectories of the tasks stored under the first directory, and store the obtained data into the first application's cache resources.
Optionally, the apparatus 500 is further configured to:
receive the bandwidth requirements, sent by multiple computing nodes, of the data in each application's cache resources to be migrated to the storage system, the multiple computing nodes including the target computing node;
allocate IO bandwidth for the to-be-migrated data in the first application's cache resources according to the bandwidth requirement; and
send the IO bandwidth allocated for the first application to the target computing node, so as to instruct the target computing node to store the to-be-migrated data in the first application's cache resources into the storage system according to that IO bandwidth.
The apparatus 500 of the embodiments of this application may correspondingly perform the methods described in the embodiments of this application, and the foregoing and other operations and/or functions of the units in the apparatus 500 respectively implement the corresponding procedures performed by the corresponding nodes in the methods of FIG. 3; for brevity, they are not repeated here.
In summary, in the embodiments of this application, the management node schedules cache resources for the first application according to the user-submitted cache policy for the first application and, according to the user-submitted mapping directory information, controls the computing nodes to prefetch data into the first application's cache resources. Subsequently, while the first application runs, the computing nodes are controlled to access the first application's cache resources according to that cache policy. The embodiments of this application can thus perceive the user's requirements and control the application's resource usage accordingly, thereby improving application performance.
参见图6,本申请还提供了一种数据访问装置600,如图6所示,该数据访问装置600可以应用于计算节点中,该数据访问装置600包括:Referring to FIG. 6, the present application also provides a data access device 600. As shown in FIG. 6, the data access device 600 can be applied to computing nodes, and the data access device 600 includes:
接收模块601，用于接收用户指定的第一应用的缓存策略和映射目录信息，缓存策略用于指示第一应用的缓存需求，映射目录信息为存储系统中存储的第一应用的应用数据所在的第一目录的信息；The receiving module 601 is configured to receive a user-specified cache policy and mapping directory information for the first application, where the cache policy indicates the caching requirements of the first application, and the mapping directory information is information about the first directory, in the storage system, in which the application data of the first application is stored;
分配模块602,用于执行前述实施例中的步骤304;An allocation module 602, configured to execute step 304 in the foregoing embodiment;
预取模块603,用于执行前述实施例中的步骤305;A prefetch module 603, configured to execute step 305 in the foregoing embodiment;
访问模块604,用于执行前述实施例中的步骤306。The access module 604 is configured to execute step 306 in the foregoing embodiment.
应理解的是，本发明本申请实施例的数据访问装置600可以通过中央处理器（central processing unit，CPU）实现，也可以通过专用集成电路（application-specific integrated circuit，ASIC）实现，或可编程逻辑器件（programmable logic device，PLD）实现，上述PLD可以是复杂程序逻辑器件（complex programmable logical device，CPLD），现场可编程门阵列（field-programmable gate array，FPGA），通用阵列逻辑（generic array logic，GAL）或其任意组合。也可以通过软件实现图3所示的数据访问方法时，数据访问装置600及其各个模块也可以为软件模块。It should be understood that the data access apparatus 600 in this embodiment of the present application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. Alternatively, when the data access method shown in FIG. 3 is implemented by software, the data access apparatus 600 and its modules may also be software modules.
可选地,分配模块602主要用于:Optionally, the allocation module 602 is mainly used for:
从缓存策略中获取第一应用的多个任务中每个任务的资源需求信息,资源需求信息包括每个任务所需的缓存空间的大小和包括的存储介质的类型;Acquiring resource requirement information of each of the multiple tasks of the first application from the cache policy, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included;
根据每个任务的资源需求信息,在自身的缓存资源中为自身所运行的第一任务分配缓存空间,第一任务为在计算节点上运行的多个任务中的任一个任务。According to the resource requirement information of each task, a cache space is allocated in its own cache resource for the first task run by itself, where the first task is any one of multiple tasks running on the computing node.
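The allocation step just described can be sketched as a small allocator. The layout of the policy dictionary and the `CacheAllocator` name are assumptions for this example; the essential behavior matches the text: read each task's resource-requirement entry (cache-space size and storage-medium type) from the cache policy, then reserve that much space for the task this node runs.

```python
# Minimal sketch of per-task cache-space allocation on a computing node.
class CacheAllocator:
    def __init__(self, capacity_by_medium):
        # Free capacity per medium, e.g. {"dram": 1024, "ssd": 8192} in MB.
        self.free = dict(capacity_by_medium)
        self.allocations = {}

    def allocate_for_task(self, cache_policy, task_id):
        # Resource-requirement info: required size and storage-medium type.
        req = cache_policy["tasks"][task_id]
        medium, size = req["medium"], req["size_mb"]
        if self.free.get(medium, 0) < size:
            raise MemoryError(f"not enough {medium} for task {task_id}")
        self.free[medium] -= size
        self.allocations[task_id] = (medium, size)
        return medium, size

alloc = CacheAllocator({"dram": 1024, "ssd": 8192})
policy = {"tasks": {"t1": {"medium": "dram", "size_mb": 256},
                    "t2": {"medium": "ssd", "size_mb": 4096}}}
alloc.allocate_for_task(policy, "t1")
```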
可选地,预取模块603主要用于:Optionally, the prefetching module 603 is mainly used for:
确定第一任务对应的子目录的目录标识;Determine the directory identifier of the subdirectory corresponding to the first task;
根据第一目录的目录路径和第一任务对应的子目录的目录标识,从存储系统中获取第一目录下存储的第一任务对应的子目录下的数据;According to the directory path of the first directory and the directory identifier of the subdirectory corresponding to the first task, the data under the subdirectory corresponding to the first task stored in the first directory is obtained from the storage system;
将第一任务对应的子目录下的数据存储至第一任务的缓存空间中。The data under the subdirectory corresponding to the first task is stored in the cache space of the first task.
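The three prefetch steps above can be sketched as one function. The storage system is mocked here as a dict keyed by file path, and all names are illustrative; the flow is as stated: join the first directory's path with the task's subdirectory identifier, read that subdirectory's data from storage, and place it in the task's cache space.

```python
import posixpath

# Sketch of the prefetch flow: resolve the task's subdirectory under the
# mapped first directory, fetch its files from the (mocked) storage
# system, and store them in the task's cache space.
def prefetch_task_data(storage, first_dir, task_subdir, task_cache):
    prefix = posixpath.join(first_dir, task_subdir) + "/"
    for path, data in storage.items():
        if path.startswith(prefix):
            task_cache[path] = data  # store into the task's cache space
    return task_cache

storage = {"/fs/app1/task0/a.dat": b"aa",
           "/fs/app1/task0/b.dat": b"bb",
           "/fs/app1/task1/c.dat": b"cc"}
cache = prefetch_task_data(storage, "/fs/app1", "task0", {})
```

Only the two files under `task0`'s subdirectory land in the cache; `task1`'s data is left for whichever node runs that task.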
可选地,如果缓存策略包括分级缓存策略,则预取模块还用于:根据第一任务对应的子目录下的数据的数据类型,将不同的数据存储至不同类型的存储介质中。Optionally, if the caching strategy includes a hierarchical caching strategy, the prefetch module is further configured to: store different data in different types of storage media according to the data type of the data in the subdirectory corresponding to the first task.
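The hierarchical-caching option can be illustrated with a type-to-tier routing table. The specific mapping below (`checkpoint`→SSD, `input`→DRAM, `log`→HDD) is invented for this sketch; the patent only says that different data types go to different types of storage media.

```python
# Hedged sketch of hierarchical caching: when the cache policy includes a
# tiering policy, route each prefetched item to a storage medium chosen
# by its data type. The mapping itself is an illustrative assumption.
TIER_BY_TYPE = {"checkpoint": "ssd", "input": "dram", "log": "hdd"}

def place_by_type(files):
    """files: list of (name, data_type). Returns dict medium -> [names]."""
    tiers = {}
    for name, data_type in files:
        medium = TIER_BY_TYPE.get(data_type, "ssd")  # default tier
        tiers.setdefault(medium, []).append(name)
    return tiers

tiers = place_by_type([("ckpt.bin", "checkpoint"), ("in.csv", "input"),
                       ("run.log", "log")])
```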
可选地,访问模块604主要用于:Optionally, the access module 604 is mainly used for:
获取IO请求;Get IO request;
如果IO请求所访问的数据为映射目录信息所指示的第一目录下的数据,则根据IO请求和缓存策略,访问第一应用的缓存资源。If the data accessed by the IO request is the data under the first directory indicated by the mapping directory information, the cache resource of the first application is accessed according to the IO request and the cache policy.
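The routing decision above comes down to a path-prefix check, sketched below under assumed names: an IO request is served from the application's cache resource only when its path falls under the first directory named by the mapping directory information; otherwise it goes straight to the storage system.

```python
# Illustrative access-module routing: cache for paths under the mapped
# first directory, storage system for everything else. Both stores are
# mocked as dicts for this sketch.
def route_io(request_path, mapped_dir, cache, storage):
    if request_path.startswith(mapped_dir.rstrip("/") + "/"):
        # Data under the first directory: serve from the application's
        # cache resource, falling back to storage on a miss.
        return cache.get(request_path, storage.get(request_path))
    return storage.get(request_path)

cache = {"/fs/app1/x.dat": b"cached"}
storage = {"/fs/app1/x.dat": b"stale", "/tmp/y.dat": b"other"}
hit = route_io("/fs/app1/x.dat", "/fs/app1", cache, storage)
miss = route_io("/tmp/y.dat", "/fs/app1", cache, storage)
```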
可选地,如果缓存策略包括数据一致性策略,则访问模块主要用于:Optionally, if the cache policy includes a data consistency policy, the access module is mainly used for:
当对第一应用的缓存资源中的数据进行访问时,对所访问的数据执行加锁操作。When accessing data in the cache resource of the first application, a locking operation is performed on the accessed data.
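A minimal sketch of the data-consistency option follows: every access to data in the cache resource first takes a lock on the accessed item. `threading.Lock` and the per-key locking granularity are stand-ins for whatever lock mechanism the real implementation uses.

```python
import threading
from collections import defaultdict

# Sketch of the data-consistency policy: lock the accessed data on every
# read and write of the application's cache resource.
class ConsistentCache:
    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)  # one lock per key

    def write(self, key, value):
        with self._locks[key]:  # locking operation on the accessed data
            self._data[key] = value

    def read(self, key):
        with self._locks[key]:
            return self._data.get(key)

c = ConsistentCache()
c.write("k", b"v")
```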
可选地，该装置600还用于：当检测到自身为第一应用分配的缓存资源中的数据量达到参考阈值时，向管理节点发送第一应用的带宽需求，带宽需求用于指示将计算节点上第一应用的缓存资源中的待迁移数据迁移至存储系统所需的带宽；接收管理节点为第一应用的缓存资源中的待迁移数据分配的IO带宽，根据分配的IO带宽，将第一应用的缓存资源中的待迁移数据迁移至存储系统。Optionally, the apparatus 600 is further configured to: when detecting that the amount of data in the cache resource that it has allocated for the first application reaches a reference threshold, send the bandwidth requirement of the first application to the management node, where the bandwidth requirement indicates the bandwidth needed to migrate the to-be-migrated data in the first application's cache resource on the computing node to the storage system; receive the IO bandwidth that the management node allocates for the to-be-migrated data in the first application's cache resource; and migrate the to-be-migrated data in the first application's cache resource to the storage system according to the allocated IO bandwidth.
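This write-back path can be sketched as one migration round. The "one second's worth per round" accounting and the callback names are invented simplifications; the flow matches the text: on reaching a reference threshold, report the bandwidth demand, receive a grant, and drain cached data to storage at the granted rate.

```python
# Sketch of threshold-triggered migration from the application's cache
# resource to the storage system, at a rate granted by the management node.
REFERENCE_THRESHOLD_MB = 100

def maybe_migrate(cached_mb, request_grant, storage_write):
    """Runs one migration round; returns MB left in the cache afterwards."""
    if cached_mb < REFERENCE_THRESHOLD_MB:
        return cached_mb                     # below threshold: do nothing
    granted_mbps = request_grant(cached_mb)  # ask the management node
    to_move = min(cached_mb, granted_mbps)   # one second's worth per round
    storage_write(to_move)                   # drain to the storage system
    return cached_mb - to_move

moved = []
left = maybe_migrate(150, lambda need: 60, moved.append)
```

With 150 MB cached against a 100 MB threshold and a 60 MB/s grant, one round moves 60 MB and leaves 90 MB in the cache.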
根据本发明本申请实施例的装置600可对应于执行本发明本申请实施例中描述的方法，并且装置600中的各个单元的上述和其它操作和/或功能分别为了实现图3中的各个方法中对应节点所执行的相应流程，为了简洁，在此不再赘述。The apparatus 600 according to this embodiment of the present application may correspond to performing the methods described in the embodiments of the present application, and the foregoing and other operations and/or functions of the units in the apparatus 600 are respectively intended to implement the corresponding procedures performed by the corresponding nodes in the methods in FIG. 3. For brevity, details are not repeated here.
在本申请实施例中，计算节点能够根据用户指定的第一应用的缓存策略为第一应用分配对应的缓存资源，从而使得第一应用对计算节点的资源使用情况符合用户要求，进而使得第一应用的应用性能能够满足用户需求。在此基础上，在运行第一应用的过程中，可以直接在该计算节点为该第一应用分配的缓存资源中进行数据访问，减少了对存储系统的访问，从而减少了计算节点之间的竞争。并且，在为第一应用分配缓存资源之后，计算节点可以根据映射目录信息将存储系统中第一目录下的第一应用的应用数据预取至第一缓存空间中，无需用户进行人工数据拷贝，降低了操作复杂度。In this embodiment of the application, the computing node can allocate corresponding cache resources for the first application according to the user-specified cache policy of the first application, so that the first application's usage of the computing node's resources meets the user's requirements, and in turn the application performance of the first application can meet the user's needs. On this basis, while the first application is running, data can be accessed directly in the cache resources that the computing node has allocated for the first application, which reduces accesses to the storage system and thereby reduces contention between computing nodes. Moreover, after allocating cache resources for the first application, the computing node can prefetch the first application's application data under the first directory in the storage system into the first cache space according to the mapping directory information, without requiring the user to copy data manually, which reduces operational complexity.
本申请还提供一种数据访问系统，该系统包括管理节点和计算节点，其中，管理节点和计算节点之间的连接方式可以参考图1所示的系统中管理节点和计算节点之间的连接方式，管理节点和计算节点的结构可以参考图2所示的计算机设备的结构。在该数据访问系统中，管理节点用于实现图3所示的数据访问方法中的管理节点的功能，计算节点用于实现图3所示的数据访问方法中的计算节点的功能，本申请实施例在此不再赘述。This application further provides a data access system that includes a management node and computing nodes. For the manner in which the management node and the computing nodes are connected, refer to the connections between the management node and the computing nodes in the system shown in FIG. 1; for the structures of the management node and the computing nodes, refer to the structure of the computer device shown in FIG. 2. In this data access system, the management node is configured to implement the functions of the management node in the data access method shown in FIG. 3, and the computing node is configured to implement the functions of the computing node in that method; details are not repeated here in this embodiment of the application.
需要说明的是：上述实施例提供的数据访问装置在进行数据读写时，仅以上述各功能模块的划分进行举例说明，实际应用中，可以根据需要而将上述功能分配由不同的功能模块完成，即将设备的内部结构划分成不同的功能模块，以完成以上描述的全部或者部分功能。另外，上述实施例提供的数据访问装置与数据访问方法实施例属于同一构思，其具体实现过程详见方法实施例，这里不再赘述。It should be noted that when the data access apparatus provided in the foregoing embodiments reads and writes data, the division into the foregoing functional modules is used only as an example. In practical applications, the foregoing functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or some of the functions described above. In addition, the data access apparatus provided in the foregoing embodiments and the data access method embodiments belong to the same concept; for the specific implementation process, see the method embodiments, and details are not repeated here.
在上述实施例中，可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时，可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时，全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线（例如：同轴电缆、光纤、数据用户线（Digital Subscriber Line，DSL））或无线（例如：红外、无线、微波等）方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质（例如：软盘、硬盘、磁带）、光介质（例如：数字通用光盘（Digital Versatile Disc，DVD））、或者半导体介质（例如：固态硬盘（Solid State Disk，SSD））等。In the foregoing embodiments, implementation may be entirely or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be entirely or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), or a semiconductor medium (for example, a solid-state disk (SSD)), or the like.
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成，也可以通过程序来指令相关的硬件完成，所述的程序可以存储于一种计算机可读存储介质中，上述提到的存储介质可以是只读存储器，磁盘或光盘等。A person of ordinary skill in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
以上所述并不用以限制本申请实施例,凡在本申请实施例的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请实施例的保护范围之内。The above description is not intended to limit the embodiments of the present application, and any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the embodiments of the present application shall be included within the scope of protection of the embodiments of the present application.

Claims (13)

  1. 一种数据访问方法,其特征在于,所述方法包括:A data access method, characterized in that the method comprises:
    接收用户提交的针对第一应用的缓存配置请求，所述缓存配置请求包括缓存策略和映射目录信息，所述缓存策略用于指示所述第一应用的缓存需求，所述映射目录信息为存储系统中存储的所述第一应用的应用数据所在的第一目录的信息；Receiving a cache configuration request, submitted by a user, for a first application, where the cache configuration request includes a cache policy and mapping directory information, the cache policy indicates the caching requirements of the first application, and the mapping directory information is information about a first directory, in a storage system, in which the application data of the first application is stored;
    根据所述缓存策略,为所述第一应用调度缓存资源;Scheduling cache resources for the first application according to the cache policy;
    根据所述映射目录信息,将所述第一应用的应用数据预取至所述第一应用的缓存资源中;Prefetching application data of the first application into cache resources of the first application according to the mapping directory information;
    在运行所述第一应用的过程中,根据所述缓存策略,访问所述第一应用的缓存资源。During the running of the first application, according to the cache policy, the cache resources of the first application are accessed.
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述缓存策略,为所述第一应用调度缓存资源,包括:The method according to claim 1, wherein the scheduling cache resources for the first application according to the cache policy comprises:
    根据所述缓存策略确定所述第一应用的多个任务中每个任务的资源需求信息,所述资源需求信息包括每个任务所需的缓存空间的大小和包括的存储介质的类型;determining resource requirement information of each of the multiple tasks of the first application according to the cache policy, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included;
    根据每个任务的资源需求信息,为每个任务分配缓存空间。According to the resource requirement information of each task, cache space is allocated for each task.
  3. 根据权利要求1所述的方法，其特征在于，所述映射目录信息包括所述第一目录的目录路径，所述根据所述映射目录信息，将所述第一应用的应用数据预取至所述第一应用的缓存资源中，包括：The method according to claim 1, wherein the mapping directory information includes a directory path of the first directory, and the prefetching, according to the mapping directory information, the application data of the first application into the cache resources of the first application includes:
    确定所述第一应用的多个任务中每个任务对应的子目录的目录标识;determining a directory identifier of a subdirectory corresponding to each of the multiple tasks of the first application;
    根据所述第一目录的目录路径和每个任务对应的子目录的目录标识,从所述存储系统中获取所述第一目录下存储的每个任务对应的子目录下的数据;According to the directory path of the first directory and the directory identifier of the subdirectory corresponding to each task, obtain the data under the subdirectory corresponding to each task stored in the first directory from the storage system;
    将每个任务对应的子目录下的数据存储至所述第一应用的缓存资源中。The data under the subdirectory corresponding to each task is stored in the cache resource of the first application.
  4. 根据权利要求2或3所述的方法,其特征在于,所述根据所述缓存策略,访问所述第一应用的缓存资源,包括:The method according to claim 2 or 3, wherein the accessing the cache resource of the first application according to the cache policy comprises:
    当所述缓存策略包括分级缓存策略时，根据所述分级缓存策略，将所述第一应用的每个任务的不同类型的任务数据缓存至所述第一应用的缓存资源中对应类型的存储介质中；When the cache policy includes a hierarchical caching policy, caching, according to the hierarchical caching policy, different types of task data of each task of the first application into storage media of corresponding types in the cache resources of the first application;
    当所述缓存策略包括数据一致性策略时,在访问任一缓存空间中的数据时,对所访问的任务数据执行加锁操作。When the cache policy includes a data consistency policy, when data in any cache space is accessed, a lock operation is performed on the accessed task data.
  5. 根据权利要求1-4任一所述的方法,其特征在于,所述根据所述缓存策略,访问所述第一应用的缓存资源之前,还包括:The method according to any one of claims 1-4, wherein before accessing the cache resources of the first application according to the cache policy, further comprising:
    获取输入输出IO请求;Obtain input and output IO requests;
    如果所述IO请求所访问的数据为所述映射目录信息所指示的所述第一目录下的数据,则执行所述根据所述缓存策略,访问所述第一应用的缓存资源的步骤。If the data accessed by the IO request is the data under the first directory indicated by the mapping directory information, the step of accessing the cache resource of the first application according to the cache policy is performed.
  6. 根据权利要求1-5任一所述的方法,所述方法还包括:The method according to any one of claims 1-5, further comprising:
    获取多个应用中每个应用的缓存资源中待迁移至所述存储系统的数据的带宽需求,所述多个应用包括所述第一应用;Acquiring bandwidth requirements of data to be migrated to the storage system in cache resources of each of multiple applications, where the multiple applications include the first application;
    根据所述带宽需求,为所述第一应用的缓存资源中的待迁移数据分配IO带宽;Allocating IO bandwidth for the data to be migrated in the cache resource of the first application according to the bandwidth requirement;
    根据所述IO带宽,将所述第一应用的缓存资源中的待迁移数据存储至所述存储系统中。According to the IO bandwidth, the data to be migrated in the cache resource of the first application is stored in the storage system.
  7. 一种数据访问装置,其特征在于,所述装置包括:A data access device, characterized in that the device comprises:
    接收模块，用于接收用户提交的针对第一应用的缓存配置请求，所述缓存配置请求包括缓存策略和映射目录信息，所述缓存策略用于指示所述第一应用的缓存需求，所述映射目录信息为存储系统中存储的所述第一应用的应用数据所在的第一目录的信息；A receiving module, configured to receive a cache configuration request, submitted by a user, for a first application, where the cache configuration request includes a cache policy and mapping directory information, the cache policy indicates the caching requirements of the first application, and the mapping directory information is information about a first directory, in a storage system, in which the application data of the first application is stored;
    调度模块,用于根据所述缓存策略,为所述第一应用调度缓存资源;A scheduling module, configured to schedule cache resources for the first application according to the cache policy;
    预取模块,用于根据所述映射目录信息,将所述第一应用的应用数据预取至所述第一应用的缓存资源中;A prefetching module, configured to prefetch the application data of the first application into the cache resource of the first application according to the mapping directory information;
    访问模块,用于在运行所述第一应用的过程中,根据所述缓存策略,访问所述第一应用的缓存资源。An access module, configured to access cache resources of the first application according to the cache policy during running of the first application.
  8. 根据权利要求7所述的装置,其特征在于,所述调度模块主要用于:The device according to claim 7, wherein the scheduling module is mainly used for:
    根据所述缓存策略确定所述第一应用的多个任务中每个任务的资源需求信息,所述资源需求信息包括每个任务所需的缓存空间的大小和包括的存储介质的类型;determining resource requirement information of each of the multiple tasks of the first application according to the cache policy, where the resource requirement information includes the size of the cache space required by each task and the type of storage medium included;
    根据每个任务的资源需求信息,为每个任务分配缓存空间。According to the resource requirement information of each task, cache space is allocated for each task.
  9. 根据权利要求7所述的装置,其特征在于,所述映射目录信息包括所述第一目录的目录路径,所述预取模块主要用于:The device according to claim 7, wherein the mapping directory information includes a directory path of the first directory, and the prefetching module is mainly used for:
    确定所述第一应用的多个任务中每个任务对应的子目录的目录标识;determining a directory identifier of a subdirectory corresponding to each of the multiple tasks of the first application;
    根据所述第一目录的目录路径和每个任务对应的子目录的目录标识,从所述存储系统中获取所述第一目录下存储的每个任务对应的子目录下的数据;According to the directory path of the first directory and the directory identifier of the subdirectory corresponding to each task, obtain the data under the subdirectory corresponding to each task stored in the first directory from the storage system;
    将每个任务对应的子目录下的数据存储至所述第一应用的缓存资源中。The data under the subdirectory corresponding to each task is stored in the cache resource of the first application.
  10. 根据权利要求8或9所述的装置,其特征在于,所述访问模块主要用于:The device according to claim 8 or 9, wherein the access module is mainly used for:
    当所述缓存策略包括分级缓存策略时，根据所述分级缓存策略，将所述第一应用的每个任务的不同类型的任务数据缓存至所述第一应用的缓存资源中对应类型的存储介质中；When the cache policy includes a hierarchical caching policy, caching, according to the hierarchical caching policy, different types of task data of each task of the first application into storage media of corresponding types in the cache resources of the first application;
    当所述缓存策略包括数据一致性策略时,在访问任一缓存空间中的任务数据时,对所访问的任务数据执行加锁操作。When the cache policy includes a data consistency policy, when accessing task data in any cache space, a locking operation is performed on the accessed task data.
  11. 根据权利要求7-10任一所述的装置,其特征在于,所述装置还用于:The device according to any one of claims 7-10, wherein the device is also used for:
    获取输入输出IO请求;Obtain input and output IO requests;
    如果所述IO请求所访问的数据为所述映射目录信息所指示的所述第一目录下的数据,则执行所述根据所述缓存策略,访问所述第一应用的缓存资源的步骤。If the data accessed by the IO request is the data under the first directory indicated by the mapping directory information, the step of accessing the cache resource of the first application according to the cache policy is performed.
  12. 根据权利要求7-11任一所述的装置,所述装置还用于:The device according to any one of claims 7-11, said device is also used for:
    获取多个应用中每个应用的缓存资源中待迁移至所述存储系统的数据的带宽需求,所述多个应用包括所述第一应用;Acquiring bandwidth requirements of data to be migrated to the storage system in cache resources of each of multiple applications, where the multiple applications include the first application;
    根据所述带宽需求,为所述第一应用的缓存资源中的待迁移数据分配IO带宽;Allocating IO bandwidth for the data to be migrated in the cache resource of the first application according to the bandwidth requirement;
    根据所述IO带宽,将所述第一应用的缓存资源中的待迁移数据存储至所述存储系统中。According to the IO bandwidth, the data to be migrated in the cache resource of the first application is stored in the storage system.
  13. 一种计算机可读存储介质，其特征在于，所述存储介质存储有指令，当所述指令在计算机上执行时，使得所述计算机执行上述权利要求1-6任一项所述的数据访问方法。A computer-readable storage medium, wherein the storage medium stores instructions that, when executed on a computer, cause the computer to perform the data access method according to any one of claims 1-6.
PCT/CN2022/095010 2021-08-31 2022-05-25 Data access method and device, and storage medium WO2023029610A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111010014.8A CN115729438A (en) 2021-08-31 2021-08-31 Data access method, device and storage medium
CN202111010014.8 2021-08-31

Publications (1)

Publication Number Publication Date
WO2023029610A1 true WO2023029610A1 (en) 2023-03-09

Family

ID=85291204

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/095010 WO2023029610A1 (en) 2021-08-31 2022-05-25 Data access method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN115729438A (en)
WO (1) WO2023029610A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118034613A (en) * 2024-04-11 2024-05-14 深圳市铨兴科技有限公司 Intelligent scheduling method, system and memory for storage space data
CN118426705A (en) * 2024-07-03 2024-08-02 深圳星云智联科技有限公司 Access scheduling method, computer equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6185659B1 (en) * 1999-03-23 2001-02-06 Storage Technology Corporation Adapting resource use to improve performance in a caching memory system
US20180324108A1 (en) * 2017-05-03 2018-11-08 International Business Machines Corporation QUALITY OF SERVICE (QoS) STORED PROCEDURES
US20210096996A1 (en) * 2019-10-01 2021-04-01 Microsoft Technology Licensing, Llc Cache and i/o management for analytics over disaggregated stores
CN113127380A (en) * 2019-12-31 2021-07-16 华为技术有限公司 Method for deploying instances, instance management node, computing node and computing equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6185659B1 (en) * 1999-03-23 2001-02-06 Storage Technology Corporation Adapting resource use to improve performance in a caching memory system
US20180324108A1 (en) * 2017-05-03 2018-11-08 International Business Machines Corporation QUALITY OF SERVICE (QoS) STORED PROCEDURES
US20210096996A1 (en) * 2019-10-01 2021-04-01 Microsoft Technology Licensing, Llc Cache and i/o management for analytics over disaggregated stores
CN113127380A (en) * 2019-12-31 2021-07-16 华为技术有限公司 Method for deploying instances, instance management node, computing node and computing equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118034613A (en) * 2024-04-11 2024-05-14 深圳市铨兴科技有限公司 Intelligent scheduling method, system and memory for storage space data
CN118034613B (en) * 2024-04-11 2024-06-11 深圳市铨兴科技有限公司 Intelligent scheduling method, system and memory for storage space data
CN118426705A (en) * 2024-07-03 2024-08-02 深圳星云智联科技有限公司 Access scheduling method, computer equipment and medium

Also Published As

Publication number Publication date
CN115729438A (en) 2023-03-03

Similar Documents

Publication Publication Date Title
US11086725B2 (en) Orchestration of heterogeneous multi-role applications
US10579364B2 (en) Upgrading bundled applications in a distributed computing system
US10303646B2 (en) Memory sharing for working data using RDMA
US11175832B2 (en) Thread groups for pluggable database connection consolidation in NUMA environment
US11099937B2 (en) Implementing clone snapshots in a distributed storage system
US20190213085A1 (en) Implementing Fault Domain And Latency Requirements In A Virtualized Distributed Storage System
WO2023029610A1 (en) Data access method and device, and storage medium
WO2019237791A1 (en) Virtualized cache implementation method and physical machine
US10241550B2 (en) Affinity aware parallel zeroing of memory in non-uniform memory access (NUMA) servers
JP2019057155A (en) Memory system and control method
US10235047B2 (en) Memory management method, apparatus, and system
CN114860163B (en) Storage system, memory management method and management node
US10091126B2 (en) Cloud system, control method thereof, management server and control method thereof
CN110196681B (en) Disk data write-in control method and device for business write operation and electronic equipment
JP2019057151A (en) Memory system and control method
US12038879B2 (en) Read and write access to data replicas stored in multiple data centers
WO2021258881A1 (en) Data management method and system for application, and computer device
US20210149703A1 (en) Numa-aware resource allocation and placement of database containers
CN108667744A (en) Flow control methods and device
CN109582649A (en) A kind of metadata storing method, device, equipment and readable storage medium storing program for executing
CN112214162A (en) Storage device and control method
WO2021120843A1 (en) Cloud host memory allocation method, cloud host, device, and storage medium
EP4239462A1 (en) Systems and methods for heterogeneous storage systems
US20220318042A1 (en) Distributed memory block device storage
US11748203B2 (en) Multi-role application orchestration in a distributed storage system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862752

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22862752

Country of ref document: EP

Kind code of ref document: A1