CN117435353A - Comprehensive optimization method for high-frequency checkpoint operation - Google Patents

Comprehensive optimization method for high-frequency checkpoint operation

Info

Publication number
CN117435353A
Authority
CN
China
Prior art keywords
registration
memory
access
host
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311757384.7A
Other languages
Chinese (zh)
Other versions
CN117435353B (en)
Inventor
苏毅
陈洁
张博平
刘雨蒙
赵怡婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Remote Sensing Equipment
Original Assignee
Beijing Institute of Remote Sensing Equipment
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Remote Sensing Equipment filed Critical Beijing Institute of Remote Sensing Equipment
Priority to CN202311757384.7A priority Critical patent/CN117435353B/en
Publication of CN117435353A publication Critical patent/CN117435353A/en
Application granted granted Critical
Publication of CN117435353B publication Critical patent/CN117435353B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The specification provides a comprehensive optimization method for high-frequency checkpoint operation, and relates to the technical field of data recovery. According to the initialized software and hardware environment, GPU cache increment allocation is carried out, and management and access of the device cache are gradually optimized; host memory pages are mapped to the virtual address space in advance by adopting an opportunistic access strategy and a delayed registration strategy, and the registration operation of the host memory pages is performed after all pages have been accessed; concurrency control and registration contention are optimized through a separated producer-consumer policy and a host buffer serialized registration policy. The method solves the problems of high initialization overhead and poor performance in concurrent tasks that existing optimization methods exhibit in a high-performance computing environment. Through reasonable cache optimization and concurrency control optimization, the method provides a unique and effective solution to the high-frequency checkpoint operation problem in high-performance computing environments and remarkably improves the performance and efficiency of computing tasks.

Description

Comprehensive optimization method for high-frequency checkpoint operation
Technical Field
The document relates to the technical field of data recovery, in particular to a comprehensive optimization method for high-frequency checkpoint operation.
Background
In a high performance computing environment, short-term computing tasks require frequent checkpointing to ensure data reliability. Although the running time of these tasks is very short, typically only a few seconds or minutes, the results produced are very important and irreplaceable. Such tasks are common in fields such as scientific simulation, data analysis, and machine learning training. Due to the uncertainty of the computing process and the high degree of parallelism of the computing nodes, these tasks are subject to interruption risks such as hardware failures, resource contention, and network interruptions. To ensure that completed computing states and intermediate results can be restored even if an interruption occurs, these short-term computing tasks require frequent checkpointing, saving the current state and data to persistent storage so that computation can resume from the most recent checkpoint after the interruption.
However, high frequency checkpointing also presents new challenges, including computing and storage overhead, and the impact of checkpointing on computing performance. Current solutions include:
(1) One-time allocation and fixing of host memory pages:
This approach may result in expensive initialization costs, especially in the concurrent case. In addition, fixed (pinned) host memory pages may reduce concurrency performance, because the pinned pages may be contended for between tasks.
(2) Direct access to host memory:
although initialization overhead may be reduced, performance may be limited due to underutilization of the advantages of GPU cache. Furthermore, concurrency performance between tasks may also be affected.
(3) Virtual memory management:
in a high performance computing environment, conventional virtual memory management methods may not fully exploit the capabilities of hardware devices, particularly when large-scale data transfer and high concurrent access are involved.
Therefore, an optimization scheme for high-frequency checkpointing is needed to solve the problems of high initialization overhead and poor performance of the existing optimization method in concurrent tasks in a high-performance computing environment.
Disclosure of Invention
The specification provides a comprehensive optimization method oriented to high-frequency check point operation, which solves the problems of high initialization overhead and poor performance of the existing optimization method in concurrent tasks in a high-performance computing environment.
In a first aspect, the present disclosure provides a comprehensive optimization method for high-frequency checkpointing, comprising three parts: GPU cache increment allocation, opportunistic access and delayed registration of host memory, and optimized concurrency control. The method specifically includes the steps of:
according to the initialized software and hardware environment, GPU cache increment allocation is carried out, and management and access of equipment caches are gradually optimized;
mapping the host memory page to a virtual address space in advance by adopting an opportunistic access strategy and a delayed registration strategy, and performing registration operation of the host memory page after all pages are accessed;
concurrency control and registration contention are optimized by separate producer-consumer policies, host buffer serialization registration policies.
The beneficial effects of the invention are as follows:
The method integrates GPU cache optimization and host memory optimization strategies, and the method for gradually and incrementally distributing the GPU cache and opportunistically accessing the host memory is adopted, so that the problems of high initialization cost, limited concurrency performance and the like in the prior art are effectively solved. This comprehensive cache optimization strategy brings a completely new solution for high frequency checkpointing. Secondly, the device cache initialization and management strategy of the method is innovative. By initializing the equipment cache in a reserved and mapped mode, the performance overhead and the competitive influence caused by the traditional one-time allocation and the fixed host memory page are avoided. The strategy fully utilizes GPU hardware resources while ensuring performance, and realizes efficient calculation and storage performance. In addition, the method introduces opportunistic access and delay registration strategies of the host memory, and the host memory is mapped in advance through the operation of writing one byte, so that the delay after the system is started is reduced. Meanwhile, by delaying registration, the competition of registration of a host buffer area is effectively reduced, and the performance is further optimized. Meanwhile, the method avoids competition between opportunistic access and check point refreshing by introducing a separated producer-consumer strategy, thereby improving the stability and performance of the system. In addition, the serialized host buffer registration policy further reduces registration contention to better leverage hardware resources. In summary, the method provides a unique and effective solution to the high-frequency checkpointing problem in a high-performance computing environment through reasonable cache optimization, concurrency control optimization and other means, and remarkably improves the performance and efficiency of computing tasks.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic diagram of a comprehensive optimization method for high frequency checkpointing according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a comprehensive optimization method flow for high-frequency checkpointing according to an embodiment of the present disclosure.
Detailed Description
For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
First embodiment
The embodiment provides a comprehensive optimization method for high-frequency checkpointing operation, which is shown in fig. 1, and the workflow of the method is shown in fig. 2;
it should be noted that, the high-frequency checkpointing mainly occurs in the computing process requiring frequent iteration, and because the task runs briefly, the interrupt factor may cause the completed computing to be partially lost, so that the checkpointing becomes a key step for ensuring computing restorability.
Specifically, the method comprises three main parts: GPU cache increment allocation, host memory opportunistic access and delayed registration, and optimized concurrency control;
first, before performing step S1, the method further comprises: the software and hardware initialization step may be implemented as follows:
s01, software and hardware environment preparation
Before starting the initialization, it is necessary to prepare an appropriate software and hardware environment to ensure efficient implementation of the scheme. Specifically, the following requirements need to be met:
CUDA environment configuration: ensuring that the CUDA Toolkit is installed in the system so that the CUDA API function can be invoked. In addition, GPU hardware that supports CUDA is also required.
VELOC runtime Environment: ensuring that the VELOC runtime environment has been deployed and properly configured. This includes the VELOC library and related dependent items.
The multi-core processor: the system is ensured to have multiple cores available to generate sub-threads and to implement concurrent operations.
S02, generating a child thread
During initialization, two sub-threads are introduced for initialization of the device buffer and the host buffer, respectively, and subsequent asynchronous transfer operations. Therefore, the multi-core processor can be effectively utilized, concurrency is improved, and the initialization process is accelerated.
S03, initializing equipment cache
Initialization of the device cache involves two main functions: cuMemAddressReserve() and cuMemMap().
The cuMemAddressReserve() CUDA API function is used to reserve a piece of memory address space on the device, ensuring that subsequent allocations do not conflict with other memory regions.
The cuMemMap() function is used to map the reserved virtual address space stepwise onto physical HBM (High Bandwidth Memory). The cost of the mapping operation can be effectively managed through gradual mapping, and pages are mapped only when needed, so the performance problem caused by mapping a large number of pages at once is avoided.
S04, host buffer initialization
The initialization of the host buffer includes allocation of virtual memory and access of pages.
malloc (): this is a standard C library function that allocates a piece of virtual memory on the host. In this scheme, this function is used to allocate enough virtual memory for the host buffer for subsequent use.
Access to host memory: by accessing each page of the host buffer during the initialization phase, the operating system maps the physical page to a virtual address space for subsequent use.
Based on this, a basis is provided for subsequent operations by reasonable device cache and host memory initialization during the initialization phase.
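For illustration, the sketch below shows one possible shape of this initialization flow: two child threads are spawned, one initializing the device cache and one initializing the host buffer. The helper names init_device_cache() and init_host_buffer() are placeholders for the routines detailed in the following steps, not functions defined by this method.

```cpp
// Illustrative sketch of the initialization phase (S02-S04): two child threads,
// one for the device buffer and one for the host buffer. The two helpers are
// placeholders for the mapping and page-touching logic shown in later steps.
#include <thread>

static void init_device_cache() { /* cuMemAddressReserve + cuMemMap, see S1 */ }
static void init_host_buffer()  { /* malloc + per-page access, see S2 */ }

int main() {
    std::thread t_dev(init_device_cache);   // child thread for the device buffer
    std::thread t_host(init_host_buffer);   // child thread for the host buffer
    t_dev.join();
    t_host.join();
    // GPU cache increment allocation, opportunistic access, etc. follow here.
    return 0;
}
```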
S1, performing GPU cache increment allocation according to initialized software and hardware environments, and gradually optimizing management and access of equipment caches;
specifically, one specific implementation manner of step S1 may be:
s11, distributing equipment cache virtual memory: each sub-thread allocates virtual memory space for the GPU to which it belongs using a cumemadessreserve () function;
it should be noted that the cumemadd reserve () function plays an important role in the stage of device cache allocation, and by calling this function, each GPU allocates a virtual memory space for its cache, providing the necessary address space for the mapping operation. The allocated virtual address space will be used for subsequent device cache mapping, providing a basis for high-speed access of data.
S12, gradually mapping virtual cache: device cache virtual memory based on pre-allocation, through uDeviceGetAttribute ()
A function for determining an optimal block size for memory mapping; for each mapping block, allocating a physical memory page on the GPU using a cumemolloc () function; mapping the allocated physical memory pages into a virtual memory space by means of a cuMemMap () function, and establishing a mapping relation between virtual addresses and physical addresses; the size of the mapping area is gradually expanded in a sampling and repeated iteration mode, and the memory range of mapping is dynamically increased;
specific examples:
the device cache mapping stage needs to be carefully optimized to ensure that data is efficiently transferred between the device and the physical memory. The specific implementation steps are as follows:
(1) Selecting a mapping block size
By calling the cuDeviceGetAttribute () function, attribute information of the current GPU, in particular parameters related to the memory mapped optimal block size, is obtained. This step helps to select the appropriate mapping size to optimize data transmission efficiency.
It should be noted that, through the cuDeviceGetAttribute () function, attribute information of the GPU may be obtained, including an optimal block size of the memory map. Selecting an appropriate mapping block size according to hardware attributes helps to improve data transmission efficiency.
(2) Allocating physical memory pages
For each mapping block, a physical memory page is allocated on the GPU using the cuMemAlloc() function. These memory pages will be the basis for actual data storage, providing support for data transfer.
The cuMemAlloc() function allocates physical memory pages on the device, providing support for subsequent data transfer and storage operations.
(3) Mapping virtual memory
The previously allocated physical memory pages are mapped into virtual memory space by means of a cuMemMap () function. By establishing the mapping relation between the virtual address and the physical address, the GPU can directly access the memory pages, so that the access speed of data is greatly improved.
The cuMemMap () function is used to map the physical memory on the GPU into a virtual address space. Through this mapping, the GPU can directly access the physical memory, thereby achieving high-speed data access.
(4) Cyclic mapping
In order to adapt to the requirements of different calculation stages, the size of the mapping area is gradually expanded in a multi-iteration mode. This cyclic mapping strategy allows the memory range of the map to be dynamically increased in response to changes in the amount of data. By adjusting the size of the mapping at different stages, the data access requirements can be better matched, and the performance is improved.
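As a hedged illustration of steps (1)-(4), the following sketch uses the CUDA driver virtual memory management API to reserve a virtual range once and back it block by block. Two assumptions deviate slightly from the wording above: the mapping block size is queried with cuMemGetAllocationGranularity() (cuDeviceGetAttribute() does not expose a mapping block size directly), and physical blocks are obtained with cuMemCreate(), because cuMemMap() requires an allocation handle rather than a pointer returned by cuMemAlloc(). The error handling, buffer size, and single-GPU setting are simplifications for illustration.

```cpp
// Sketch (not the claimed implementation): incremental device-cache mapping
// with the CUDA driver VMM API. Assumes one GPU and a fixed number of blocks;
// the scheme described above grows the mapped range iteratively on demand.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    std::fprintf(stderr, "CUDA error %d at line %d\n", r, __LINE__); std::exit(1); } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;   CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;  CHECK(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = dev;

    // (1) Mapping block size: minimum allocation granularity of this GPU.
    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // S11: reserve the device-cache virtual address range once.
    const size_t total = 64 * gran;                      // illustrative size
    CUdeviceptr base = 0;
    CHECK(cuMemAddressReserve(&base, total, 0, 0, 0));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    // (2)-(4): back the range block by block and map it (cyclic mapping).
    for (size_t mapped = 0; mapped < total; mapped += gran) {
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, gran, &prop, 0));                  // physical block
        CHECK(cuMemMap(base + mapped, gran, 0, h, 0));           // virtual -> physical
        CHECK(cuMemSetAccess(base + mapped, gran, &access, 1));  // enable R/W access
        CHECK(cuMemRelease(h));  // the mapping keeps the allocation alive
    }
    std::printf("device cache: %llu bytes mapped at 0x%llx\n",
                (unsigned long long)total, (unsigned long long)base);
    return 0;
}
```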
S13, enabling access rights: an access right flag is set according to the GPU ID and the starting address of the mapping region, and the cuMemSetAccess() function is used to ensure that data access operations among different GPUs can be performed simultaneously without conflict.
Specific examples:
to support concurrent access of data between multiple GPUs, access permissions are enabled for the mapped region of each GPU using a cuMemSetAccess () function. By setting the access right mark according to the ID of the GPU and the starting address of the mapping area, the data access operation between different GPUs can be ensured to be carried out simultaneously without conflict, thereby improving the overall calculation performance.
The cuMemSetAccess() function plays a key role in the concurrent access of data in a multi-GPU environment. By calling this function, access rights can be set for the mapped region of each GPU, ensuring that data access operations can be performed in a concurrent manner.
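To illustrate the access-rights step in a multi-GPU setting, the sketch below builds one access descriptor per GPU and applies them to an already-mapped region with a single cuMemSetAccess() call. The base and size parameters are assumed to come from the mapping step above, and peer access between the devices is assumed to be supported; this is an illustrative fragment, not the claimed implementation.

```cpp
// Sketch: enable concurrent read/write access to one mapped region for every GPU.
// `base` and `size` must describe a range already reserved and mapped.
#include <cuda.h>
#include <vector>

static CUresult enable_access_for_all_gpus(CUdeviceptr base, size_t size) {
    int count = 0;
    CUresult r = cuDeviceGetCount(&count);
    if (r != CUDA_SUCCESS) return r;

    std::vector<CUmemAccessDesc> desc(count);
    for (int i = 0; i < count; ++i) {
        desc[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        desc[i].location.id   = i;                                // GPU ID
        desc[i].flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    }
    // One call sets the access flags of the whole mapped region for every GPU.
    return cuMemSetAccess(base, size, desc.data(), desc.size());
}
```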
Based on the method, in the GPU cache increment allocation stage, the management and access of the equipment cache are optimized through the steps of selecting the size of a mapping block, allocating physical memory pages, mapping virtual memory and the like. Meanwhile, by enabling the access rights, data concurrent access among multiple GPUs is realized.
S2, mapping the host memory page to a virtual address space in advance by adopting an opportunistic access strategy and a delayed registration strategy, and performing registration operation of the host memory page after all pages are accessed;
in order to allocate physical memory in advance and reduce registration overhead in the system initialization stage, the embodiment introduces an opportunistic access policy of the host memory. Through the strategy, the memory of the host can be effectively managed, and the system performance and the resource utilization rate are improved.
It should be noted that accessing a host memory page causes the corresponding physical memory page to be reserved and mapped. However, the registration overhead of fixed (pinned) memory pages is relatively small. Based on this principle, a delayed registration policy is adopted. Specifically, registration of host memory pages is deferred until all pages have been accessed. The advantage of this method is that the overhead caused by the registration operation is reduced while ensuring that the memory pages are already mapped.
Specifically, a specific implementation manner of step S2 may be:
s21, opportunistic access of pages: when the system is started, writing a byte into a host memory page which is not registered, and triggering an operating system to reserve and map a corresponding physical memory page into a virtual memory page in advance;
specific examples:
(1) Page opportunistic access
In the initialization stage, an opportunistic access policy for host memory pages is adopted. Specifically, a byte is written into each host memory page that has not yet been registered, triggering the operating system to reserve and map the corresponding physical memory page in advance. The main purpose of this step is to map the host memory to the virtual address space when the system starts, avoiding page swapping during system operation and thus reducing delay.
Opportunistic access does not involve the reading and writing of large-scale data, but rather merely the operation of writing one byte. This is because the goal is to trigger page mapping, not the processing of the actual data. Thus, by writing one byte, the operating system can be made aware that the pages need to be reserved and mapped in preparation for subsequent accesses.
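A minimal sketch of this opportunistic access follows: the host buffer is allocated with malloc() and one byte is written into every page so that the operating system reserves and maps the backing physical pages up front. Querying the page size with sysconf() is an assumption of the sketch, not part of the description above.

```cpp
// Sketch: opportunistic access of host memory pages. Writing one byte per page
// forces the OS to back each virtual page with physical memory, so page faults
// are avoided later during checkpoint traffic.
#include <cstdlib>
#include <unistd.h>

static char *init_host_buffer(size_t size) {
    char *buf = static_cast<char *>(std::malloc(size));  // virtual memory only
    if (!buf) return nullptr;

    long page = sysconf(_SC_PAGESIZE);                    // assumed page-size query
    if (page <= 0) page = 4096;

    for (size_t off = 0; off < size; off += static_cast<size_t>(page))
        buf[off] = 0;                                      // touch one byte per page
    buf[size - 1] = 0;                                     // ensure the last page is touched
    return buf;
}
```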
S22, delay registration: when the memory is accessed and registered, the registration operation of the host memory page is performed after all the pages are accessed.
It should be noted that the key to implementing delayed registration is the tracking and recording of page accesses. During the memory access process, the access condition of each host memory page needs to be recorded, so that registration can be performed after all pages have been accessed.
Further, one specific implementation manner of step S22 may be:
s221, access tracking: monitoring access of memory pages of a host computer each time in the memory access process;
it should be noted that this may be achieved by a hardware performance counter or inserting code when accessing a page.
S222, page record: marking each accessed page and recording in a list;
specifically, for each page accessed, it is marked as "accessed" and recorded in a list. This list will contain all host memory pages that have been accessed.
S223, delaying registration triggering: when all pages are accessed, the system triggers a delay registration operation; traversing the list of accessed pages, registering each page in the list, and fixing the physical memory mapping of the page.
Based on this, by accessing and delaying the registration phase in the host memory opportunistic access, the host memory is allocated in advance and registration overhead is reduced by page opportunistic access and delaying the registration policy.
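The sketch below shows one way the tracking and delayed registration of S221-S223 could be realized in software. It assumes the buffer size is page-aligned and that "registration" corresponds to pinning with cudaHostRegister(); the description above leaves the concrete registration call open, so this is an assumption, not the claimed implementation.

```cpp
// Sketch: software page-access tracking with delayed (deferred) registration.
// Assumptions: page-aligned buffer; registration == cudaHostRegister().
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

struct DeferredRegistrar {
    char  *buf;
    size_t size, page;
    std::vector<bool>   accessed;   // one flag per host memory page (S222)
    std::vector<size_t> order;      // list of accessed pages (S222)
    size_t remaining;

    DeferredRegistrar(char *b, size_t s, size_t p)
        : buf(b), size(s), page(p),
          accessed(s / p, false), remaining(s / p) {}

    // Called on every host-buffer access (S221: access tracking).
    void on_access(size_t offset) {
        size_t idx = offset / page;
        if (accessed[idx]) return;
        accessed[idx] = true;
        order.push_back(idx);
        if (--remaining == 0) {
            // S223: all pages accessed -> traverse the list and register each page,
            // fixing (pinning) its physical memory mapping.
            for (size_t i : order)
                cudaHostRegister(buf + i * page, page, cudaHostRegisterDefault);
        }
    }
};
```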
S3, optimizing concurrency control and registration competition through separated producer-consumer strategies and host buffer zone serialization registration strategies.
It should be noted that, in order to optimize the competition between opportunistic access and checkpointing refresh, the present embodiment proposes a separate producer-consumer strategy. By efficiently allocating resources and managing memory accesses, concurrency contention can be minimized, thereby optimizing the alternation between refresh and opportunistic accesses.
In order to effectively solve the registration competition problem of the host buffer, the embodiment introduces a serialized host buffer registration policy. The strategy effectively reduces performance degradation caused by competition and reduces overall overhead.
Specifically, a specific implementation manner of step S3 may be:
s31, alternately executing a check point refreshing task and executing opportunistic access according to the separated producer-consumer policy;
wherein the producer-consumer policy comprises: a producer-consumer model; wherein,
the producer-consumer model refers to a concurrent programming model in which tasks are divided into two roles: a producer and a consumer, wherein the producer refers to a task that performs opportunistic access and the consumer refers to a task that performs checkpoint refresh.
By definitely dividing the two roles of producer and consumer, the competitive condition can be avoided, thereby improving the system efficiency.
Specifically, one specific implementation manner of step S31 may be:
s311, producer optimization: the producer accesses the memory pages of the host according to ascending order;
such sequential access may reduce contention between memory page accesses, thereby reducing latency. In a specific implementation, the sequence of page access can be adjusted, so that access of adjacent pages cannot jump greatly, and the locality of memory access is improved.
S312, consumer optimization: based on a batch refreshing strategy and an asynchronous refreshing strategy, a consumer refreshes the accessed page to a storage layer so as to keep the consistency of data;
further, the specific implementation manner of step S312 may be:
(1) Batch refreshing: combining the refresh operations of the plurality of pages into a batch refresh operation;
this can reduce the number of refresh operations, thereby reducing the overhead incurred by refresh.
(2) Asynchronous refresh: the refresh operation is asynchronously performed in parallel with other tasks.
By asynchronous refreshing, the bandwidth of the storage layer can be fully utilized, and the concurrency of the system is improved.
Based on this, by employing separate producer-consumer policies, separation of opportunistic access and checkpoint refresh is achieved, thereby reducing concurrency competition. The isolation between producer and consumer can effectively avoid competitive conditions and improve the stability and performance of the system.
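A condensed sketch of the separated producer-consumer strategy is given below: the producer visits host pages in ascending order and enqueues their indices, while the consumer drains the queue in batches and refreshes them asynchronously on its own thread. The queue, the batch size of 64 pages, and the flush_batch() placeholder (standing in for the actual checkpoint refresh, e.g. a VELOC flush) are illustrative choices, not mandated by the description.

```cpp
// Sketch: separated producer-consumer strategy with batched, asynchronous refresh.
// flush_batch() stands in for the real checkpoint-refresh call; the batch size
// of 64 pages is an arbitrary illustrative choice.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

static void flush_batch(const std::vector<size_t> & /*pages*/) { /* write to storage */ }

int main() {
    const size_t kPages = 1 << 16, kBatch = 64;
    std::deque<size_t> queue;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Producer: opportunistic access, pages visited in ascending order (S311).
    std::thread producer([&] {
        for (size_t p = 0; p < kPages; ++p) {
            // ... touch page p here ...
            std::lock_guard<std::mutex> lk(m);
            queue.push_back(p);
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    // Consumer: batched + asynchronous refresh of accessed pages (S312).
    std::thread consumer([&] {
        std::vector<size_t> batch;
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !queue.empty() || done; });
            while (!queue.empty() && batch.size() < kBatch) {
                batch.push_back(queue.front());
                queue.pop_front();
            }
            bool finished = done && queue.empty();
            lk.unlock();
            if (!batch.empty()) { flush_batch(batch); batch.clear(); }  // batched refresh
            if (finished) break;
        }
    });

    producer.join();
    consumer.join();
    return 0;
}
```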
S32, according to the host buffer serialization registration strategy, each GPU is allowed to register the host buffer one by one in a polling registration mode, and other GPUs continue to refresh to unregistered memory pages.
Specifically, one specific implementation of step S32 may be:
s321, registration sequence control: in the system initialization stage, a registration sequence number is allocated to each GPU, and the registration sequence of each GPU is determined;
these numbers may be assigned based on information of the GPU's physical location, device ID, etc., thereby achieving an orderly registration order.
S322, polling and registering process: polling each GPU one by one according to the registration sequence number; when a certain GPU registers a host buffer, other GPUs cannot perform registration operation and need to continuously refresh data to an unregistered host memory page;
s323, refreshing unregistered memory: after a certain GPU completes registration of the host buffer, other unregistered GPUs continue to perform refresh operations to maintain data consistency.
By polling the registration strategy, a simple and efficient contention optimization scheme is achieved. Under this strategy, each GPU has the opportunity to complete registration of the host buffer in sequence, while other GPUs maintain synchronization of data by continuous refresh operations, thereby avoiding excessive registration contention, reducing performance degradation and overhead increase.
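One possible shape of the polling registration is sketched below: each GPU worker waits until an atomic turn counter reaches its registration sequence number, registers its host buffer, and passes the turn on; while waiting, it keeps refreshing data to the still-unregistered pages. The turn counter and the register_buffer()/refresh_once() placeholders are illustrative assumptions, not functions defined by the method.

```cpp
// Sketch: serialized host-buffer registration by polling a turn counter.
// register_buffer() stands in for pinning one GPU's host buffer (e.g. via
// cudaHostRegister); refresh_once() stands in for a refresh to unregistered pages.
#include <atomic>
#include <thread>
#include <vector>

static void register_buffer(int /*gpu*/) { /* e.g. cudaHostRegister(...) */ }
static void refresh_once(int /*gpu*/)    { /* flush data to unregistered pages */ }

int main() {
    const int num_gpus = 4;                  // illustrative GPU count
    std::atomic<int> turn{0};                // registration sequence number now allowed

    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < num_gpus; ++gpu) {
        workers.emplace_back([gpu, &turn] {
            // S322/S323: while it is not this GPU's turn, keep refreshing.
            while (turn.load(std::memory_order_acquire) != gpu)
                refresh_once(gpu);
            // S321/S322: this GPU's turn -> register its host buffer.
            register_buffer(gpu);
            turn.fetch_add(1, std::memory_order_release);  // pass the turn on
        });
    }
    for (auto &w : workers) w.join();
    return 0;
}
```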
In summary, the embodiment combines the GPU cache optimization and the host memory optimization strategy, and adopts the method of gradually and incrementally allocating the GPU cache and opportunistically accessing the host memory, thereby effectively overcoming the problems of high initialization cost, limited concurrency performance and the like in the prior art. This comprehensive cache optimization strategy brings a completely new solution for high frequency checkpointing. Secondly, the device cache initialization and management strategy of the method is innovative. By initializing the equipment cache in a reserved and mapped mode, the performance overhead and the competitive influence caused by the traditional one-time allocation and the fixed host memory page are avoided. The strategy fully utilizes GPU hardware resources while ensuring performance, and realizes efficient calculation and storage performance. In addition, the method introduces opportunistic access and delay registration strategies of the host memory, and the host memory is mapped in advance through the operation of writing one byte, so that the delay after the system is started is reduced. Meanwhile, by delaying registration, the competition of registration of a host buffer area is effectively reduced, and the performance is further optimized. Meanwhile, the method avoids competition between opportunistic access and check point refreshing by introducing a separated producer-consumer strategy, thereby improving the stability and performance of the system. In addition, the serialized host buffer registration policy further reduces registration contention to better leverage hardware resources. In summary, the method provides a unique and effective solution to the high-frequency checkpointing problem in a high-performance computing environment through reasonable cache optimization, concurrency control optimization and other means, and remarkably improves the performance and efficiency of computing tasks.
Second embodiment
The embodiment provides a comprehensive optimization method for high-frequency checkpointing operation, so as to optimize checkpointing operation in a multi-GPU environment, and the flow of the method is shown in FIG. 2; the method specifically comprises the following steps:
step 1, initializing preparation
1.1 Software and hardware environment preparation
Before starting the initialization, it is necessary to prepare an appropriate software and hardware environment to ensure efficient implementation of the scheme. Specifically, the following requirements need to be met:
CUDA environment configuration: ensuring that the CUDA Toolkit is installed in the system so that the CUDA API function can be invoked. In addition, GPU hardware that supports CUDA is also required.
VELOC runtime Environment: ensuring that the VELOC runtime environment has been deployed and properly configured. This includes the VELOC library and related dependent items.
The multi-core processor: the system is ensured to have multiple cores available to generate sub-threads and to implement concurrent operations.
1.2 Generating sub-threads
During initialization, two sub-threads are introduced for initialization of the device buffer and the host buffer, respectively, and subsequent asynchronous transfer operations. Therefore, the multi-core processor can be effectively utilized, concurrency is improved, and the initialization process is accelerated.
1.3 Device cache initialization
Initialization of the device cache involves two main functions: cuMemAddressReserve() and cuMemMap().
The cuMemAddressReserve() CUDA API function is used to reserve a piece of memory address space on the device, ensuring that subsequent allocations do not conflict with other memory regions.
The cuMemMap() function is used to map the reserved virtual address space stepwise onto physical HBM (High Bandwidth Memory). The cost of the mapping operation can be effectively managed through gradual mapping, and pages are mapped only when needed, so the performance problem caused by mapping a large number of pages at once is avoided.
1.4 Host buffer initialization
The initialization of the host buffer includes allocation of virtual memory and access of pages.
malloc (): this is a standard C library function that allocates a piece of virtual memory on the host. In this scheme, it is used to allocate enough virtual memory for host buffers for later use.
Access to host memory: by accessing each page of the host buffer during the initialization phase, the operating system will map the physical page to the virtual address space for subsequent use.
Step 2 GPU cache increment Allocation
2.1 Distributing equipment buffer virtual memory
In the initialization phase, each child thread allocates virtual memory space for the GPU to which it belongs using the cuMemAddressReserve() function. This virtual address space will be used for subsequent device cache mapping, providing the basis for high-speed access of data.
This critical function plays an important role in the phase of device cache allocation. By invoking this function, each GPU allocates virtual memory space for its cache, providing the necessary address space for the mapping operation.
2.2 Gradual mapping virtual cache
The device cache mapping stage needs to be carefully optimized to ensure that data is efficiently transferred between the device and the physical memory. The specific implementation steps are as follows:
selecting a mapping block size:
by calling the cuDeviceGetAttribute () function, attribute information of the current GPU, in particular parameters related to the memory mapped optimal block size, is obtained. This step helps to select the appropriate mapping size to optimize data transmission efficiency.
Through this function, the attribute information of the GPU, including the optimal block size of the memory map, can be obtained. Selecting an appropriate mapping block size according to hardware attributes helps to improve data transmission efficiency.
Physical memory pages are allocated: for each mapping block, a physical memory page is allocated on the GPU using the cuMemAlloc() function. These memory pages will be the basis for actual data storage, providing support for data transfer.
The cuMemAlloc() function allocates physical memory pages on the device, providing support for subsequent data transfer and storage operations.
Mapping virtual memory: the previously allocated physical memory pages are mapped into virtual memory space by means of a cuMemMap () function. By establishing the mapping relation between the virtual address and the physical address, the GPU can directly access the memory pages, so that the access speed of data is greatly improved.
The cuMemMap() function is critical for mapping physical memory on the GPU into the virtual address space. Through this mapping, the GPU can directly access the physical memory, thereby achieving high-speed data access.
Cyclic mapping: in order to adapt to the requirements of different calculation stages, the size of the mapping area is gradually expanded in a multi-iteration mode. This cyclic mapping strategy allows the memory range of the map to be dynamically increased in response to changes in the amount of data. By adjusting the size of the mapping at different stages, the data access requirements can be better matched, and the performance is improved.
2.3 Enabling access rights
To support concurrent access of data between multiple GPUs, access permissions are enabled for the mapped region of each GPU using a cuMemSetAccess () function. By setting the access right mark according to the ID of the GPU and the starting address of the mapping area, the data access operation between different GPUs can be ensured to be carried out simultaneously without conflict, thereby improving the overall calculation performance.
The cuMemSetAccess() function plays a key role in the concurrent access of data in a multi-GPU environment. By calling this function, access rights can be set for the mapped region of each GPU, ensuring that data access operations can be performed in a concurrent manner.
Step 3, opportunistic access and delayed registration of host memory
In order to allocate physical memory in advance and reduce registration overhead in the system initialization stage, the scheme introduces an opportunistic access policy of the host memory. Through the strategy, the memory of the host can be effectively managed, and the system performance and the resource utilization rate are improved. The following are the detailed implementation steps of the strategy:
3.1 Page opportunistic access
In the initialization stage, an opportunistic access policy for host memory pages is adopted. Specifically, a byte is written into each host memory page that has not yet been registered, triggering the operating system to reserve and map the corresponding physical memory page in advance. The main purpose of this step is to map the host memory to the virtual address space when the system starts, avoiding page swapping during system operation and thus reducing delay.
Opportunistic access does not involve the reading and writing of large-scale data, but rather merely the operation of writing one byte. This is because the goal is to trigger page mapping, not the processing of the actual data. Thus, by writing one byte, the operating system can be made aware that the pages need to be reserved and mapped in preparation for subsequent accesses.
3.2 Delay registration
Accessing a host memory page causes the corresponding physical memory page to be reserved and mapped. However, the registration overhead of fixed (pinned) memory pages is relatively small. Based on this principle, a delayed registration policy is adopted. Specifically, registration of host memory pages is deferred until all pages have been accessed. The advantage of this method is that the overhead caused by the registration operation is reduced while ensuring that the memory pages are already mapped.
The key to achieving deferred registration is the tracking and recording of page accesses. In the memory access process, the access condition of each host memory page needs to be recorded so as to register after all pages are accessed. The following steps may be taken:
Access tracking: every access to a host memory page is monitored during the memory access process. This may be achieved by a hardware performance counter or by inserting instrumentation code when a page is accessed.
Page recording: for each page accessed, it is marked as "accessed" and recorded in a list. This list will contain all host memory pages that have been accessed.
Delay registration trigger:
when all pages have been accessed, the system triggers a deferred registration operation. At this stage, the list of accessed pages is traversed, each page therein is registered, and its physical memory map is fixed.
Step 4, optimizing concurrency control
To optimize the competition between opportunistic access and checkpoint refresh, the present scheme proposes a separate producer-consumer strategy. By efficiently allocating resources and managing memory accesses, concurrency contention can be minimized, thereby optimizing the alternation between refresh and opportunistic accesses. The following are the detailed implementation steps of the isolated producer-consumer strategy, and further optimization details:
4.1 Producer-consumer model
The producer-consumer model is a concurrent programming model in which tasks are divided into two roles: producer and consumer. In this context, the producer refers to a task that performs opportunistic access, while the consumer refers to a task that performs checkpoint refresh. By explicitly dividing these two roles, race conditions can be avoided, thereby improving system efficiency.
4.2 Isolated producer-consumer strategy
In order to achieve a separate producer-consumer strategy, the following optimization measures are employed:
Producer optimization: the producer accesses the host memory pages in ascending order. Such sequential access may reduce contention between memory page accesses, thereby reducing latency. In a specific implementation, the order of page accesses can be adjusted so that accesses to adjacent pages do not jump across large address ranges, improving the locality of memory access.
Consumer optimization: the consumer refreshes the accessed page to the storage layer to maintain data consistency. To optimize consumer performance, the following strategies may be employed:
batch refreshing: the refresh operations for multiple pages are combined into one bulk refresh operation. This may reduce the number of refresh operations, thereby reducing the overhead incurred by refresh.
Asynchronous refresh: the refresh operation is asynchronously performed in parallel with other tasks. By asynchronous refreshing, the bandwidth of the storage layer can be fully utilized, and the concurrency of the system is improved.
By employing separate producer-consumer policies, separation of opportunistic access and checkpoint refresh is achieved, thereby reducing concurrency competition. The isolation between producer and consumer can effectively avoid competitive conditions and improve the stability and performance of the system.
Step 5 host buffer serialization
In the scheme, in order to effectively solve the registration competition problem of the host buffer, a serialized host buffer registration strategy is introduced. The core idea of the strategy is to allow each GPU to register the host buffer area one by one in a polling registration mode, and other GPUs continue to refresh the unregistered memory pages, so that performance degradation caused by competition is effectively reduced, and overall cost is reduced.
5.1 Detailed implementation of a poll registration policy
The implementation steps of the strategy are as follows:
registration order control: in the system initialization phase, each GPU is assigned a registration order number to determine the order in which they are registered. These numbers may be assigned based on information of the GPU's physical location, device ID, etc., thereby achieving an orderly registration order.
Polling registration process: each GPU is polled one by one according to the registration order number. When a GPU registers with a host buffer, other GPUs are blocked and cannot register, but continue to refresh data to unregistered host memory pages.
Refreshing unregistered memory: after a certain GPU completes registration of the host buffer, other unregistered GPUs continue to perform refresh operations to maintain data consistency.
By polling the registration strategy, a simple and efficient contention optimization scheme is achieved. Under this strategy, each GPU has the opportunity to complete registration of the host buffer in sequence, while other GPUs maintain synchronization of data by continuous refresh operations, thereby avoiding excessive registration contention, reducing performance degradation and overhead increase.
In summary, the present embodiment comprehensively uses methods such as GPU cache optimization, host memory optimization, concurrency control optimization, and the like, and solves the challenges faced by high-frequency checkpointing in the prior art by gradually and incrementally allocating GPU caches and opportunistic access policies to host memory. In the specific implementation, the scheme is initialized through reasonable equipment cache and host memory in the initialization stage, and a basis is provided for subsequent operation. In the GPU cache increment allocation stage, the management and access of equipment caches are optimized through the steps of selecting the size of a mapping block, allocating physical memory pages, mapping virtual memory and the like. Meanwhile, by enabling the access rights, data concurrent access among multiple GPUs is realized. In the stage of opportunistic access and delayed registration of the host memory, the host memory is allocated in advance and registration overhead is reduced through page opportunistic access and delayed registration strategies. Finally, concurrent control and registration contention is optimized through separate producer-consumer policy and host buffer serialized registration, enabling high performance checkpointing. The technical scheme has remarkable application value in the field of high-performance calculation, and can improve the calculation performance and efficiency.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (10)

1. A comprehensive optimization method for high-frequency checkpointing operation, comprising:
according to the initialized software and hardware environment, GPU cache increment allocation is carried out, and management and access of equipment caches are gradually optimized;
mapping the host memory page to a virtual address space in advance by adopting an opportunistic access strategy and a delayed registration strategy, and performing registration operation of the host memory page after all pages are accessed;
concurrency control and registration contention are optimized by separate producer-consumer policies, host buffer serialization registration policies.
2. The method according to claim 1, wherein the performing GPU cache increment allocation according to the initialized hardware and software environment gradually optimizes management and access of the device cache, includes:
each sub-thread allocates virtual memory space for the GPU to which it belongs using the cuMemAddressReserve() function;
determining the optimal block size of the memory mapping through the cuDeviceGetAttribute() function based on the pre-allocated device cache virtual memory; for each mapping block, allocating a physical memory page on the GPU using the cuMemAlloc() function; mapping the allocated physical memory pages into a virtual memory space by means of the cuMemMap() function, and establishing a mapping relation between virtual addresses and physical addresses; the size of the mapping area is gradually expanded in a sampling and repeated iteration mode, and the mapped memory range is dynamically increased;
and setting an access right mark according to the ID of the GPU and the starting address of the mapping region, and utilizing a cuMemSetAccess () function to ensure that data access operations among different GPUs can be performed simultaneously without conflict.
3. The method of claim 1, wherein the mapping the host memory page to the virtual address space in advance using the opportunistic access policy and the deferred registration policy, and performing the registration of the host memory page after all the pages have been accessed, comprises:
when the system is started, writing a byte into a host memory page which is not registered, and triggering an operating system to reserve and map a corresponding physical memory page into a virtual memory page in advance;
when the memory is accessed and registered, the registration operation of the host memory page is performed after all the pages are accessed.
4. The method of claim 3, wherein the accessing and registering the memory after all the pages have been accessed comprises:
monitoring access of memory pages of a host computer each time in the memory access process;
marking each accessed page and recording in a list;
when all pages are accessed, the system triggers a delay registration operation; traversing the list of accessed pages, registering each page in the list, and fixing the physical memory mapping of the page.
5. The method of claim 1, wherein optimizing concurrency control and registration contention by separate producer-consumer policies, host buffer serialization registration policies, comprises:
alternately executing the checkpoint refresh task and executing the opportunistic access according to the separated producer-consumer policies;
according to the host buffer serialization registration strategy, each GPU is allowed to register the host buffer one by one in a polling registration mode, and other GPUs continue to refresh to unregistered memory pages.
6. The method of claim 5, wherein the tasks in the producer-consumer policy are divided into: a producer and a consumer, wherein the producer refers to a task that performs opportunistic access and the consumer refers to a task that performs checkpoint refresh.
7. The method of claim 6, wherein the alternating execution of checkpoint refresh tasks with execution of opportunistic accesses according to separate producer-consumer policies comprises:
the producer accesses the memory pages of the host according to ascending order;
based on the batch refreshing strategy and the asynchronous refreshing strategy, the consumer refreshes the accessed pages to the storage layer so as to maintain the consistency of the data.
8. The method of claim 7, wherein the bulk refresh policy refers to merging refresh operations of multiple pages into one bulk refresh operation;
the asynchronous refresh policy refers to asynchronously executing refresh operations in parallel with other tasks.
9. The method according to claim 5, wherein the allowing each GPU to register its host buffer one by one while the other GPUs continue to refresh to unregistered memory pages by polling registration according to a host buffer serialization registration policy includes:
in the system initialization stage, a registration sequence number is allocated to each GPU, and the registration sequence of each GPU is determined;
polling each GPU one by one according to the registration sequence number; when a certain GPU registers a host buffer, other GPUs cannot perform registration operation and need to continuously refresh data to an unregistered host memory page;
after a certain GPU completes registration of the host buffer, other unregistered GPUs continue to perform refresh operations to maintain data consistency.
10. The method of claim 1, wherein prior to performing GPU cache incremental allocation according to the initialized hardware and software environment, gradually optimizing management and access of device caches, the method further comprises:
preparing software and hardware environment, generating two sub-threads, initializing equipment buffer and host buffer.
CN202311757384.7A 2023-12-20 2023-12-20 Comprehensive optimization method for high-frequency checkpoint operation Active CN117435353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311757384.7A CN117435353B (en) 2023-12-20 2023-12-20 Comprehensive optimization method for high-frequency checkpoint operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311757384.7A CN117435353B (en) 2023-12-20 2023-12-20 Comprehensive optimization method for high-frequency checkpoint operation

Publications (2)

Publication Number Publication Date
CN117435353A true CN117435353A (en) 2024-01-23
CN117435353B CN117435353B (en) 2024-03-29

Family

ID=89546451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311757384.7A Active CN117435353B (en) 2023-12-20 2023-12-20 Comprehensive optimization method for high-frequency checkpoint operation

Country Status (1)

Country Link
CN (1) CN117435353B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893274A (en) * 2016-05-11 2016-08-24 华中科技大学 Device for building checkpoints for heterogeneous memory system
US11531485B1 (en) * 2021-09-07 2022-12-20 International Business Machines Corporation Throttling access to high latency hybrid memory DIMMs
CN116149818A (en) * 2023-02-10 2023-05-23 阿里云计算有限公司 Migration method, equipment, system and storage medium of GPU (graphics processing Unit) application
CN116455972A (en) * 2023-06-16 2023-07-18 中国人民解放军国防科技大学 Method and system for realizing simulation middleware based on message center communication

Also Published As

Publication number Publication date
CN117435353B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
US10310973B2 (en) Efficient memory virtualization in multi-threaded processing units
US10037228B2 (en) Efficient memory virtualization in multi-threaded processing units
US8458721B2 (en) System and method for implementing hierarchical queue-based locks using flat combining
US9268698B1 (en) Method and system for maintaining context event logs without locking in virtual machine
US9262174B2 (en) Dynamic bank mode addressing for memory access
CN108268385B (en) Optimized caching agent with integrated directory cache
US20070157200A1 (en) System and method for generating a lock-free dual queue
US20040205304A1 (en) Memory allocator for a multiprocessor computer system
RU2641244C2 (en) Unified access to jointly used and controlled memory
US20080141268A1 (en) Utility function execution using scout threads
US9448934B2 (en) Affinity group access to global data
CN113674133A (en) GPU cluster shared video memory system, method, device and equipment
US11301142B2 (en) Non-blocking flow control in multi-processing-entity systems
WO2023184900A1 (en) Processor, chip, electronic device, and data processing method
US10733101B2 (en) Processing node, computer system, and transaction conflict detection method
CN111897651B (en) Memory system resource management method based on label
CN114780025B (en) Software RAID request processing method, controller and RAID storage system
US9792209B2 (en) Method and apparatus for cache memory data processing
KR101943312B1 (en) Flash-based accelerator and computing device including the same
EP3662376A1 (en) Reconfigurable cache architecture and methods for cache coherency
Chen et al. Concurrent hash tables on multicore machines: Comparison, evaluation and implications
JP2023527770A (en) Inference in memory
CN117435353B (en) Comprehensive optimization method for high-frequency checkpoint operation
US10303375B2 (en) Buffer allocation and memory management
García-Guirado et al. Energy-efficient cache coherence protocols in chip-multiprocessors for server consolidation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant