CN117435353A - Comprehensive optimization method for high-frequency checkpoint operation - Google Patents
- Publication number
- CN117435353A (Application No. CN202311757384.7A)
- Authority
- CN
- China
- Prior art keywords
- registration
- memory
- access
- host
- gpu
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The specification provides a comprehensive optimization method for high-frequency checkpoint operation, relating to the technical field of data recovery. According to the initialized software and hardware environment, GPU cache incremental allocation is performed, gradually optimizing management and access of the device cache; an opportunistic access strategy and a delayed registration strategy map host memory pages to the virtual address space in advance, and registration of the host memory pages is performed after all pages have been accessed; concurrency control and registration contention are optimized through a separated producer-consumer strategy and a host buffer serialized registration strategy. The method solves the problems of high initialization overhead and poor performance in concurrent tasks that existing optimization methods face in high-performance computing environments. Through reasonable cache optimization and concurrency control optimization, the method provides a unique and effective solution to the high-frequency checkpoint operation problem in high-performance computing environments and remarkably improves the performance and efficiency of computing tasks.
Description
Technical Field
The document relates to the technical field of data recovery, in particular to a comprehensive optimization method for high-frequency checkpoint operation.
Background
In a high performance computing environment, short-term computing tasks require frequent checkpointing to ensure data reliability. Although the running time of these tasks is very short, typically only a few seconds or minutes, the results produced are very important and cannot be replaced. Such tasks are widespread in scientific simulation, data analysis, machine learning training, and similar fields. Due to the uncertainty of the computing process and the high degree of parallelism of the computing nodes, these tasks are subject to interruption risks such as hardware failures, resource contention, and network interruptions. To ensure that completed computing states and intermediate results can be restored even if an interruption occurs, these short-term computing tasks require frequent checkpointing, saving the current state and data to persistent storage so that computation can resume from the most recent checkpoint after the interruption.
However, high frequency checkpointing also presents new challenges, including computing and storage overhead, and the impact of checkpointing on computing performance. Current solutions include:
(1) One-time allocation and fixing of host memory pages:
this approach may result in expensive initialization costs, especially in the concurrent case. In addition, fixed (pinned) host memory pages may reduce concurrency performance because the resources of the pinned pages may be contended between tasks.
(2) Direct access to host memory:
although initialization overhead may be reduced, performance may be limited due to underutilization of the advantages of GPU cache. Furthermore, concurrency performance between tasks may also be affected.
(3) Virtual memory management:
in a high performance computing environment, conventional virtual memory management methods may not fully exploit the capabilities of hardware devices, particularly when large-scale data transfer and high concurrent access are involved.
Therefore, an optimization scheme for high-frequency checkpointing is needed to solve the problems of high initialization overhead and poor performance of the existing optimization method in concurrent tasks in a high-performance computing environment.
Disclosure of Invention
The specification provides a comprehensive optimization method oriented to high-frequency check point operation, which solves the problems of high initialization overhead and poor performance of the existing optimization method in concurrent tasks in a high-performance computing environment.
In a first aspect, the present disclosure provides a comprehensive optimization method for high-frequency checkpointing, comprising three parts: GPU cache incremental allocation, opportunistic access and delayed registration of host memory, and concurrency control optimization. The method specifically includes the steps of:
according to the initialized software and hardware environment, GPU cache increment allocation is carried out, and management and access of equipment caches are gradually optimized;
mapping the host memory page to a virtual address space in advance by adopting an opportunistic access strategy and a delayed registration strategy, and performing registration operation of the host memory page after all pages are accessed;
concurrency control and registration contention are optimized by separate producer-consumer policies, host buffer serialization registration policies.
The invention has the following beneficial effects:
The method integrates GPU cache optimization and host memory optimization strategies, and the method for gradually and incrementally distributing the GPU cache and opportunistically accessing the host memory is adopted, so that the problems of high initialization cost, limited concurrency performance and the like in the prior art are effectively solved. This comprehensive cache optimization strategy brings a completely new solution for high frequency checkpointing. Secondly, the device cache initialization and management strategy of the method is innovative. By initializing the equipment cache in a reserved and mapped mode, the performance overhead and the competitive influence caused by the traditional one-time allocation and the fixed host memory page are avoided. The strategy fully utilizes GPU hardware resources while ensuring performance, and realizes efficient calculation and storage performance. In addition, the method introduces opportunistic access and delay registration strategies of the host memory, and the host memory is mapped in advance through the operation of writing one byte, so that the delay after the system is started is reduced. Meanwhile, by delaying registration, the competition of registration of a host buffer area is effectively reduced, and the performance is further optimized. Meanwhile, the method avoids competition between opportunistic access and check point refreshing by introducing a separated producer-consumer strategy, thereby improving the stability and performance of the system. In addition, the serialized host buffer registration policy further reduces registration contention to better leverage hardware resources. In summary, the method provides a unique and effective solution to the high-frequency checkpointing problem in a high-performance computing environment through reasonable cache optimization, concurrency control optimization and other means, and remarkably improves the performance and efficiency of computing tasks.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic diagram of a comprehensive optimization method for high frequency checkpointing according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a comprehensive optimization method flow for high-frequency checkpointing according to an embodiment of the present disclosure.
Detailed Description
For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Detailed description of the preferred embodiments
The embodiment provides a comprehensive optimization method for high-frequency checkpointing operation, which is shown in fig. 1, and the workflow of the method is shown in fig. 2;
it should be noted that, the high-frequency checkpointing mainly occurs in the computing process requiring frequent iteration, and because the task runs briefly, the interrupt factor may cause the completed computing to be partially lost, so that the checkpointing becomes a key step for ensuring computing restorability.
Specifically, the method comprises three main parts: GPU cache incremental allocation, host memory opportunistic access and delayed registration, and concurrency control optimization;
first, before performing step S1, the method further comprises: the software and hardware initialization step may be implemented as follows:
s01, software and hardware environment preparation
Before starting the initialization, it is necessary to prepare an appropriate software and hardware environment to ensure efficient implementation of the scheme. Specifically, the following requirements need to be met:
CUDA environment configuration: ensuring that the CUDA Toolkit is installed in the system so that the CUDA API function can be invoked. In addition, GPU hardware that supports CUDA is also required.
VELOC runtime Environment: ensuring that the VELOC runtime environment has been deployed and properly configured. This includes the VELOC library and related dependent items.
The multi-core processor: the system is ensured to have multiple cores available to generate sub-threads and to implement concurrent operations.
S02, generating a child thread
During initialization, two sub-threads are introduced for initialization of the device buffer and the host buffer, respectively, and subsequent asynchronous transfer operations. Therefore, the multi-core processor can be effectively utilized, concurrency is improved, and the initialization process is accelerated.
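The following minimal sketch (an illustration, not part of the patent text; the worker bodies are placeholders) shows one way the two initialization sub-threads could be spawned in a C++/CUDA host program:

```cpp
// Sketch of step S02, assuming a C++/CUDA host program: two sub-threads,
// one initializing the device cache (S03) and one the host buffer (S04).
#include <thread>
#include <cstddef>

void initialize_buffers(int gpu_id, size_t host_bytes) {
    std::thread device_init([gpu_id] {
        // S03: cuMemAddressReserve() + incremental cuMemMap() of the device cache
    });
    std::thread host_init([host_bytes] {
        // S04: malloc() the host buffer and touch each page to force mapping
    });
    device_init.join();   // both threads finish before computation begins
    host_init.join();
}
```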
S03, initializing equipment cache
Initialization of the device cache involves two main functions: cuMemAddressReserve() and cuMemMap().
The cuMemAddressReserve() function is a CUDA API used to reserve a memory address space on the device to ensure that subsequent allocations do not conflict with other memory regions.
The cuMemMap() function is used to map the reserved virtual address space stepwise onto physical HBM (High Bandwidth Memory). The cost of the mapping operation can be effectively managed through gradual mapping, and pages are mapped only when needed, thereby avoiding the performance problem caused by mapping a large number of pages at once.
S04, host buffer initialization
The initialization of the host buffer includes allocation of virtual memory and access of pages.
malloc (): this is a standard C library function that allocates a piece of virtual memory on the host. In this scheme, this function is used to allocate enough virtual memory for the host buffer for subsequent use.
Access to host memory: by accessing each page of the host buffer during the initialization phase, the operating system maps the physical page to a virtual address space for subsequent use.
Based on this, a basis is provided for subsequent operations by reasonable device cache and host memory initialization during the initialization phase.
S1, performing GPU cache increment allocation according to initialized software and hardware environments, and gradually optimizing management and access of equipment caches;
specifically, one specific implementation manner of step S1 may be:
s11, distributing equipment cache virtual memory: each sub-thread allocates virtual memory space for the GPU to which it belongs using a cumemadessreserve () function;
it should be noted that the cumemadd reserve () function plays an important role in the stage of device cache allocation, and by calling this function, each GPU allocates a virtual memory space for its cache, providing the necessary address space for the mapping operation. The allocated virtual address space will be used for subsequent device cache mapping, providing a basis for high-speed access of data.
S12, gradually mapping the virtual cache: based on the pre-allocated device cache virtual memory, determine the optimal block size for memory mapping through the cuDeviceGetAttribute() function; for each mapping block, allocate a physical memory page on the GPU using the cuMemAlloc() function; map the allocated physical memory pages into the virtual memory space by means of the cuMemMap() function, establishing the mapping relation between virtual addresses and physical addresses; gradually expand the size of the mapping area through sampling and repeated iteration, dynamically increasing the mapped memory range;
specific examples:
the device cache mapping stage needs to be carefully optimized to ensure that data is efficiently transferred between the device and the physical memory. The specific implementation steps are as follows:
(1) Selecting a mapping block size
By calling the cuDeviceGetAttribute () function, attribute information of the current GPU, in particular parameters related to the memory mapped optimal block size, is obtained. This step helps to select the appropriate mapping size to optimize data transmission efficiency.
It should be noted that, through the cuDeviceGetAttribute () function, attribute information of the GPU may be obtained, including an optimal block size of the memory map. Selecting an appropriate mapping block size according to hardware attributes helps to improve data transmission efficiency.
(2) Allocating physical memory pages
For each mapping block, a physical memory page is allocated on the GPU using the cuMemAlloc() function. These memory pages will be the basis for actual data storage, providing support for data transfer.
The cuMemAlloc() function allocates physical memory pages on the device to provide support for subsequent data transfer and storage operations.
(3) Mapping virtual memory
The previously allocated physical memory pages are mapped into virtual memory space by means of a cuMemMap () function. By establishing the mapping relation between the virtual address and the physical address, the GPU can directly access the memory pages, so that the access speed of data is greatly improved.
The cuMemMap () function is used to map the physical memory on the GPU into a virtual address space. Through this mapping, the GPU can directly access the physical memory, thereby achieving high-speed data access.
(4) Cyclic mapping
In order to adapt to the requirements of different calculation stages, the size of the mapping area is gradually expanded in a multi-iteration mode. This cyclic mapping strategy allows the memory range of the map to be dynamically increased in response to changes in the amount of data. By adjusting the size of the mapping at different stages, the data access requirements can be better matched, and the performance is improved.
S13, enabling access rights: and setting an access right mark according to the ID of the GPU and the starting address of the mapping region, and utilizing a cuMemSetAccess () function to ensure that data access operations among different GPUs can be performed simultaneously without conflict.
Specific examples:
to support concurrent access of data between multiple GPUs, access permissions are enabled for the mapped region of each GPU using a cuMemSetAccess () function. By setting the access right mark according to the ID of the GPU and the starting address of the mapping area, the data access operation between different GPUs can be ensured to be carried out simultaneously without conflict, thereby improving the overall calculation performance.
The cuMemSetAccess() function plays a key role in concurrent data access in a multi-GPU environment. By calling this function, access rights can be set for the mapped region of each GPU, ensuring that data access operations can be performed concurrently.
Based on the method, in the GPU cache increment allocation stage, the management and access of the equipment cache are optimized through the steps of selecting the size of a mapping block, allocating physical memory pages, mapping virtual memory and the like. Meanwhile, by enabling the access rights, data concurrent access among multiple GPUs is realized.
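For illustration, the sketch below is one assumption-laden reading of steps S11-S13 using the CUDA driver virtual memory management API; it is not the patent's reference code. The description names cuDeviceGetAttribute() and cuMemAlloc(), whereas the standard VMM workflow obtains the mapping granularity with cuMemGetAllocationGranularity() and backs cuMemMap() with handles from cuMemCreate(); that substitution is made here, and error handling is omitted.

```cpp
#include <cuda.h>
#include <vector>

struct DeviceCache {
    CUdeviceptr base = 0;   // reserved virtual range (S11)
    size_t reserved = 0;    // total reserved bytes
    size_t mapped = 0;      // bytes mapped so far; grows incrementally (S12)
    size_t block = 0;       // mapping block size (granularity)
    int gpu = 0;
    std::vector<CUmemGenericAllocationHandle> handles;
};

static CUmemAllocationProp device_prop(int gpu) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = gpu;
    return prop;
}

// S11: reserve a virtual address range sized up to the cache capacity.
void cache_reserve(DeviceCache& c, int gpu, size_t total_bytes) {
    c.gpu = gpu;
    CUmemAllocationProp prop = device_prop(gpu);
    cuMemGetAllocationGranularity(&c.block, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    c.reserved = ((total_bytes + c.block - 1) / c.block) * c.block;
    cuMemAddressReserve(&c.base, c.reserved, 0, 0, 0);
}

// S12 + S13: map one more block of physical HBM into the reserved range
// and enable read/write access for this GPU.
void cache_grow(DeviceCache& c) {
    if (c.mapped >= c.reserved) return;             // nothing left to map
    CUmemAllocationProp prop = device_prop(c.gpu);
    CUmemGenericAllocationHandle h;
    cuMemCreate(&h, c.block, &prop, 0);             // physical backing memory
    cuMemMap(c.base + c.mapped, c.block, 0, h, 0);  // bind virtual -> physical
    CUmemAccessDesc desc = {};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = c.gpu;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(c.base + c.mapped, c.block, &desc, 1);
    c.handles.push_back(h);
    c.mapped += c.block;
}
```

Calling cache_grow() at the start of each computation stage, or whenever the mapped region fills, realizes the cyclic, on-demand expansion of the mapping described above.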
S2, mapping the host memory page to a virtual address space in advance by adopting an opportunistic access strategy and a delayed registration strategy, and performing registration operation of the host memory page after all pages are accessed;
in order to allocate physical memory in advance and reduce registration overhead in the system initialization stage, the embodiment introduces an opportunistic access policy of the host memory. Through the strategy, the memory of the host can be effectively managed, and the system performance and the resource utilization rate are improved.
It should be noted that access to a host memory page triggers the reservation and mapping operations of the corresponding physical memory page. However, the registration overhead of fixed (pinned) memory pages is relatively small. Based on this principle, a delayed registration policy is adopted. Specifically, registration of host memory pages is deferred until all pages have been accessed. The advantage of this method is that the overhead of the registration operation is reduced while ensuring that the memory pages have been mapped.
Specifically, a specific implementation manner of step S2 may be:
s21, opportunistic access of pages: when the system is started, writing a byte into a host memory page which is not registered, and triggering an operating system to reserve and map a corresponding physical memory page into a virtual memory page in advance;
specific examples:
(1) Page opportunistic access
In the initialization stage, an opportunistic access policy of the host memory page is adopted. Specifically, a byte is written into a host memory page that has not been registered, and the operating system is triggered to reserve and map a corresponding physical memory page into a virtual memory page in advance. The main purpose of this step is to map the host memory to the virtual address space in advance when the system is started, avoiding the occurrence of page exchange during the system operation, and thus reducing the delay.
Opportunistic access does not involve the reading and writing of large-scale data, but rather merely the operation of writing one byte. This is because the goal is to trigger page mapping, not the processing of the actual data. Thus, by writing one byte, the operating system can be made aware that the pages need to be reserved and mapped in preparation for subsequent accesses.
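A minimal sketch of this one-byte touch pass (assuming a Linux host and a byte-addressable buffer; not taken from the patent text) could look as follows:

```cpp
// S21 sketch: write one byte into every page of the not-yet-registered host
// buffer so the OS reserves and maps the physical pages ahead of time.
#include <unistd.h>
#include <cstddef>

void opportunistic_touch(char* host_buf, size_t bytes) {
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);  // typically 4096 bytes
    for (size_t off = 0; off < bytes; off += page) {
        host_buf[off] = 0;  // single-byte write: triggers mapping, no bulk data I/O
    }
}
```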
S22, delay registration: when the memory is accessed and registered, the registration operation of the host memory page is performed after all the pages are accessed.
It should be noted that, the key to implementing the delayed registration is the tracking and recording of the page access. During the memory access process, we need to record the access condition of each host memory page, so as to register after all pages are accessed.
Further, one specific implementation manner of step S22 may be:
s221, access tracking: monitoring access of memory pages of a host computer each time in the memory access process;
it should be noted that this may be achieved by a hardware performance counter or inserting code when accessing a page.
S222, page record: marking each accessed page and recording in a list;
specifically, for each page accessed, it is marked as "accessed" and recorded in a list. This list will contain all host memory pages that have been accessed.
S223, delaying registration triggering: when all pages are accessed, the system triggers a delay registration operation; traversing the list of accessed pages, registering each page in the list, and fixing the physical memory mapping of the page.
Based on this, by accessing and delaying the registration phase in the host memory opportunistic access, the host memory is allocated in advance and registration overhead is reduced by page opportunistic access and delaying the registration policy.
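As one possible reading of steps S221-S223, the sketch below tracks accessed pages in a set and, once every page has been touched, performs the deferred registration. Using cuMemHostRegister() for the registration step and the tracker structure itself are assumptions for illustration, not details fixed by the patent.

```cpp
// Delayed-registration bookkeeping sketch (S221-S223).
#include <cuda.h>
#include <unistd.h>
#include <unordered_set>

struct HostPageTracker {
    char* base;
    size_t bytes, page, total_pages;
    std::unordered_set<size_t> accessed;   // S222: record of touched pages

    HostPageTracker(char* b, size_t n)
        : base(b), bytes(n), page((size_t)sysconf(_SC_PAGESIZE)),
          total_pages((n + page - 1) / page) {}

    // S221: called on every host-buffer access (e.g. from instrumented code)
    void on_access(size_t offset) {
        accessed.insert(offset / page);
        if (accessed.size() == total_pages) register_all();   // S223 trigger
    }

    // S223: deferred registration once all pages have been mapped
    void register_all() {
        cuMemHostRegister(base, bytes, CU_MEMHOSTREGISTER_PORTABLE);
    }
};
```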
S3, optimizing concurrency control and registration competition through separated producer-consumer strategies and host buffer zone serialization registration strategies.
It should be noted that, in order to optimize the competition between opportunistic access and checkpointing refresh, the present embodiment proposes a separate producer-consumer strategy. By efficiently allocating resources and managing memory accesses, concurrency contention can be minimized, thereby optimizing the alternation between refresh and opportunistic accesses.
In order to effectively solve the registration competition problem of the host buffer, the embodiment introduces a serialized host buffer registration policy. The strategy effectively reduces performance degradation caused by competition and reduces overall overhead.
Specifically, a specific implementation manner of step S3 may be:
s31, alternately executing a check point refreshing task and executing opportunistic access according to the separated producer-consumer policy;
wherein the producer-consumer policy comprises: a producer-consumer model; wherein,
the producer-consumer model refers to a concurrent programming model in which tasks are divided into two roles: a producer and a consumer, wherein the producer refers to a task that performs opportunistic access and the consumer refers to a task that performs checkpoint refresh.
By definitely dividing the two roles of producer and consumer, the competitive condition can be avoided, thereby improving the system efficiency.
Specifically, one specific implementation manner of step S31 may be:
s311, producer optimization: the producer accesses the memory pages of the host according to ascending order;
such sequential access may reduce contention between memory page accesses, thereby reducing latency. In a specific implementation, the sequence of page access can be adjusted, so that access of adjacent pages cannot jump greatly, and the locality of memory access is improved.
S312, consumer optimization: based on a batch refreshing strategy and an asynchronous refreshing strategy, a consumer refreshes the accessed page to a storage layer so as to keep the consistency of data;
further, the specific implementation manner of step S312 may be:
(1) Batch refreshing: combining the refresh operations of the plurality of pages into a batch refresh operation;
this can reduce the number of refresh operations, thereby reducing the overhead incurred by refresh.
(2) Asynchronous refresh: the refresh operation is asynchronously performed in parallel with other tasks.
By asynchronous refreshing, the bandwidth of the storage layer can be fully utilized, and the concurrency of the system is improved.
Based on this, by employing separate producer-consumer policies, separation of opportunistic access and checkpoint refresh is achieved, thereby reducing concurrency competition. The isolation between producer and consumer can effectively avoid competitive conditions and improve the stability and performance of the system.
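A compact sketch of this separation follows; the queue layout, batch size, and flush callback are illustrative assumptions rather than details from the patent.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

struct FlushQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<size_t> ready;      // page indices awaiting refresh
    bool done = false;
};

// Producer (opportunistic access): touch pages in ascending order (S311).
void producer(FlushQueue& q, size_t total_pages) {
    for (size_t p = 0; p < total_pages; ++p) {
        // ... touch page p ...
        std::lock_guard<std::mutex> lk(q.m);
        q.ready.push_back(p);
        q.cv.notify_one();
    }
    std::lock_guard<std::mutex> lk(q.m);
    q.done = true;
    q.cv.notify_one();
}

// Consumer (checkpoint refresh): drain pages in batches, flush off the lock (S312).
void consumer(FlushQueue& q, std::function<void(const std::vector<size_t>&)> flush,
              size_t batch_size = 64) {
    std::vector<size_t> batch;
    for (;;) {
        std::unique_lock<std::mutex> lk(q.m);
        q.cv.wait(lk, [&] { return !q.ready.empty() || q.done; });
        while (!q.ready.empty() && batch.size() < batch_size) {
            batch.push_back(q.ready.front());
            q.ready.pop_front();
        }
        bool finished = q.done && q.ready.empty();
        lk.unlock();
        if (!batch.empty()) { flush(batch); batch.clear(); }  // batched refresh
        if (finished) break;
    }
}
```

Running producer() and consumer() on separate threads keeps opportunistic access and checkpoint refresh decoupled, which is the point of the separated roles.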
S32, according to the host buffer serialization registration strategy, each GPU is allowed to register the host buffer one by one in a polling registration mode, and other GPUs continue to refresh to unregistered memory pages.
Specifically, one specific implementation of step S32 may be:
s321, registration sequence control: in the system initialization stage, a registration sequence number is allocated to each GPU, and the registration sequence of each GPU is determined;
these numbers may be assigned based on information of the GPU's physical location, device ID, etc., thereby achieving an orderly registration order.
S322, polling and registering process: polling each GPU one by one according to the registration sequence number; when a certain GPU registers a host buffer, other GPUs cannot perform registration operation and need to continuously refresh data to an unregistered host memory page;
s323, refreshing unregistered memory: after a certain GPU completes registration of the host buffer, other unregistered GPUs continue to perform refresh operations to maintain data consistency.
By polling the registration strategy, a simple and efficient contention optimization scheme is achieved. Under this strategy, each GPU has the opportunity to complete registration of the host buffer in sequence, while other GPUs maintain synchronization of data by continuous refresh operations, thereby avoiding excessive registration contention, reducing performance degradation and overhead increase.
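A minimal sketch of this polling scheme is shown below; the shared turn counter and the register_buffer()/flush_pending() callbacks are illustrative placeholders. Each GPU's worker thread would call it with the registration number assigned in step S321.

```cpp
// Serialized (polled) host-buffer registration sketch (S321-S323).
#include <atomic>
#include <functional>

void serialized_register(int my_order,                    // S321: per-GPU number
                         std::atomic<int>& turn,          // shared counter, starts at 0
                         std::function<void()> register_buffer,
                         std::function<bool()> flush_pending) {
    // S322: poll until it is this GPU's turn; keep refreshing meanwhile (S323)
    while (turn.load(std::memory_order_acquire) != my_order) {
        flush_pending();   // continue writing to still-unregistered host pages
    }
    register_buffer();     // register this GPU's host buffer
    turn.fetch_add(1, std::memory_order_release);  // pass the slot to the next GPU
}
```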
In summary, the embodiment combines the GPU cache optimization and the host memory optimization strategy, and adopts the method of gradually and incrementally allocating the GPU cache and opportunistically accessing the host memory, thereby effectively overcoming the problems of high initialization cost, limited concurrency performance and the like in the prior art. This comprehensive cache optimization strategy brings a completely new solution for high frequency checkpointing. Secondly, the device cache initialization and management strategy of the method is innovative. By initializing the equipment cache in a reserved and mapped mode, the performance overhead and the competitive influence caused by the traditional one-time allocation and the fixed host memory page are avoided. The strategy fully utilizes GPU hardware resources while ensuring performance, and realizes efficient calculation and storage performance. In addition, the method introduces opportunistic access and delay registration strategies of the host memory, and the host memory is mapped in advance through the operation of writing one byte, so that the delay after the system is started is reduced. Meanwhile, by delaying registration, the competition of registration of a host buffer area is effectively reduced, and the performance is further optimized. Meanwhile, the method avoids competition between opportunistic access and check point refreshing by introducing a separated producer-consumer strategy, thereby improving the stability and performance of the system. In addition, the serialized host buffer registration policy further reduces registration contention to better leverage hardware resources. In summary, the method provides a unique and effective solution to the high-frequency checkpointing problem in a high-performance computing environment through reasonable cache optimization, concurrency control optimization and other means, and remarkably improves the performance and efficiency of computing tasks.
Second embodiment
The embodiment provides a comprehensive optimization method for high-frequency checkpointing operation, so as to optimize checkpointing operation in a multi-GPU environment, and the flow of the method is shown in FIG. 2; the method specifically comprises the following steps:
step 1, initializing preparation
1.1 Software and hardware environment preparation
Before starting the initialization, it is necessary to prepare an appropriate software and hardware environment to ensure efficient implementation of the scheme. Specifically, the following requirements need to be met:
CUDA environment configuration: ensuring that the CUDA Toolkit is installed in the system so that the CUDA API function can be invoked. In addition, GPU hardware that supports CUDA is also required.
VELOC runtime Environment: ensuring that the VELOC runtime environment has been deployed and properly configured. This includes the VELOC library and related dependent items.
The multi-core processor: the system is ensured to have multiple cores available to generate sub-threads and to implement concurrent operations.
1.2 Generating sub-threads
During initialization, two sub-threads are introduced for initialization of the device buffer and the host buffer, respectively, and subsequent asynchronous transfer operations. Therefore, the multi-core processor can be effectively utilized, concurrency is improved, and the initialization process is accelerated.
1.3 Device cache initialization
Initialization of the device cache involves two main functions: cuMemAddressReserve() and cuMemMap().
The cuMemAddressReserve() function is a CUDA API used to reserve a memory address space on the device to ensure that subsequent allocations do not conflict with other memory regions.
The cuMemMap() function is used to map the reserved virtual address space stepwise onto physical HBM (High Bandwidth Memory). The cost of the mapping operation can be effectively managed through gradual mapping, and pages are mapped only when needed, thereby avoiding the performance problem caused by mapping a large number of pages at once.
1.4 Host buffer initialization
The initialization of the host buffer includes allocation of virtual memory and access of pages.
malloc (): this is a standard C library function that allocates a piece of virtual memory on the host. In this scheme, it is used to allocate enough virtual memory for host buffers for later use.
Access to host memory: by accessing each page of the host buffer during the initialization phase, the operating system will map the physical page to the virtual address space for subsequent use.
Step 2 GPU cache increment Allocation
2.1 Distributing equipment buffer virtual memory
In the initialization phase, each child thread allocates virtual memory space for the GPU to which it belongs using the cuMemAddressReserve() function. This virtual address space will be used for subsequent device cache mapping, providing the basis for high-speed access of data.
This critical function plays an important role in the phase of device cache allocation. By invoking this function, each GPU allocates virtual memory space for its cache, providing the necessary address space for the mapping operation.
2.2 Gradual mapping virtual cache
The device cache mapping stage needs to be carefully optimized to ensure that data is efficiently transferred between the device and the physical memory. The specific implementation steps are as follows:
selecting a mapping block size:
by calling the cuDeviceGetAttribute () function, attribute information of the current GPU, in particular parameters related to the memory mapped optimal block size, is obtained. This step helps to select the appropriate mapping size to optimize data transmission efficiency.
Through this function, the attribute information of the GPU, including the optimal block size of the memory map, can be obtained. Selecting an appropriate mapping block size according to hardware attributes helps to improve data transmission efficiency.
Physical memory pages are allocated: for each mapping block, a physical memory page is allocated on the GPU using the cuMemAlloc() function. These memory pages will be the basis for actual data storage, providing support for data transfer.
The cuMemAlloc() function allocates physical memory pages on the device, providing support for subsequent data transfer and storage operations.
Mapping virtual memory: the previously allocated physical memory pages are mapped into virtual memory space by means of a cuMemMap () function. By establishing the mapping relation between the virtual address and the physical address, the GPU can directly access the memory pages, so that the access speed of data is greatly improved.
This function is critical for mapping physical memory on the GPU into virtual address space. Through this mapping, the GPU can directly access the physical memory, thereby achieving high-speed data access.
Cyclic mapping: in order to adapt to the requirements of different calculation stages, the size of the mapping area is gradually expanded in a multi-iteration mode. This cyclic mapping strategy allows the memory range of the map to be dynamically increased in response to changes in the amount of data. By adjusting the size of the mapping at different stages, the data access requirements can be better matched, and the performance is improved.
2.3 Enabling access rights
To support concurrent access of data between multiple GPUs, access permissions are enabled for the mapped region of each GPU using a cuMemSetAccess () function. By setting the access right mark according to the ID of the GPU and the starting address of the mapping area, the data access operation between different GPUs can be ensured to be carried out simultaneously without conflict, thereby improving the overall calculation performance.
The cuMemSetAccess() function plays a key role in concurrent data access in a multi-GPU environment. By calling this function, access rights can be set for the mapped region of each GPU, ensuring that data access operations can be performed concurrently.
Step 3, opportunistic access and delayed registration of host memory
In order to allocate physical memory in advance and reduce registration overhead in the system initialization stage, the scheme introduces an opportunistic access policy of the host memory. Through the strategy, the memory of the host can be effectively managed, and the system performance and the resource utilization rate are improved. The following are the detailed implementation steps of the strategy:
3.1 Page opportunistic access
In the initialization stage, an opportunistic access policy of the host memory page is adopted. Specifically, a byte is written into a host memory page that has not been registered, and the operating system is triggered to reserve and map a corresponding physical memory page into a virtual memory page in advance. The main purpose of this step is to map the host memory to the virtual address space in advance when the system is started, avoiding the occurrence of page exchange during the system operation, and thus reducing the delay.
Opportunistic access does not involve the reading and writing of large-scale data, but rather merely the operation of writing one byte. This is because the goal is to trigger page mapping, not the processing of the actual data. Thus, by writing one byte, the operating system can be made aware that the pages need to be reserved and mapped in preparation for subsequent accesses.
3.2 Delay registration
Access to a host memory page triggers the reservation and mapping operations of the corresponding physical memory page. However, the registration overhead of fixed (pinned) memory pages is relatively small. Based on this principle, a delayed registration policy is adopted. Specifically, registration of host memory pages is deferred until all pages have been accessed. The advantage of this method is that the overhead of the registration operation is reduced while ensuring that the memory pages have been mapped.
The key to achieving deferred registration is the tracking and recording of page accesses. In the memory access process, the access condition of each host memory page needs to be recorded so as to register after all pages are accessed. The following steps may be taken:
and (3) access tracking, namely monitoring access of the memory pages of the host computer each time in the memory access process. This may be achieved by a hardware performance counter or inserting code when accessing the page.
Page record-for each page accessed, it is marked as "accessed" and recorded in a list. This list will contain all host memory pages that have been accessed.
Delay registration trigger:
when all pages have been accessed, the system triggers a deferred registration operation. At this stage, the list of accessed pages is traversed, each page therein is registered, and its physical memory map is fixed.
Step 4, optimizing concurrency control
To optimize the competition between opportunistic access and checkpoint refresh, the present scheme proposes a separate producer-consumer strategy. By efficiently allocating resources and managing memory accesses, concurrency contention can be minimized, thereby optimizing the alternation between refresh and opportunistic accesses. The following are the detailed implementation steps of the isolated producer-consumer strategy, and further optimization details:
4.1 Producer-consumer model
The producer-consumer model is a concurrent programming model in which tasks are divided into two roles: producer and consumer. In this context, the producer refers to a task that performs opportunistic access, while the consumer refers to a task that performs checkpoint refresh. By explicitly dividing these two roles, race conditions can be avoided, thereby improving system efficiency.
4.2 Isolated producer-consumer strategy
In order to achieve a separate producer-consumer strategy, the following optimization measures are employed:
optimizing the producer: the producer accesses the host memory pages in ascending order. Such sequential access may reduce contention between memory page accesses, thereby reducing latency. In a specific implementation, the sequence of page access can be adjusted, so that access of adjacent pages cannot jump greatly, and the locality of memory access is improved.
Consumer optimization: the consumer refreshes the accessed page to the storage layer to maintain data consistency. To optimize consumer performance, the following strategies may be employed:
batch refreshing: the refresh operations for multiple pages are combined into one bulk refresh operation. This may reduce the number of refresh operations, thereby reducing the overhead incurred by refresh.
Asynchronous refresh: the refresh operation is asynchronously performed in parallel with other tasks. By asynchronous refreshing, the bandwidth of the storage layer can be fully utilized, and the concurrency of the system is improved.
By employing separate producer-consumer policies, separation of opportunistic access and checkpoint refresh is achieved, thereby reducing concurrency competition. The isolation between producer and consumer can effectively avoid competitive conditions and improve the stability and performance of the system.
Step 5 host buffer serialization
In the scheme, in order to effectively solve the registration competition problem of the host buffer, a serialized host buffer registration strategy is introduced. The core idea of the strategy is to allow each GPU to register the host buffer area one by one in a polling registration mode, and other GPUs continue to refresh the unregistered memory pages, so that performance degradation caused by competition is effectively reduced, and overall cost is reduced.
5.1 Detailed implementation of a poll registration policy
The implementation steps of the strategy are as follows:
registration order control: in the system initialization phase, each GPU is assigned a registration order number to determine the order in which they are registered. These numbers may be assigned based on information of the GPU's physical location, device ID, etc., thereby achieving an orderly registration order.
Polling registration process: each GPU is polled one by one according to the registration order number. When a GPU registers with a host buffer, other GPUs are blocked and cannot register, but continue to refresh data to unregistered host memory pages.
Refreshing unregistered memory: after a certain GPU completes registration of the host buffer, other unregistered GPUs continue to perform refresh operations to maintain data consistency.
By polling the registration strategy, a simple and efficient contention optimization scheme is achieved. Under this strategy, each GPU has the opportunity to complete registration of the host buffer in sequence, while other GPUs maintain synchronization of data by continuous refresh operations, thereby avoiding excessive registration contention, reducing performance degradation and overhead increase.
In summary, the present embodiment comprehensively uses methods such as GPU cache optimization, host memory optimization, concurrency control optimization, and the like, and solves the challenges faced by high-frequency checkpointing in the prior art by gradually and incrementally allocating GPU caches and opportunistic access policies to host memory. In the specific implementation, the scheme is initialized through reasonable equipment cache and host memory in the initialization stage, and a basis is provided for subsequent operation. In the GPU cache increment allocation stage, the management and access of equipment caches are optimized through the steps of selecting the size of a mapping block, allocating physical memory pages, mapping virtual memory and the like. Meanwhile, by enabling the access rights, data concurrent access among multiple GPUs is realized. In the stage of opportunistic access and delayed registration of the host memory, the host memory is allocated in advance and registration overhead is reduced through page opportunistic access and delayed registration strategies. Finally, concurrent control and registration contention is optimized through separate producer-consumer policy and host buffer serialized registration, enabling high performance checkpointing. The technical scheme has remarkable application value in the field of high-performance calculation, and can improve the calculation performance and efficiency.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.
Claims (10)
1. A comprehensive optimization method for high-frequency checkpointing operation, comprising:
according to the initialized software and hardware environment, GPU cache increment allocation is carried out, and management and access of equipment caches are gradually optimized;
mapping the host memory page to a virtual address space in advance by adopting an opportunistic access strategy and a delayed registration strategy, and performing registration operation of the host memory page after all pages are accessed;
concurrency control and registration contention are optimized by separate producer-consumer policies, host buffer serialization registration policies.
2. The method according to claim 1, wherein the performing GPU cache increment allocation according to the initialized hardware and software environment gradually optimizes management and access of the device cache, includes:
each sub-thread allocates virtual memory space for the GPU to which it belongs using the cuMemAddressReserve() function;
determining the optimal block size of the memory mapping through the cuDeviceGetAttribute() function based on the pre-allocated device cache virtual memory; for each mapping block, allocating a physical memory page on the GPU using the cuMemAlloc() function; mapping the allocated physical memory pages into a virtual memory space by means of the cuMemMap() function, and establishing a mapping relation between virtual addresses and physical addresses; the size of the mapping area is gradually expanded in a sampling and repeated iteration mode, and the memory range of mapping is dynamically increased;
and setting an access right mark according to the ID of the GPU and the starting address of the mapping region, and utilizing a cuMemSetAccess () function to ensure that data access operations among different GPUs can be performed simultaneously without conflict.
3. The method of claim 1, wherein the mapping the host memory page to the virtual address space in advance using the opportunistic access policy and the deferred registration policy, and performing the registration of the host memory page after all the pages have been accessed, comprises:
when the system is started, writing a byte into a host memory page which is not registered, and triggering an operating system to reserve and map a corresponding physical memory page into a virtual memory page in advance;
when the memory is accessed and registered, the registration operation of the host memory page is performed after all the pages are accessed.
4. The method of claim 3, wherein the accessing and registering the memory after all the pages have been accessed comprises:
monitoring access of memory pages of a host computer each time in the memory access process;
marking each accessed page and recording in a list;
when all pages are accessed, the system triggers a delay registration operation; traversing the list of accessed pages, registering each page in the list, and fixing the physical memory mapping of the page.
5. The method of claim 1, wherein optimizing concurrency control and registration contention by separate producer-consumer policies, host buffer serialization registration policies, comprises:
alternately executing the checkpoint refresh task and executing the opportunistic access according to the separated producer-consumer policies;
according to the host buffer serialization registration strategy, each GPU is allowed to register the host buffer one by one in a polling registration mode, and other GPUs continue to refresh to unregistered memory pages.
6. The method of claim 5, wherein the tasks in the producer-consumer policy are divided into: a producer and a consumer, wherein the producer refers to a task that performs opportunistic access and the consumer refers to a task that performs checkpoint refresh.
7. The method of claim 6, wherein the alternating execution of checkpoint refresh tasks with execution of opportunistic accesses according to separate producer-consumer policies comprises:
the producer accesses the memory pages of the host according to ascending order;
based on the batch refreshing strategy and the asynchronous refreshing strategy, the consumer refreshes the accessed pages to the storage layer so as to maintain the consistency of the data.
8. The method of claim 7, wherein the bulk refresh policy refers to merging refresh operations of multiple pages into one bulk refresh operation;
the asynchronous refresh policy refers to asynchronously executing refresh operations in parallel with other tasks.
9. The method according to claim 5, wherein the allowing each GPU to register its host buffer one by one while the other GPUs continue to refresh to unregistered memory pages by polling registration according to a host buffer serialization registration policy includes:
in the system initialization stage, a registration sequence number is allocated to each GPU, and the registration sequence of each GPU is determined;
polling each GPU one by one according to the registration sequence number; when a certain GPU registers a host buffer, other GPUs cannot perform registration operation and need to continuously refresh data to an unregistered host memory page;
after a certain GPU completes registration of the host buffer, other unregistered GPUs continue to perform refresh operations to maintain data consistency.
10. The method of claim 1, wherein prior to performing GPU cache incremental allocation according to the initialized hardware and software environment, gradually optimizing management and access of device caches, the method further comprises:
preparing software and hardware environment, generating two sub-threads, initializing equipment buffer and host buffer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311757384.7A CN117435353B (en) | 2023-12-20 | 2023-12-20 | Comprehensive optimization method for high-frequency checkpoint operation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311757384.7A CN117435353B (en) | 2023-12-20 | 2023-12-20 | Comprehensive optimization method for high-frequency checkpoint operation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117435353A true CN117435353A (en) | 2024-01-23 |
CN117435353B CN117435353B (en) | 2024-03-29 |
Family
ID=89546451
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311757384.7A Active CN117435353B (en) | 2023-12-20 | 2023-12-20 | Comprehensive optimization method for high-frequency checkpoint operation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117435353B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893274A (en) * | 2016-05-11 | 2016-08-24 | 华中科技大学 | Device for building checkpoints for heterogeneous memory system |
US11531485B1 (en) * | 2021-09-07 | 2022-12-20 | International Business Machines Corporation | Throttling access to high latency hybrid memory DIMMs |
CN116149818A (en) * | 2023-02-10 | 2023-05-23 | 阿里云计算有限公司 | Migration method, equipment, system and storage medium of GPU (graphics processing Unit) application |
CN116455972A (en) * | 2023-06-16 | 2023-07-18 | 中国人民解放军国防科技大学 | Method and system for realizing simulation middleware based on message center communication |
Also Published As
Publication number | Publication date |
---|---|
CN117435353B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10310973B2 (en) | Efficient memory virtualization in multi-threaded processing units | |
US10037228B2 (en) | Efficient memory virtualization in multi-threaded processing units | |
US8458721B2 (en) | System and method for implementing hierarchical queue-based locks using flat combining | |
US9268698B1 (en) | Method and system for maintaining context event logs without locking in virtual machine | |
US9262174B2 (en) | Dynamic bank mode addressing for memory access | |
CN108268385B (en) | Optimized caching agent with integrated directory cache | |
US20070157200A1 (en) | System and method for generating a lock-free dual queue | |
US20040205304A1 (en) | Memory allocator for a multiprocessor computer system | |
RU2641244C2 (en) | Unified access to jointly used and controlled memory | |
US20080141268A1 (en) | Utility function execution using scout threads | |
US9448934B2 (en) | Affinity group access to global data | |
CN113674133A (en) | GPU cluster shared video memory system, method, device and equipment | |
US11301142B2 (en) | Non-blocking flow control in multi-processing-entity systems | |
WO2023184900A1 (en) | Processor, chip, electronic device, and data processing method | |
US10733101B2 (en) | Processing node, computer system, and transaction conflict detection method | |
CN111897651B (en) | Memory system resource management method based on label | |
CN114780025B (en) | Software RAID request processing method, controller and RAID storage system | |
US9792209B2 (en) | Method and apparatus for cache memory data processing | |
KR101943312B1 (en) | Flash-based accelerator and computing device including the same | |
EP3662376A1 (en) | Reconfigurable cache architecture and methods for cache coherency | |
Chen et al. | Concurrent hash tables on multicore machines: Comparison, evaluation and implications | |
JP2023527770A (en) | Inference in memory | |
CN117435353B (en) | Comprehensive optimization method for high-frequency checkpoint operation | |
US10303375B2 (en) | Buffer allocation and memory management | |
García-Guirado et al. | Energy-efficient cache coherence protocols in chip-multiprocessors for server consolidation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |