CN108491159B - Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay - Google Patents
Large-scale parallel system checkpoint data writing method for relieving I/O bottleneck based on random delay
- Publication number
- CN108491159B (granted publication of application CN201810188654.XA)
- Authority
- CN
- China
- Prior art keywords
- time
- writing
- checkpoint
- write
- calculation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0611—Improving I/O performance in relation to response time
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0613—Improving I/O performance in relation to throughput
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Retry When Errors Occur (AREA)
Abstract
The invention discloses a checkpoint data writing method for large-scale parallel systems that relieves the I/O bottleneck based on random delay. The method uses a random-delay checkpoint file processing procedure to determine a predetermined delayed write time and disperses the write operations in time, thereby reducing the peak I/O write load at any one moment and relieving the I/O bottleneck. Before the large-scale parallel system executes the I/O operation, the associated data information of the system is periodically inspected; if the running application would be affected, the delay is abandoned and the write operation is executed immediately, so that long-term occupation of shared resources does not disturb the normal operation of the application; otherwise, the write proceeds at the determined delayed write time. The invention reduces the pressure that the conventional centralized writing mode places on the I/O subsystem across different system platforms, and obtains higher throughput and shorter global blocking time.
Description
Technical Field
The invention relates to a processing method in the field of high-performance computing for dynamically adjusting the optimal writing time of checkpoint data, and in particular to a checkpoint data write control method that relieves the I/O bottleneck caused by centralized writing of checkpoint data in a large-scale parallel system.
Background
High-performance computing mostly adopts a large-scale parallel computing mode. On its hardware infrastructure, a high-performance computing system comprises three major parts: computing nodes, a network interconnection system and a storage system. The storage system comprises a plurality of I/O nodes and external storage equipment; the I/O nodes run a parallel file system, respond to read-write requests of the computing nodes, and manage and schedule the external storage equipment.
With the continuous expansion of the scale of high-performance computing systems, software or hardware errors in some nodes become common when a parallel program runs for a long time on a large number of computing nodes, which brings new challenges to system reliability. The Mean Time Between Failures (MTBF) of a current supercomputer as a whole has dropped to several hours. Most long-running programs are MPI (Message Passing Interface) programs; when an error occurs in a node, the MPI processes on that node may hang or abort, so the program must be re-executed on all nodes, all previous computation results are lost, and resources are severely wasted. Meanwhile, because the MTBF of a large-scale high-performance computing system is only several hours, in the worst case the program restarts repeatedly and never completes. Therefore, to enable parallel programs to execute correctly, rollback recovery is widely used in high-performance computing as a fault-tolerance technique, one representative class of which is checkpoint software.
The checkpoint software periodically saves the relevant information of the application program on all corresponding nodes at the current moment, forming a checkpoint file set consisting of the single-node checkpoint files of all nodes, and then writes the set into stable storage through the I/O nodes. When a node fails, the checkpoint software reads the previous checkpoint data, creates processes according to the record, recovers the data, and thereby resumes execution of the application program, preserving the previously completed computation.
After the advent of checkpoint-based fault tolerance, reducing the overhead incurred during checkpointing became a major research issue. To date, most checkpoint software has focused on reducing the amount of checkpoint data saved by a single node in a single checkpoint operation. However, in large-scale or super-large-scale clusters, because the number of nodes in the computing node set is so large, considering only the size of the data to be saved still cannot achieve the best effect. In the currently mainstream checkpoint software using a coordinated synchronization protocol, before a checkpoint operation is performed, a global synchronization is carried out on all nodes to reach a globally consistent state and avoid a possible domino effect (cascading rollbacks caused by global state inconsistency). After collecting the checkpoint data, the checkpoint software by default writes it directly into the external storage system so as to cope with node failures that may occur later (if a node goes down before its checkpoint data has been written to stable storage, that data is lost and the processes related to the node cannot be recovered). Because the number of I/O nodes in the system is far smaller than the number of computing nodes, centralized writing of checkpoint data by a huge number of computing nodes hits the I/O system hard and forms a system bottleneck; the problem becomes more prominent as the scale of high-performance computing systems grows.
For checkpoint software in a massively parallel system, reducing the impact of the checkpoint data writing process on the I/O subsystem is an important indicator of checkpoint software usability. At the bottom level, it is a matter of controlling the use of the shared I/O bandwidth of the system. To better control the peak of I/O bandwidth usage, the centralized I/O requests can be spread out in time to some extent: the checkpoint data is first cached in the memory of the current node and handled by an independent write module, and the write operations are then dispersed over a time interval, reducing the total amount of I/O written at any one moment. At the same time, a feedback regulation mechanism is introduced: during the delay wait, the hardware usage of the current computing node is periodically sampled, and information such as CPU utilization and memory occupancy is provided as feedback to the controller of the write module, finally yielding a reasonable I/O write timing strategy.
For checkpoint software using a coordination protocol, determining the degree of write parallelism without considering the system's I/O usage will impact the I/O subsystem to some extent, so how to adaptively and dynamically determine the optimal write timing of the checkpoint data according to the overall hardware conditions and real-time load of different systems becomes the key to solving the problem. Aiming at these problems, the invention provides a large-scale parallel system checkpoint data writing method that relieves the I/O bottleneck based on random delay.
Disclosure of Invention
The invention discloses a large-scale parallel system checkpoint data writing method that relieves the system I/O bottleneck through random delay. The method separates the write process from the process-resumption stage of the checkpoint operation: after each node in the system generates its checkpoint data, the data is temporarily cached in memory, the delayed write time of the checkpoint data is calculated with the corresponding random-delay checkpoint file processing method, and after the delayed write time expires the data is written into the external storage subsystem. During the delay wait, the usage of the relevant hardware is periodically inspected; if the running application would be affected, the delay is abandoned and the write operation is executed immediately, so that long-term occupation of hardware resources does not disturb the normal operation of related programs. Compared with the traditional centralized writing of checkpoint data, the method disperses the checkpoint write operations of the nodes in time, avoids the peak formed when all nodes write checkpoint data to the external storage system simultaneously, relieves the system I/O bottleneck, and improves the scalability of the checkpoint system.
The invention relates to a large-scale parallel system checkpoint data writing method for relieving the I/O bottleneck based on random delay, which specifically comprises the following steps:
Step A: after the write module (20) finishes caching the associated data information, the current time is obtained as the starting time point t_start of the time section;
Step B: each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} obtains the end time point t_stop of the time section by using the random-delay checkpoint file processing method;
Step C: after the write time section [t_start, t_stop] is determined, an independent random value is recorded for each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c};
Step D: under the determined random value, each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines its relative time position in the time section [t_start, t_stop];
Step E: under the determined relative time position, each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines a predetermined delayed write time; the predetermined delayed write times are distributed uniformly in time order over the whole write time section [t_start, t_stop], yielding a time axis;
Step F: judge whether the current program running time has reached the predetermined delayed write time; if yes, execute step J; if not, execute step G;
Step G: periodically record the feedback information of each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c};
Step H: obtain the evaluation parameter K from the feedback information and compare K with the preset threshold K_threshold; if K ≥ K_threshold, execute step J; if K < K_threshold, execute step I;
Step I: when K < K_threshold is satisfied, the local running environment allows the delay to continue, and the method goes to step F;
Step J: write the cached associated data information into the external storage system (40), finishing the current delayed checkpoint data write operation.
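To make the control flow of steps A to J concrete, the following is a minimal, self-contained C sketch of the per-node delayed-write loop. It is illustrative only: the function names (evaluate_feedback, write_to_storage), the section length, the polling period, the threshold constant and the uniform random draw are assumptions for the example and are not taken from the patent text.

```c
/* Minimal sketch of the per-node delayed checkpoint write loop (steps A-J).
 * All names, constants and the feedback model are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define K_THRESHOLD 0.85   /* assumed preset threshold K_threshold          */
#define POLL_PERIOD 1      /* assumed feedback sampling period, in seconds  */

/* Placeholder for step G/H: sample CPU/memory usage, reduce it to one value. */
static double evaluate_feedback(void) {
    /* A real system would read /proc or call sysinfo(); here a stub value.  */
    return 0.5;
}

/* Placeholder for step J: write the cached checkpoint data to external storage. */
static void write_to_storage(const char *buf, size_t len) {
    (void)buf; (void)len;
    printf("checkpoint data written\n");
}

int main(void) {
    srand((unsigned)time(NULL) ^ (unsigned)getpid());

    time_t t_start = time(NULL);                      /* step A: left end of the section    */
    double t_len   = 30.0;                            /* step B: section length, e.g. T/3   */
    double x       = rand() / (double)RAND_MAX;       /* step C: independent random value    */
    double t_rel   = x * t_len;                       /* step D: relative time position      */
    time_t t_write = t_start + (time_t)t_rel;         /* step E: predetermined write time    */

    const char *ckpt = "cached checkpoint data";      /* cached associated data information  */

    for (;;) {
        if (time(NULL) >= t_write)                    /* step F: delay expired               */
            break;
        double k = evaluate_feedback();               /* steps G-H: feedback and evaluation  */
        if (k >= K_THRESHOLD)                         /* resources tight: stop delaying      */
            break;
        sleep(POLL_PERIOD);                           /* step I: keep delaying               */
    }
    write_to_storage(ckpt, 0);                        /* step J: write to external storage   */
    return 0;
}
```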
The large-scale parallel system checkpoint data writing method based on relieving the I/O bottleneck by random delay has the following advantages:
① For checkpoint programs using the coordination protocol, the invention flattens the write peak of checkpoint data by delaying the writes, thereby obtaining higher throughput.
② The invention uses the random-delay checkpoint file processing method to determine the delayed write time of the checkpoint data, and at the same time can adjust the optimal write time of the checkpoint data according to changes in system load.
③ Each node independently uses the random-delay checkpoint file processing method to calculate its delayed write time, without relying on centralized global scheduling control, which helps reduce the extra overhead of global synchronization in a large-scale parallel computing environment, shortens processing time, and improves the scalability of the checkpoint software.
④ The active delay technique of the invention is not found in conventional I/O optimization techniques; active delay effectively reduces write conflicts and reduces the I/O performance loss caused by periodic large-scale simultaneous write operations.
Drawings
FIG. 1 is a flow diagram showing the modules that adjust the checkpoint write timing in a parallel process according to the present invention.
FIG. 2 is a schematic diagram of the dynamic adjustment process of the predetermined delayed write time for any compute run node using the checkpoint writing method of the present invention.
FIG. 3 is a comparison graph of the delay time of the checkpoint file write operation under the same bandwidth and node count.
FIG. 4 is a graph illustrating the relationship between the delayed write time and the total write time of checkpoint data under I/O contention in a massively parallel system.
10. Execution module
20. Write module
30. Recovery module
40. External storage system
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to FIG. 1, in the process of writing checkpoint data in the massively parallel system, an execution module 10, a write module 20 and a recovery module 30 are used to relieve the impact on the I/O subsystem, and after the delayed write time expires, the checkpoint data is written into an external storage system 40.
(A) When the large-scale parallel system enters the checkpoint operation, the execution module 10 first suspends the processes, then completes the synchronization steps and collects the associated data information; the associated data information may include shared files, process information, memory information and the like;
(B) After the execution module 10 finishes collecting all the associated data information, the write module 20 caches the associated data information in the memory of the local node; after caching is complete, the execution module 10 sends a recovery instruction to the recovery module 30;
(C) The recovery module 30 restores the processes to the state before the suspension and releases the lock so that the processes resume execution; the checkpoint operation ends at this point.
Inside the write module 20, the corresponding delayed write time is calculated by the random-delay checkpoint file processing method, and the largest delayed write time is taken as the total write time. When the write module 20 reaches its delayed write time, or when the system environment parameter exceeds the preset threshold K_threshold so that it is no longer suitable to keep waiting, the associated data information is written into the external storage system 40.
In the present invention, the current set of computing nodes of the cluster is denoted AP = {P_0, P_1, P_2, ..., P_{a-1}}, where P_0 is the first computing node in the current cluster, P_1 the second, P_2 the third, and P_{a-1} the last; the subscript identifies each computing node within the current cluster, and a is the total number of computing nodes in the current cluster.
In the present invention, the subset of the computing node set AP = {P_0, P_1, P_2, ..., P_{a-1}} on which the current application runs (abbreviated as the compute run node set) is denoted BP = {bp_b, bp_{b+1}, ..., bp_c}, where bp_b is any compute run node, the subscript b is the identification number of that node, bp_{b+1} is the compute run node following bp_b, and bp_c is the last compute run node, with 0 ≤ b ≤ c ≤ a-1. The compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} is used to perform the computational tasks.
In the present invention, the write process for writing the checkpoint data is separated from the main process to form an independent write module 20, which runs inside the checkpoint software on each node of BP = {bp_b, bp_{b+1}, ..., bp_c}.
In the present invention, the sequence number of a checkpoint file write is denoted d, the size of the checkpoint file is denoted FILE (file size for short), the time to write the checkpoint file is denoted WT (file write time for short), the speed of writing the checkpoint file is denoted WV (file write rate for short), and the save period of the checkpoint file is denoted T_period. Taking d as the current write, the previous write is denoted d-1 and the next write d+1.
For any compute run node bp_b, the specific steps of the random-delay checkpoint file processing are as follows:
Step 1: the first random-delay calculation of the compute run node bp_b.
When the compute run node bp_b performs the first random-delay calculation, the preset checkpoint save period T_period can be referenced, taking 1/3 or 1/2 of T_period as the length of the time section [t_start, t_stop] of the first calculation, so that the corresponding end point (right end point) is t_stop = t_start + T_period/3 (or t_start + T_period/2), ensuring that the write-and-save operation completes before the next checkpoint operation. The write time of bp_b at the first calculation is denoted WT_1, the checkpoint file size at the first calculation FILE_1, and the checkpoint file write rate at the first calculation WV_1. Since no write rate is available for the initial checkpoint write, WV_1 is assigned zero.
Step 2: the second random-delay calculation of the compute run node bp_b.
At the second calculation, bp_b obtains the write time WT_1 of the first checkpoint file write; it then uses FILE_1 and WT_1 to compute the write rate of the second checkpoint file write, WV_2 = FILE_1 / WT_1, and uses FILE_2 and WV_2 to compute the write time of the second checkpoint file write, WT_2 = FILE_2 / WV_2, where FILE_2 is the checkpoint file size at the second calculation.
Step 3: the third random-delay calculation of the compute run node bp_b.
At the third calculation, bp_b obtains the write time WT_2 of the second checkpoint file write; it then uses FILE_2 and WT_2 to compute the write rate of the third checkpoint file write, WV_3 = FILE_2 / WT_2, and uses FILE_3 and WV_3 to compute the write time of the third checkpoint file write, WT_3 = FILE_3 / WV_3, where FILE_3 is the checkpoint file size at the third calculation.
Step 4: the processing after the third calculation is the same as in step 3.
Step 5: after the user program exits, or the checkpoint software receives a command to stop performing checkpoint operations, writing of checkpoint files ends.
Stated generally, during the checkpoint writing process the write time WT_{d-1} of the previous checkpoint file write must be obtained; then FILE_{d-1} and WT_{d-1} are used to compute the write rate of the current checkpoint file write, WV_d = FILE_{d-1} / WT_{d-1}, and FILE_d and WV_d are used to compute the write time of the current checkpoint file write, WT_d = FILE_d / WV_d.
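The write-rate/write-time recursion above can be expressed compactly in code. The following C sketch is illustrative only: the estimate_write_time helper, the sample file sizes and the measured first write time are assumptions for the example, not values from the patent.

```c
/* Sketch of the write-rate / write-time recursion:
 *   WV_d = FILE_{d-1} / WT_{d-1}   (rate measured from the previous write)
 *   WT_d = FILE_d / WV_d           (estimated time of the current write)
 * File sizes and the measured first write time are made-up example values. */
#include <stdio.h>

static double estimate_write_time(double file_prev, double wt_prev, double file_cur) {
    double wv = file_prev / wt_prev;   /* WV_d */
    return file_cur / wv;              /* WT_d */
}

int main(void) {
    double file[] = {10.0, 12.0, 11.0};   /* checkpoint sizes in MB, example only */
    double wt_prev = 2.0;                 /* measured time of the first write, s  */

    for (int d = 1; d < 3; d++) {
        double wt_est = estimate_write_time(file[d - 1], wt_prev, file[d]);
        printf("write %d: estimated WT = %.2f s\n", d + 1, wt_est);
        wt_prev = wt_est;  /* in practice this would be the measured time of write d */
    }
    return 0;
}
```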
In the present invention, in order to reserve a time interval of twice the estimated write time and thus ensure that the current checkpoint file is completely written before the next checkpoint operation is executed, the right end point of every write time section except the first satisfies t_stop = t_{ck+1} - 2·WT_d.
In the present invention, after the write time section [t_start, t_stop] is determined, each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} holds an independent random value: node bp_b has the random value x_b with 0 ≤ x_b ≤ 1, node bp_{b+1} has the random value x_{b+1} with 0 ≤ x_{b+1} ≤ 1, ..., and node bp_c has the random value x_c with 0 ≤ x_c ≤ 1.
Under the determined random values, each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines its relative time position in the time section [t_start, t_stop]: node bp_b has the relative time position p_b = x_b·(t_stop - t_start), node bp_{b+1} has p_{b+1} = x_{b+1}·(t_stop - t_start), ..., and node bp_c has p_c = x_c·(t_stop - t_start).
Under the determined relative time positions, each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines its predetermined delayed write time: node bp_b has the predetermined delayed write time T_b = t_start + p_b, node bp_{b+1} has T_{b+1} = t_start + p_{b+1}, ..., and node bp_c has T_c = t_start + p_c.
In the present invention, the predetermined delayed write times T_b, T_{b+1}, ..., T_c are thus distributed uniformly in time order over the whole write time section [t_start, t_stop], yielding a time axis.
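The per-node scheduling can be sketched as follows. This is illustrative: the node count, section length, seeding scheme and the draw of x uniformly in [0, 1] follow the reconstruction above and are assumptions, not prescribed by the patent.

```c
/* Sketch: each node independently draws x in [0,1] and schedules its write at
 * t_start + x * (t_stop - t_start). Node count, section length and the PRNG
 * seeding are example assumptions. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int    nodes   = 8;      /* small example instead of thousands of nodes */
    const double t_start = 0.0;
    const double t_stop  = 30.0;   /* e.g. T_period / 3 seconds after t_start     */

    for (int j = 0; j < nodes; j++) {
        srand(1000u + (unsigned)j);              /* each node seeds independently */
        double x = rand() / (double)RAND_MAX;    /* independent random value x_j  */
        double t_write = t_start + x * (t_stop - t_start);
        printf("node %d: predetermined delayed write time = %.2f s\n", j, t_write);
    }
    return 0;
}
```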
In the present invention, after the time axis is obtained, the write module 20 performs periodic performance detection to obtain the resource usage of each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c}, referred to as feedback information IM; the feedback information IM is ultimately used to adjust the final write time.
The invention uses the preset initial reference values and the corresponding random-delay calculation algorithm: first, within the corresponding time section [t_start, t_stop], every node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines its predetermined delayed write time (bp_b obtains T_b, bp_{b+1} obtains T_{b+1}, ..., bp_c obtains T_c, each lying within [t_start, t_stop]). Then, before the write module actually executes the I/O operation, the usage of the relevant hardware (CPU, memory and the like) of the node as a whole and of the relevant programs on the local node is periodically sampled as feedback information and used as a reference, and the corresponding value K_{i+x} is calculated for evaluation. If this value exceeds the preset standard K_p, the delay operation is abandoned and the data write operation is executed immediately at the current moment, so that long-term occupation of shared resources does not affect the normal operation of related programs; if the value K_{i+x} stays below the preset threshold K_p, the delay operation does not affect the related programs, and the delayed write operation is executed at the calculated delayed write time.
The large-scale parallel system checkpoint data writing method of the invention, which senses the real-time state of the local node and relieves the I/O bottleneck by random delay, comprises the following processing steps:
Step one: determining the predetermined delayed write time.
The following describes, with reference to FIG. 3, the determination of the delayed write time corresponding to any compute run node bp_b in the checkpoint software.
First, the write time section should be determined, denoted [t_start, t_stop]; that is, the earliest time t_start at which a random delayed write operation may be executed and the latest time t_stop together form the time section. The time at which the write module receives the request is generally taken as the zero point of the time section, i.e. the starting point (left end point) t_start, which means that some write requests can be executed immediately without any delay operation.
At the same time, the length of the time section [t_start, t_stop] should be less than the interval between two adjacent checkpoint operations, i.e. t_ck < t_start < t_stop < t_{ck+1} with T_period = t_{ck+1} - t_ck, to ensure that the current checkpoint save operation has completed before the next checkpoint operation at time t_{ck+1}. On the other hand, before the write module actually performs the write operation, the checkpoint data is temporarily held in the memory of bp_b and therefore occupies part of bp_b's system resources, which may affect the performance of the running application to some extent.
Step two: monitoring and feeding back the use efficiency of the local node in real time;
(A) obtaining CPU and memory use information of local total and current program or other necessary related programs through various methods provided by the operating system, such as system call, terminal command, etc., and recording as feedback information, wherein the feedback information comprises current CPU use rate UcpuTotal amount of memory CMTotal amount of memory used UMTotal remaining amount of memory UminVirtual memory swap area size UvmuVirtual memory swap area buffer size UvmcAnd the like. This process is periodic until the write operation is actually performed.
(B) The operation of the feedback mechanism is described in connection with FIG. 3. First, the collected usage information is processed according to the following rules to obtain the evaluation parameter K_{i+x}.
The specific judgment rules may differ, for example whether the remaining memory is below a preset value, whether the CPU usage is above a preset value, or a formula taking the various factors as parameters; each factor is given a coefficient a_i and the result is normalized. This normalized value is used as the standard for judging whether the checkpoint data temporarily held in memory currently affects the running performance of the current program, and it also helps exclude the influence of interfering samples: for example, only when the feedback value exceeds the preset value K_p several times in a row, or exceeds K_p by a large margin, is the current node judged to be in a resource-shortage state, in which case the write operation must be executed immediately to release the shared resources currently occupied. If the judgment result shows that the running program is indeed affected, the node no longer waits for the predetermined delayed write time and performs the write operation immediately at the current moment.
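As one possible concrete form of this feedback rule, the sketch below combines memory occupancy and load average into a single normalized score and compares it with a threshold. The weighting coefficients, the threshold, the use of sysinfo() and the load average as a rough CPU proxy, and the requirement of several consecutive violations are all illustrative assumptions, not the patent's prescribed formula.

```c
/* Sketch of one possible evaluation parameter K: a weighted, normalized mix of
 * memory occupancy and 1-minute load average, with a "several consecutive
 * violations" rule. Weights, threshold and the strike count are assumed. */
#include <stdio.h>
#include <unistd.h>
#include <sys/sysinfo.h>   /* sysinfo(), get_nprocs() - Linux/glibc only */

#define A_MEM   0.6        /* assumed coefficient for memory pressure       */
#define A_CPU   0.4        /* assumed coefficient for CPU pressure          */
#define K_P     0.85       /* assumed preset threshold K_p                  */
#define STRIKES 3          /* assumed number of consecutive violations      */

static double evaluate_k(void) {
    struct sysinfo si;
    if (sysinfo(&si) != 0)
        return 0.0;                                   /* on failure, report no pressure  */
    double mem_used = 1.0 - (double)si.freeram / (double)si.totalram;
    double load1    = si.loads[0] / 65536.0;          /* fixed-point 1-min load average  */
    double cpu_used = load1 / (double)get_nprocs();   /* rough CPU-pressure proxy        */
    if (cpu_used > 1.0) cpu_used = 1.0;
    return A_MEM * mem_used + A_CPU * cpu_used;       /* evaluation parameter K          */
}

int main(void) {
    int strikes = 0;
    for (int i = 0; i < 5; i++) {                     /* a few periodic samples          */
        double k = evaluate_k();
        strikes = (k >= K_P) ? strikes + 1 : 0;
        printf("sample %d: K = %.3f, strikes = %d\n", i, k, strikes);
        if (strikes >= STRIKES) {
            printf("resource shortage: write immediately\n");
            return 0;
        }
        sleep(1);                                     /* assumed sampling period         */
    }
    printf("delay maintained until the predetermined write time\n");
    return 0;
}
```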
Step three: delaying writing;
Example 1
As shown in FIG. 3, assume the total bandwidth of the I/O subsystem is 100 GB/s and 16000 nodes in total perform the checkpoint file write operation. If each node writes 10 MB of checkpoint data, then in an ideal environment without I/O conflicts, 160 GB would have to be written within 1 s, i.e. 160 GB/s; if the writes are delayed over 5 s, an average of 32 GB/s must be written during those 5 s; if delayed over 10 s, only 16 GB needs to be written per second, and the bandwidth occupation drops below the total bandwidth of the system. The actual write time will be longer than the theoretical time, because a large number of simultaneous writes causes conflicts that reduce I/O efficiency; the extent of such conflicts decreases as the delay increases.
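Written out, the arithmetic of this example (a restatement of the figures above, nothing new) is:

```latex
\[
16000 \times 10\,\mathrm{MB} = 160\,\mathrm{GB},\qquad
\frac{160\,\mathrm{GB}}{1\,\mathrm{s}} = 160\,\mathrm{GB/s},\qquad
\frac{160\,\mathrm{GB}}{5\,\mathrm{s}} = 32\,\mathrm{GB/s},\qquad
\frac{160\,\mathrm{GB}}{10\,\mathrm{s}} = 16\,\mathrm{GB/s}.
\]
```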
The case where the delay time is 0 in FIG. 4 represents the situation when the random delayed writing method is not used. When the delay time first increases, the total write time trends downward because reduced contention improves I/O efficiency; when the delayed write time approaches or even exceeds the original write time, I/O efficiency continues to improve, but the total write time no longer benefits and instead becomes essentially equal to the set delay time.
Claims (3)
1. A large-scale parallel system checkpoint data writing method for relieving the I/O bottleneck based on random delay, characterized by specifically executing the following steps:
Step A: after the write module (20) finishes caching the associated data information, the current time is obtained as the starting time point t_start of the time section;
Step B: each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} obtains the end time point t_stop of the time section by using the random-delay checkpoint file processing method;
Step C: after the write time section [t_start, t_stop] is determined, an independent random value is recorded for each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c};
Step D: under the determined random value, each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines its relative time position in the time section [t_start, t_stop];
Step E: under the determined relative time position, each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines a predetermined delayed write time; the predetermined delayed write times are distributed uniformly in time order over the whole write time section [t_start, t_stop], yielding a time axis;
Step F: judge whether the current program running time has reached the predetermined delayed write time; if yes, execute step J; if not, execute step G;
Step G: periodically record the feedback information of each node in the compute run node set BP = {bp_b, bp_{b+1}, ..., bp_c};
Step H: obtain the evaluation parameter K from the feedback information and compare K with the preset threshold K_threshold; if K ≥ K_threshold, execute step J; if K < K_threshold, execute step I;
Step I: when K < K_threshold is satisfied, indicating that the local running environment allows the delay to continue, go to step F;
Step J: write the cached associated data information into the external storage system (40), finishing the current delayed checkpoint data write operation.
2. The large-scale parallel system checkpoint data writing method for relieving the I/O bottleneck based on random delay according to claim 1, characterized in that: in the process of writing the checkpoint data of the large-scale parallel system, an execution module (10), a write module (20) and a recovery module (30) are adopted to relieve the impact on the I/O subsystem, and the checkpoint data is written into an external storage system (40) after the delayed write time expires.
3. The large-scale parallel system checkpoint data writing method for relieving the I/O bottleneck based on random delay according to claim 1, characterized in that the specific steps of the random-delay checkpoint file processing are:
Step 1: the first random-delay calculation of the compute run node bp_b;
when the compute run node bp_b performs the first random-delay calculation, the preset checkpoint save period T_period can be referenced, taking 1/3 or 1/2 of T_period as the length of the time section [t_start, t_stop] of the first calculation, so that the corresponding end point (right end point) is t_stop = t_start + T_period/3 (or t_start + T_period/2), ensuring that the write-and-save operation completes before the next checkpoint operation; the write time of bp_b at the first calculation is denoted WT_1, the checkpoint file size at the first calculation FILE_1, and the checkpoint file write rate at the first calculation WV_1; since no write rate is available for the initial checkpoint write, WV_1 is assigned zero;
Step 2: the second random-delay calculation of the compute run node bp_b;
at the second calculation, bp_b obtains the write time WT_1 of the first checkpoint file write; it then uses FILE_1 and WT_1 to compute the write rate of the second checkpoint file write, WV_2 = FILE_1 / WT_1, and uses FILE_2 and WV_2 to compute the write time of the second checkpoint file write, WT_2 = FILE_2 / WV_2, where FILE_2 is the checkpoint file size at the second calculation;
Step 3: the third random-delay calculation of the compute run node bp_b;
at the third calculation, bp_b obtains the write time WT_2 of the second checkpoint file write; it then uses FILE_2 and WT_2 to compute the write rate of the third checkpoint file write, WV_3 = FILE_2 / WT_2, and uses FILE_3 and WV_3 to compute the write time of the third checkpoint file write, WT_3 = FILE_3 / WV_3, where FILE_3 is the checkpoint file size at the third calculation;
Step 4: for the compute run node bp_b, the processing after the third random-delay calculation is:
in the checkpoint writing process, the write time WT_{d-1} of the previous checkpoint file write is obtained; then FILE_{d-1} and WT_{d-1} are used to compute the write rate of the current checkpoint file write, WV_d = FILE_{d-1} / WT_{d-1}, and FILE_d and WV_d are used to compute the write time of the current checkpoint file write, WT_d = FILE_d / WV_d;
writing of checkpoint files ends after the user program exits or the checkpoint software receives a command to stop performing checkpoint operations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810188654.XA CN108491159B (en) | 2018-03-07 | 2018-03-07 | Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810188654.XA CN108491159B (en) | 2018-03-07 | 2018-03-07 | Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108491159A CN108491159A (en) | 2018-09-04 |
CN108491159B true CN108491159B (en) | 2020-07-17 |
Family
ID=63338108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810188654.XA Expired - Fee Related CN108491159B (en) | 2018-03-07 | 2018-03-07 | Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108491159B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992420B (en) * | 2019-04-08 | 2021-10-22 | 苏州浪潮智能科技有限公司 | Parallel PCIE-SSD performance optimization method and system |
CN110569201B (en) * | 2019-08-23 | 2021-09-10 | 苏州浪潮智能科技有限公司 | Method and device for reducing write latency under solid state disk GC |
CN112817541A (en) * | 2021-02-24 | 2021-05-18 | 深圳宏芯宇电子股份有限公司 | Write bandwidth control method, memory storage device and memory controller |
CN118244972A (en) * | 2022-12-24 | 2024-06-25 | 华为技术有限公司 | Data storage method, device and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102915257A (en) * | 2012-09-28 | 2013-02-06 | 曙光信息产业(北京)有限公司 | TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method |
CN103631815A (en) * | 2012-08-27 | 2014-03-12 | 深圳市腾讯计算机系统有限公司 | Method, device and system for obtaining check points in block synchronization parallel computing |
US8880941B1 (en) * | 2011-04-20 | 2014-11-04 | Google Inc. | Optimum checkpoint frequency |
CN104798059A (en) * | 2012-12-20 | 2015-07-22 | 英特尔公司 | Multiple computer system processing write data outside of checkpointing |
US9165012B2 (en) * | 2009-10-02 | 2015-10-20 | Symantec Corporation | Periodic file system checkpoint manager |
CN105573866A (en) * | 2009-07-14 | 2016-05-11 | 起元技术有限责任公司 | Method and system for fault tolerant batch processing |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105573866A (en) * | 2009-07-14 | 2016-05-11 | 起元技术有限责任公司 | Method and system for fault tolerant batch processing |
US9165012B2 (en) * | 2009-10-02 | 2015-10-20 | Symantec Corporation | Periodic file system checkpoint manager |
US8880941B1 (en) * | 2011-04-20 | 2014-11-04 | Google Inc. | Optimum checkpoint frequency |
CN103631815A (en) * | 2012-08-27 | 2014-03-12 | 深圳市腾讯计算机系统有限公司 | Method, device and system for obtaining check points in block synchronization parallel computing |
CN102915257A (en) * | 2012-09-28 | 2013-02-06 | 曙光信息产业(北京)有限公司 | TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method |
CN104798059A (en) * | 2012-12-20 | 2015-07-22 | 英特尔公司 | Multiple computer system processing write data outside of checkpointing |
Also Published As
Publication number | Publication date |
---|---|
CN108491159A (en) | 2018-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108491159B (en) | Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay | |
Cheng et al. | Erms: An elastic replication management system for hdfs | |
US8627143B2 (en) | Dynamically modeling and selecting a checkpoint scheme based upon an application workload | |
CN102591964B (en) | Implementation method and device for data reading-writing splitting system | |
CN111294234B (en) | Parallel block chain fragmentation method based on intelligent contract optimization model | |
IL199516A (en) | Optimizing non-preemptible read-copy update for low-power usage by avoiding unnecessary wakeups | |
CN110096350B (en) | Cold and hot area division energy-saving storage method based on cluster node load state prediction | |
WO2021139166A1 (en) | Error page identification method based on three-dimensional flash storage structure | |
CN111143142B (en) | Universal check point and rollback recovery method | |
CN110032450B (en) | Large-scale deep learning method and system based on solid-state disk extended memory | |
KR20130091368A (en) | Apparatus and method for scheduling kernel execution order | |
CN105760294A (en) | Method and device for analysis of thread latency | |
Yang et al. | Improving Spark performance with MPTE in heterogeneous environments | |
CN111176831B (en) | Dynamic thread mapping optimization method and device based on multithreading shared memory communication | |
KR100243461B1 (en) | Device and method for switching process | |
CN103685544A (en) | Performance pre-evaluation based client cache distributing method and system | |
CN114297002A (en) | Mass data backup method and system based on object storage | |
US20140007135A1 (en) | Multi-core system, scheduling method, and computer product | |
US20150121001A1 (en) | Storage control device and storage control method | |
CN109117247B (en) | Virtual resource management system and method based on heterogeneous multi-core topology perception | |
CN117075800A (en) | I/O perception self-adaptive writing method for massive check point data | |
Zhang et al. | Performance diagnosis and optimization for hyperledger fabric | |
Sharma et al. | Time warp simulation on clumps | |
Wang et al. | Mitigating I/O impact of checkpointing on large scale parallel systems | |
CN109032835B (en) | Software regeneration method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 20210419
Address after: 100160, No. 4, Building 12, No. 128, South Fourth Ring Road, Fengtai District, Beijing, China (1515-1516)
Patentee after: Kaixi (Beijing) Information Technology Co., Ltd.
Address before: 100191 Haidian District, Xueyuan Road, No. 37
Patentee before: BEIHANG University
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20200717 |