CN108491159B - Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay

Info

Publication number: CN108491159B
Application number: CN201810188654.XA
Authority: CN (China)
Prior art keywords: time, writing, checkpoint, write, calculation
Inventors: 刘轶, 孙庆峥, 朱延超
Original assignee: Beihang University (application filed by Beihang University)
Current assignee: Kaixi Beijing Information Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN108491159A (application publication)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/0611 - Improving I/O performance in relation to response time
    • G06F 3/0613 - Improving I/O performance in relation to throughput
    • G06F 3/0614 - Improving the reliability of storage systems


Abstract

The invention discloses a checkpoint data writing method for large-scale parallel systems that relieves the I/O bottleneck based on random delay. The invention uses a random-delay checkpoint-file processing method to determine a predetermined delayed write time for each node and disperses the write operations over time, reducing the peak I/O write load at any one moment and thereby relieving the I/O bottleneck. Before the large-scale parallel system executes the I/O operation, the associated data information is periodically inspected; if the delay would affect the running application program, the delay operation is abandoned and the write operation is executed immediately, avoiding the impact of long-term occupation of shared resources on the normal operation of the application. Otherwise, the write proceeds at the determined delayed write time. Compared with the traditional centralized writing pattern on different system platforms, the invention reduces the pressure on the I/O subsystem and achieves higher throughput and shorter global blocking time.

Description

Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay
Technical Field
The invention relates to a processing method in the field of high-performance computing for dynamically adjusting the optimal write time of checkpoint data, and in particular to a checkpoint-data write control method for relieving the I/O bottleneck caused by the centralized writing of checkpoint data in a large-scale parallel system.
Background
High-performance computing mostly adopts large-scale parallel computing, and on the hardware level a high-performance computing system comprises three major parts: computing nodes, a network interconnection system, and a storage system. The storage system comprises a number of I/O nodes and external storage devices; the I/O nodes run a parallel file system, respond to read-write requests from the computing nodes, and manage and schedule the external storage devices.
With the continuing growth in the scale of high-performance computing systems, software or hardware errors in some nodes become common when a parallel program runs for a long time on a large number of computing nodes, which poses new challenges to system reliability. The mean time between failures (MTBF) of a current full-scale supercomputer has fallen to a few hours. Most existing long-running programs are MPI (Message Passing Interface) programs; when an error occurs in one node, the MPI processes running on that node may hang or abort, so the program on every node must be re-executed and all previous computation results are lost, which is undoubtedly a serious waste of resources. Moreover, since the MTBF of a large-scale high-performance computing system is only a few hours, in the worst case the program restarts repeatedly and never runs to completion. Therefore, to allow parallel programs to execute correctly, rollback recovery is widely used in high-performance computing as a fault-tolerance technique; one representative class is checkpoint software.
Checkpoint software periodically saves the relevant information of the application program on all corresponding nodes at the current moment, forming a checkpoint file set composed of the single-node checkpoint files of every node, and then writes the set into stable storage through the I/O nodes. When a node fails, the checkpoint software reads back the previous checkpoint data, creates processes according to the records, restores the data, and thereby resumes execution of the application program, preserving the earlier computation results.
After the advent of checkpoint-based fault tolerance, reducing the overhead incurred during checkpointing became a major issue in checkpoint research. To date, most checkpoint software has focused on reducing the amount of data saved by a single checkpoint operation on a single node. However, in large-scale or super-large-scale clusters, the number of computing nodes is so large that considering only the size of the data to be saved cannot achieve the best effect. In the currently mainstream checkpoint software using a coordinated synchronization protocol, a global synchronization operation is performed on all nodes before a checkpoint operation to reach a globally consistent state, avoiding a possible domino effect (consecutive rollbacks caused by global state inconsistency). After collecting the checkpoint data, the checkpoint software by default writes it directly into the external storage system, to guard against a node crash that may occur later (if a node goes down before its checkpoint data reaches stable storage, that node's checkpoint data is lost and the processes associated with that node cannot be recovered). Because the number of I/O nodes in the system is far smaller than the number of computing nodes, the huge number of computing nodes writing checkpoint data in a concentrated fashion impacts the I/O system and forms a system bottleneck, a problem that becomes more prominent as high-performance computing systems grow in scale.
For checkpoint software in a massively parallel system, reducing the impact of the checkpoint-data writing process on the I/O subsystem is an important measure of checkpoint software usability. At bottom, it is a matter of controlling the use of the system's shared I/O bandwidth. To better control the peak of I/O bandwidth usage, the concentrated I/O requests can be dispersed in time to some extent: the checkpoint data is first cached in the memory of the local node and handled by an independent write module, and the write operations are then spread over a time interval, reducing the total amount of I/O writing at any one moment. A feedback-regulation mechanism is also introduced: during the delay wait, the hardware-usage information of the local computing node is periodically sampled, and information such as CPU utilization and memory occupancy is provided as feedback to the controller of the write module, finally yielding a reasonable I/O write-timing strategy.
For checkpoint software using a coordinated protocol, determining the degree of parallelism without considering system I/O usage causes a certain impact on the I/O subsystem; the key to solving the problem is therefore how to adaptively and dynamically determine the optimal write timing of the checkpoint data according to the hardware configuration and real-time load of different systems. To address these problems, the invention provides a large-scale parallel-system checkpoint-data writing method that relieves the I/O bottleneck based on random delay.
Disclosure of Invention
The invention discloses a large-scale parallel-system checkpoint data writing method that relieves the system I/O bottleneck through random delay. The method separates the checkpoint process from the write process: after each node in the system generates its checkpoint data, the data is temporarily cached in memory, the delayed write time of the checkpoint data is calculated by the corresponding random-delay checkpoint-file processing method, and the checkpoint data is written into the external storage subsystem once the delayed write time expires. During the delay wait, the usage information of the relevant hardware is periodically inspected; if the running application program would be affected, the delay is abandoned and the write operation is executed immediately, avoiding the impact of long-term occupation of hardware resources on the normal operation of the related programs. Compared with the traditional centralized checkpoint-data writing pattern, the method disperses the checkpoint write operations of the nodes in time, avoiding the peak formed when all nodes write checkpoint data into the external storage system simultaneously; it can relieve the system I/O bottleneck and improve the scalability of the checkpoint system.
The large-scale parallel-system checkpoint data writing method for relieving the I/O bottleneck based on random delay specifically comprises the following steps:
Step A: after the write module (20) finishes caching the associated data information, obtain the current time as the starting point t_start of the write time section;
Step B: each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c} obtains the end point t_end of the time section using the random-delay checkpoint-file processing method;
Step C: after the write time section [t_start, t_end] is determined, record the independent random value held by each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c};
Step D: under the determined random values, each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines its relative time position within the time section [t_start, t_end];
Step E: under the determined relative time positions, each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines a predetermined delayed write time; the predetermined delayed write times are distributed uniformly in time order over the whole write time section [t_start, t_end], yielding a time axis;
Step F: judge whether the current program running time has reached the predetermined delayed write time; if so, execute step J; if not, execute step G;
Step G: periodically record the feedback information of each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c};
Step H: obtain the evaluation parameter K_{i+x} from the feedback information and compare it with the preset threshold K_p; if K_{i+x} >= K_p, execute step J; if K_{i+x} < K_p, execute step I;
Step I: while K_{i+x} < K_p is satisfied, the local running environment allows the delay to continue; go to step F;
Step J: write the cached associated data information into the external storage system (40), completing the current delayed checkpoint data write operation.
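Steps A through J can be sketched for a single node as follows. This is a minimal sketch: the function and parameter names (delayed_checkpoint_write, get_feedback_k, write_fn, poll_interval) are illustrative assumptions, not part of the patent.

```python
import random
import time

def delayed_checkpoint_write(t_start, t_end, get_feedback_k, write_fn,
                             k_threshold, poll_interval=0.5):
    """Sketch of steps A-J for one compute-run node.

    t_start, t_end -- write time section endpoints (steps A-B)
    get_feedback_k -- returns the evaluation parameter K (steps G-H)
    write_fn       -- flushes the cached checkpoint data (step J)
    k_threshold    -- preset threshold K_p
    """
    r = random.random()                        # step C: independent random value
    t_write = t_start + r * (t_end - t_start)  # steps D-E: position in section
    while time.monotonic() < t_write:          # step F: delay time reached?
        if get_feedback_k() >= k_threshold:    # steps G-H: resources tight
            break                              # abandon the delay
        time.sleep(poll_interval)              # step I: keep delaying
    write_fn()                                 # step J: write to external storage
```

A node whose feedback parameter stays below K_p simply waits out its random position in the section; a node under resource pressure writes immediately, mirroring the two exits of step H.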
The method for writing checkpoint data of a large-scale parallel system based on random delay to relieve the I/O bottleneck has the following advantages:
① For checkpoint programs using a coordinated protocol, the invention slows the checkpoint-data write peak through delayed writing, thereby achieving higher throughput.
② The invention uses the random-delay checkpoint-file processing method to determine the delayed write time of the checkpoint data, and can correspondingly adjust the optimal write time of the checkpoint data as the system load changes.
③ Each node independently uses the random-delay checkpoint-file processing method to calculate its delayed write time, without relying on centralized global scheduling control, which helps reduce the extra overhead of global synchronization in a large-scale parallel computing environment, reduces processing time, and improves the scalability of the checkpoint software.
④ The active-delay technique of the invention is absent from conventional I/O optimization techniques; active delay effectively reduces write conflicts and the I/O performance loss caused by periodic large-scale simultaneous write operations.
Drawings
FIG. 1 is a flow diagram showing modules in a parallel process of adjusting checkpoint process writes according to the present invention.
FIG. 2 is a schematic diagram of a dynamic adjustment process for a predicted delay write time of any compute run node using the checkpoint writing method of the present invention.
Fig. 3 is a comparison graph of the delay time of the operation of writing a checkpoint file under the same bandwidth and node count.
FIG. 4 is a graph illustrating the efficiency of the delayed write time and the total write time of checkpoint data in I/O contention in a massively parallel system.
Reference numerals: 10: execution module; 20: write module; 30: recovery module; 40: external storage system.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to fig. 1, during the writing of checkpoint data in the massively parallel system, the execution module 10, the write module 20 and the recovery module 30 are used to relieve the impact on the I/O subsystem; after the delayed write time expires, the checkpoint data is written into the external storage system 40.
(A) When the large-scale parallel system enters the checkpoint operation, the execution module 10 first suspends the process; the execution module 10 then completes each synchronization step and collects the associated data information. The associated data information can include shared files, process information, memory information, and so on;
(B) after the execution module 10 finishes collecting all the associated data information, the write module 20 caches the associated data information in the memory of the local node; after caching is complete, the execution module 10 outputs a recovery instruction to the recovery module 30;
(C) the recovery module 30 restores the process to its state before the pause and releases the lock so that the process resumes execution; the checkpoint operation process ends at this point.
Within the write module 20, the corresponding delayed write time is calculated using the random-delay checkpoint-file processing method, and the largest delayed write time is selected and recorded as the total write time. When the write module 20 reaches the delayed write time, or a system environment parameter exceeds the preset threshold K_p so that further waiting is inappropriate, the associated data information is written into the external storage system 40.
In the present invention, assume the current computing-node set of the cluster is denoted AP = {P_0, P_1, P_2, ..., P_{a-1}}, where P_0 is the first computing node in the current cluster, P_1 the second, P_2 the third, and P_{a-1} the last; the subscript identifies the computing node within the current cluster, and a is the total number of computing nodes in the current cluster.
In the present invention, the subset of the computing-node set AP = {P_0, P_1, P_2, ..., P_{a-1}} on which the current application runs (abbreviated the compute-run node set) is denoted BP = {bp_b, bp_{b+1}, ..., bp_c}, where bp_b is any compute-run node, the subscript b its identification number, bp_{b+1} the node after bp_b, and bp_c the last compute-run node; BP ⊆ AP and 0 <= b <= c <= a-1. The compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c} is used to perform the computational tasks.
In the present invention, the write process that writes the checkpoint data is separated from the main process to form the independent write module 20, which runs within the checkpoint software on each node of BP = {bp_b, bp_{b+1}, ..., bp_c}.
In the present invention, the ordinal number of a checkpoint-file write is denoted d, the checkpoint file size is denoted FILE, the checkpoint-file write time is denoted WT (file write time for short), the checkpoint-file write rate is denoted WV (file write rate for short), and the checkpoint-file save period is denoted T_period. With d as the current write, d-1 denotes the previous write and d+1 the next write.
For any compute-run node bp_b, the specific steps of random-delay checkpoint-file processing are as follows:
Step one: the first random-delay calculation for node bp_b;
When bp_b performs the first random-delay calculation, it may refer to the preset checkpoint save period T_period, taking 1/3 or 1/2 of T_period as the length of the first time section [t_start, t_end], so that the corresponding end point (right end point) is t_end^1 = t_start + T_period/3 (or t_start + T_period/2), to ensure that the write-save operation completes before the next checkpoint operation. The write time of bp_b at the first calculation is recorded as WT_1, the checkpoint file size at the first calculation as FILE_1, and the checkpoint-file write rate at the first calculation as WV_1. Since no write rate exists in the initial checkpoint write process, WV_1 is assigned zero.
Step two, for the calculation operation node bpbThe second random delay calculation of (2);
the bp isbThe first write checksum is acquired during the second calculationWrite time of int files
Figure BDA0001591059600000066
Then on the one hand use
Figure BDA0001591059600000067
And
Figure BDA0001591059600000068
calculating the writing speed of the second time of writing into the checkpoint file, and recording as
Figure BDA0001591059600000069
And is
Figure BDA00015910596000000610
On the other hand use
Figure BDA00015910596000000611
And
Figure BDA00015910596000000612
calculating the writing time of writing the checkpoint file for the second time, and recording as
Figure BDA00015910596000000613
And is
Figure BDA00015910596000000614
The bp isbThe second calculation of (2) is noted as the checkpoint file size
Figure BDA00015910596000000615
Step three: the third random-delay calculation for node bp_b;
At the third calculation, bp_b obtains the write time WT_2 of the second checkpoint-file write. On the one hand, WT_2 and FILE_2 are used to calculate the write rate of the third checkpoint-file write, recorded as WV_3, with WV_3 = FILE_2 / WT_2. On the other hand, WV_3 and FILE_3 are used to calculate the write time of the third checkpoint-file write, recorded as WT_3, with WT_3 = FILE_3 / WV_3. The checkpoint file size at the third calculation of bp_b is recorded as FILE_3.
Step four, the processing subsequent to the third calculation is the same as that of step three;
and step five, after the user program exits or the checkpoint software receives the command and does not perform checkpoint operation any more, writing into the checkpoint file is finished.
To state it generally: during the checkpoint write process it is necessary to obtain the write time WT_{d-1} of the previous checkpoint-file write. On the one hand, WT_{d-1} and FILE_{d-1} are used to calculate the write rate of the current checkpoint-file write, recorded as WV_d, with WV_d = FILE_{d-1} / WT_{d-1}. On the other hand, WV_d and FILE_d are used to calculate the write time of the current checkpoint-file write, recorded as WT_d, with WT_d = FILE_d / WV_d. Here WT_d is the write time of the checkpoint-file write at the current time d; WT_{d-1} is the write time at the previous time d-1; WV_d is the write rate at the current time d; WV_{d-1} is the write rate at the previous time d-1; FILE_d is the checkpoint file size at the current time d; and FILE_{d-1} is the checkpoint file size at the previous time d-1.
In the present invention, in order to reserve a time interval of 2 WT_d and thereby ensure that the current checkpoint file is completely written before the next checkpoint operation executes, every right end point other than the first satisfies t_end^d = t_next - 2 WT_d, where t_next is the time of the next checkpoint operation.
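The write-rate and write-time recurrence above, together with the reserved right end point, can be sketched as follows (a minimal sketch; the helper names are assumptions):

```python
def predict_write_time(prev, file_size_d):
    """Recurrence from the description: WV_d = FILE_{d-1} / WT_{d-1},
    then WT_d = FILE_d / WV_d.  `prev` is (FILE_{d-1}, WT_{d-1}), or None
    for the first checkpoint, where no previous rate exists (WV_1 = 0)."""
    if prev is None:
        return None                 # first write: no rate yet, no prediction
    file_prev, wt_prev = prev
    wv_d = file_prev / wt_prev      # estimated write rate WV_d
    return file_size_d / wv_d       # predicted write time WT_d

def section_right_end(t_next_checkpoint, wt_d):
    """Right end point of the write section, reserving 2*WT_d before the
    next checkpoint so the current write can finish in time."""
    return t_next_checkpoint - 2.0 * wt_d
```

For example, if the previous 100 MB checkpoint took 4 s (a rate of 25 MB/s), a 50 MB checkpoint is predicted to take 2 s, and the section's right end point sits 4 s before the next checkpoint.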
In the present invention, once the write time section [t_start, t_end] is determined, each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c} holds an independent random value: the random value of node bp_b is recorded as r_b, with r_b in [0, 1); the random value of node bp_{b+1} is recorded as r_{b+1}, with r_{b+1} in [0, 1); and the random value of node bp_c is recorded as r_c, with r_c in [0, 1).
Under the determined random values, each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines its relative time position within the time section [t_start, t_end]: the relative time position of node bp_b is recorded as rp_b, with rp_b = r_b (t_end - t_start); that of node bp_{b+1} is recorded as rp_{b+1}, with rp_{b+1} = r_{b+1} (t_end - t_start); and that of node bp_c is recorded as rp_c, with rp_c = r_c (t_end - t_start).
Under the determined relative time positions, each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c} determines a predetermined delayed write time: the predetermined delayed write time of node bp_b is recorded as T_b, with T_b = t_start + rp_b; that of node bp_{b+1} is recorded as T_{b+1}, with T_{b+1} = t_start + rp_{b+1}; and that of node bp_c is recorded as T_c, with T_c = t_start + rp_c.
In the present invention, the predetermined delayed write times T_b, T_{b+1}, ..., T_c are distributed uniformly in time order over the whole write time section [t_start, t_end], yielding the time axis.
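As a sketch of how the independent random values map to delayed write times inside the section: drawing each value uniformly from [0, 1) is an assumption, since the exact range is not shown in this text.

```python
import random

def node_delay_times(t_start, t_end, num_nodes, seed=None):
    # Each node draws an independent random value r (assumed uniform in
    # [0, 1)) and maps it to a relative position in the write section,
    # giving its predetermined delayed write time
    # T = t_start + r * (t_end - t_start).
    rng = random.Random(seed)
    return sorted(t_start + rng.random() * (t_end - t_start)
                  for _ in range(num_nodes))
```

With many nodes, the sorted times approximate a uniform spread over [t_start, t_end], which is the time axis the description refers to.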
In the present invention, after the time axis is obtained, the write module 20 performs periodic performance detection to obtain the resource-usage situation of each node in the compute-run node set BP = {bp_b, bp_{b+1}, ..., bp_c}, called the feedback information IM; the feedback information IM is ultimately used to adjust the final write time.
Using the preset initial reference values and the corresponding random-delay calculation algorithm, the invention first determines, within the corresponding time section [t_start, t_end], a predetermined delayed write time for each compute-run node in BP = {bp_b, bp_{b+1}, ..., bp_c}: the predetermined delayed write time of bp_b is T_b, with T_b = t_start + rp_b; that of bp_{b+1} is T_{b+1}, with T_{b+1} = t_start + rp_{b+1}; and that of bp_c is T_c, with T_c = t_start + rp_c. Then, before the write module actually executes the I/O operation, the usage information of the relevant hardware (CPU, memory, etc.) of the node as a whole and of the relevant specific programs on the local node is periodically detected and used as feedback reference values, from which the corresponding value K_{i+x} is calculated and evaluated. If this value exceeds the preset standard K_p, the delay operation is abandoned and the data write is executed immediately (the time is recorded as t_now), to prevent long-term occupation of shared resources from affecting the normal operation of the related programs. If the value K_{i+x} always stays below the preset threshold K_p, i.e., the related programs are not affected by the delay operation, the node continues to wait until the calculated delayed write time T_b and then performs the delayed write operation.
The large-scale parallel-system checkpoint data writing method that senses the real-time state of the local node and relieves the I/O bottleneck through random delay comprises the following processing steps:
Step one: determine the predetermined delayed write time;
Step 11: the following describes, with reference to FIG. 3, the determination process of the delayed write time T_b corresponding to any compute-run node bp_b in the checkpoint software;
First, the write time section, denoted [t_start, t_end], should be determined, i.e., the section formed by the earliest time t_start and the latest time t_end at which a random delayed write operation may be performed. The time at which the write module receives the request is generally taken as the zero point of the time section, i.e., the starting point (left end point) t_start, meaning that some write requests can be executed immediately without any delay operation.
At the same time, the length of the time section [t_start, t_end] should be less than the interval between two adjacent checkpoint operations, i.e., t_end - t_start < T_period, with t_ck < t_start < t_end < t_{ck+1} and T_period = t_{ck+1} - t_ck, to ensure that the current checkpoint save operation has completed before the next checkpoint operation is performed at time t_{ck+1}. On the other hand, before the write module actually performs the write operation, the checkpoint data is temporarily saved in the memory of bp_b; it therefore occupies part of the system resources of bp_b and may affect the performance of the running application to some extent.
Step two: monitoring and feeding back the use efficiency of the local node in real time;
(A) CPU and memory usage information, both for the node as a whole and for the current program (or other relevant programs), is obtained through the various facilities provided by the operating system, such as system calls and terminal commands, and recorded as feedback information. The feedback information includes the current CPU utilization U_cpu, the total memory C_M, the used memory U_M, the remaining memory U_min, the virtual memory swap area size U_vmu, the virtual memory swap buffer size U_vmc, and so on. This collection is periodic and continues until the write operation is actually performed.
(B) The operation of the feedback mechanism is described with reference to FIG. 3. First, the collected utilization information is processed according to judgment rules to obtain the evaluation parameter K_i. The specific rules may vary: for example, whether the remaining memory is below a preset value, whether the CPU utilization is above a preset value, or a formula taking several factors as parameters. Each factor is normalized with its own coefficient a_i, and the resulting K_i serves as the criterion for judging whether the checkpoint data temporarily held in memory currently affects the performance of the running program. Interference in the feedback data can also be filtered out: for example, only when the feedback value exceeds the preset threshold K_p several times in succession, or exceeds it significantly (e.g. K_i > 2K_p), is the current node judged to be in a resource-shortage state, in which case the write operation must be executed immediately to release the currently occupied shared resources. If the judgment indicates that the running program is indeed affected, the node does not wait for the predetermined delayed write time t_rand but performs the write operation immediately, at a time denoted t_now.
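As a concrete, hypothetical instance of the judgment rules just described, the sketch below normalizes CPU and memory utilization with coefficients a_i into a single parameter K_i and applies the two escape conditions the text mentions: several threshold crossings in a row, or one value that significantly exceeds the threshold. All names, the weighting, and the defaults are illustrative assumptions, since the patent deliberately leaves the exact rule open.

```python
def evaluation_parameter(u_cpu: float, mem_used: float, mem_total: float,
                         weights=(0.5, 0.5)) -> float:
    """One plausible K_i: a weighted sum of CPU utilization (0..1) and
    memory pressure; the weights play the role of the a_i coefficients."""
    a_cpu, a_mem = weights
    return a_cpu * u_cpu + a_mem * (mem_used / mem_total)

def must_write_now(history, k_p: float, repeats: int = 3,
                   spike: float = 2.0) -> bool:
    """Resource-shortage test: K_i exceeded K_p `repeats` times in a row,
    or the latest K_i significantly exceeds K_p (e.g. by a factor of 2)."""
    recent = history[-repeats:]
    sustained = len(recent) == repeats and all(k > k_p for k in recent)
    spiked = bool(history) and history[-1] > spike * k_p
    return sustained or spiked
```

Requiring repeated or large exceedances, rather than reacting to a single reading, is what filters out the transient "interference data" the text refers to.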
Step three: delaying writing;
(A) Each node bp_b independently and periodically performs the information-collection operation IM and the judgment operation to obtain the parameter K_i. If the running program is not affected, the node continues to wait until the preset delayed write time t_rand is reached, after which the write operation is performed.
Example 1
As shown in fig. 3, assume the total bandwidth of the I/O subsystem is 100 GB/s and that 16000 nodes perform the checkpoint file write operation in total. If each node writes 10 MB of checkpoint data, then in an ideal environment without I/O conflicts, completing the write within 1 s requires 160 GB/s; if the write is delayed over 5 s, an average of only 32 GB/s needs to be written; if delayed over 10 s, only 16 GB needs to be written per second, bringing the bandwidth demand below the total bandwidth of the system. The actual write time will be longer than the theoretical time, because a large number of simultaneous writes cause conflicts that reduce I/O efficiency; the degree of conflict decreases as the delay increases.
A delay time of 0 in fig. 4 represents the case in which the random delayed write method is not used. As the delay time first increases, the total write time trends downward, because reduced contention improves I/O efficiency. When the delayed write time closely approaches or even exceeds the original write time, I/O efficiency continues to improve, but the total write time no longer benefits; instead it becomes approximately equal to the configured delay time.
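The bandwidth arithmetic of Example 1 can be checked directly. The helper below is illustrative only (decimal units, as in the text) and reproduces the 160 / 32 / 16 GB/s figures for delays of 1 s, 5 s and 10 s.

```python
def required_bandwidth_gbps(nodes: int, mb_per_node: float,
                            delay_s: float) -> float:
    """Average bandwidth needed to write all checkpoint data within delay_s."""
    total_gb = nodes * mb_per_node / 1000  # 16000 x 10 MB = 160 GB in total
    return total_gb / delay_s

print(required_bandwidth_gbps(16000, 10, 1))   # 160.0 -> exceeds 100 GB/s
print(required_bandwidth_gbps(16000, 10, 5))   # 32.0
print(required_bandwidth_gbps(16000, 10, 10))  # 16.0 -> well under the total
```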

Claims (3)

1. A large-scale parallel system checkpoint data writing method for relieving the I/O bottleneck based on random delay, characterized by specifically performing the following steps:

step A, after the write module (20) finishes buffering the associated data information, obtaining the current time as the starting point t_open of the time section;

step B, each node in the compute node set BP = {bp_b, bp_{b+1}, …, bp_c} obtaining the end point t_close of the time section using the random delay checkpoint file processing method;

step C, after the write time section [t_open, t_close] has been determined, recording and computing an independent random value for each node in the compute node set BP = {bp_b, bp_{b+1}, …, bp_c};

step D, given the determined random value, each node in the compute node set BP = {bp_b, bp_{b+1}, …, bp_c} determining its relative time position within the time section [t_open, t_close];

step E, given the determined relative time position, each node in the compute node set BP = {bp_b, bp_{b+1}, …, bp_c} determining its preset delayed write time; the preset delayed write times being distributed uniformly, in time order, over the entire write time section [t_open, t_close], yielding a time axis;
step F, judging whether the current program running time has reached the preset delayed write time; if so, executing step J; if not, executing step G;

step G, periodically recording the feedback information of each node in the compute node set BP = {bp_b, bp_{b+1}, …, bp_c};
step H, obtaining the evaluation parameter K_i from the feedback information and comparing K_i with the preset threshold K_p; if K_i ≥ K_p, executing step J; if K_i < K_p, executing step I;

step I, when K_i < K_p is satisfied, indicating that the local running environment permits the delay to continue, and going to step F;

step J, writing the buffered associated data information to the external storage system (40), recording the current delayed checkpoint data write time as the feedback write time T^w_d, and ending the operation.
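Steps F through J of claim 1 form a simple polling loop. The sketch below is one possible rendering, not part of the claim: the function name, the polling interval, and the callback interface are assumptions. It waits out the preset delayed write time t_rand, breaking out early when the feedback parameter K_i reaches the threshold K_p.

```python
import time

def delayed_checkpoint_write(write_fn, t_rand, collect_feedback, k_p,
                             poll_s=0.1):
    """Sketch of claim 1, steps F-J: delay the write until t_rand has
    elapsed, but write early on resource shortage (K_i >= K_p)."""
    t0 = time.monotonic()
    while time.monotonic() - t0 < t_rand:    # step F: delay not yet over
        k_i = collect_feedback()             # steps G-H: feedback -> K_i
        if k_i >= k_p:                       # step H: shortage -> step J
            break
        time.sleep(poll_s)                   # step I: keep delaying
    start = time.monotonic()
    write_fn()                               # step J: flush buffered data
    return time.monotonic() - start          # feedback write time T^w_d
```

Returning the measured write duration is what feeds the rate estimation of claim 3 for the next checkpoint.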
2. The large-scale parallel system checkpoint data writing method for relieving the I/O bottleneck based on random delay according to claim 1, characterized in that: in the process of writing the checkpoint data of the large-scale parallel system, an execution module (10), a write module (20) and a recovery module (30) are used to relieve the impact on the I/O subsystem, and the checkpoint data is written to the external storage system (40) after the delayed write time has elapsed.
3. The large-scale parallel system checkpoint data writing method for relieving the I/O bottleneck based on random delay according to claim 1, characterized in that the specific steps of the random delay checkpoint file processing are:
step one, the first random delay calculation for the compute node bp_b;

when bp_b performs its first random delay calculation, it can refer to the preset checkpoint save period T_period, taking 1/3 or 1/2 of T_period as the length of the time section [t_open, t_close] for the first calculation, and hence as the corresponding end point (right endpoint) t_close, to ensure that the write save operation completes before the next checkpoint operation; the write time of bp_b's first calculation is recorded as T^w_1, the checkpoint file size of the first calculation as S_1, and the write rate of the first checkpoint file write as v_1; since no write rate exists for the initial checkpoint write process, the value v_1 is zero;
step two, the second random delay calculation for the compute node bp_b;

in its second calculation, bp_b obtains the write time T^w_1 of the first checkpoint file write; then, on the one hand, it uses S_1 and T^w_1 to calculate the write rate of the second checkpoint file write, denoted v_2, with v_2 = S_1 / T^w_1; on the other hand, it uses S_2 and v_2 to calculate the write time of the second checkpoint file write, denoted T^w_2, with T^w_2 = S_2 / v_2; the checkpoint file size of bp_b's second calculation is denoted S_2;
Step three, for the calculation operation node bpbThe third random delay calculation of (4);
the bp isbThe writing time of the second writing check point file is required to be acquired in the third calculation
Figure FDA00024066488500000216
Then on the one hand use
Figure FDA00024066488500000217
And
Figure FDA00024066488500000218
calculating the writing rate of the third time writing check point file, and recording as
Figure FDA00024066488500000219
And is
Figure FDA00024066488500000220
On the other hand use
Figure FDA00024066488500000221
And
Figure FDA00024066488500000222
calculating the writing time of the third time of writing the check point file, and recording as
Figure FDA00024066488500000223
And is
Figure FDA00024066488500000224
The bp isbThird time counterCompute time checkpoint file size as
Figure FDA00024066488500000225
Step four, for the calculation operation node bpbThe processing following the third random delay calculation is:
in the checkpoint writing process, the writing time of the previous checkpoint file needs to be acquired
Figure FDA0002406648850000031
Then on the one hand use
Figure FDA0002406648850000032
And
Figure FDA0002406648850000033
calculating the write rate of the current written checkpoint file and recording as
Figure FDA0002406648850000034
And is
Figure FDA0002406648850000035
On the other hand use
Figure FDA0002406648850000036
And
Figure FDA0002406648850000037
calculating the write time of the current time written into the checkpoint file
Figure FDA0002406648850000038
And is
Figure FDA0002406648850000039
Figure FDA00024066488500000310
Is the write time when the checkpoint file is written at the current time d;
Figure FDA00024066488500000311
is the write time to write the checkpoint file at the previous time d-1;
Figure FDA00024066488500000312
is the write rate of the checkpoint file written at the current time d;
Figure FDA00024066488500000313
is the write rate of the checkpoint file written at the previous time d-1;
Figure FDA00024066488500000314
is the size of the checkpoint file at the current time d;
Figure FDA00024066488500000315
is the size of the checkpoint file at the previous time d-1;
and ending the writing of the checkpoint file until the checkpoint file is finished after the user program exits or the checkpoint software receives a command to stop the checkpoint operation.
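The recurrence of claim 3 (v_d = S_{d-1} / T^w_{d-1}, then T^w_d = S_d / v_d) can be sketched as follows. One simplification is labeled explicitly: here each predicted time is fed back as the "previous write time", whereas in the claim T^w_{d-1} is the measured duration of the previous write; the function name is an assumption for illustration.

```python
def estimate_write_time(sizes, first_write_time):
    """Claim-3 recurrence: v_d = S_{d-1} / T^w_{d-1}, T^w_d = S_d / v_d
    (which collapses to T^w_d = S_d * T^w_{d-1} / S_{d-1}).

    sizes:            checkpoint file sizes S_1..S_n (consistent units).
    first_write_time: write time T^w_1 of the first checkpoint; on the
                      very first checkpoint no rate exists (v_1 = 0 in
                      the claim), so a measured time must seed the chain.
    Returns the list of write-time estimates T^w_1..T^w_n.
    """
    times = [first_write_time]
    for d in range(1, len(sizes)):
        v_d = sizes[d - 1] / times[d - 1]   # rate inferred from previous write
        times.append(sizes[d] / v_d)        # predicted current write time
    return times
```

The estimate is what lets each node size its write time section [t_open, t_close] so that the delayed write still finishes before the next checkpoint.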
CN201810188654.XA 2018-03-07 2018-03-07 Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay Expired - Fee Related CN108491159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810188654.XA CN108491159B (en) 2018-03-07 2018-03-07 Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay


Publications (2)

Publication Number Publication Date
CN108491159A CN108491159A (en) 2018-09-04
CN108491159B true CN108491159B (en) 2020-07-17

Family

ID=63338108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810188654.XA Expired - Fee Related CN108491159B (en) 2018-03-07 2018-03-07 Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay

Country Status (1)

Country Link
CN (1) CN108491159B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992420B (en) * 2019-04-08 2021-10-22 苏州浪潮智能科技有限公司 Parallel PCIE-SSD performance optimization method and system
CN110569201B (en) * 2019-08-23 2021-09-10 苏州浪潮智能科技有限公司 Method and device for reducing write latency under solid state disk GC
CN112817541A (en) * 2021-02-24 2021-05-18 深圳宏芯宇电子股份有限公司 Write bandwidth control method, memory storage device and memory controller
CN118244972A (en) * 2022-12-24 2024-06-25 华为技术有限公司 Data storage method, device and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915257A (en) * 2012-09-28 2013-02-06 曙光信息产业(北京)有限公司 TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method
CN103631815A (en) * 2012-08-27 2014-03-12 深圳市腾讯计算机系统有限公司 Method, device and system for obtaining check points in block synchronization parallel computing
US8880941B1 (en) * 2011-04-20 2014-11-04 Google Inc. Optimum checkpoint frequency
CN104798059A (en) * 2012-12-20 2015-07-22 英特尔公司 Multiple computer system processing write data outside of checkpointing
US9165012B2 (en) * 2009-10-02 2015-10-20 Symantec Corporation Periodic file system checkpoint manager
CN105573866A (en) * 2009-07-14 2016-05-11 起元技术有限责任公司 Method and system for fault tolerant batch processing




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210419

Address after: 100160, No. 4, building 12, No. 128, South Fourth Ring Road, Fengtai District, Beijing, China (1515-1516)

Patentee after: Kaixi (Beijing) Information Technology Co.,Ltd.

Address before: 100191 Haidian District, Xueyuan Road, No. 37,

Patentee before: BEIHANG University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200717