CN116414586A - Fault-tolerant processing method and device for application program and storage medium - Google Patents

Info

Publication number
CN116414586A
Authority
CN
China
Prior art keywords
parameter
abstract
abstract function
rerun
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111641381.8A
Other languages
Chinese (zh)
Inventor
冷静文
韩林
刁阳彬
过敏意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202111641381.8A
Publication of CN116414586A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The application relates to a fault-tolerant processing method and device for an application program, and a storage medium. The method comprises: running an application program that is used to complete a preset computing task; marking and tracking abstract functions and their interface parameters while the application program runs; when a computing error of the application program is detected, determining, from the abstract functions, the target abstract function in which the computing error occurred; judging, according to the interface parameters of the target abstract function, whether the target abstract function is idempotent; and, if the target abstract function is idempotent, rerunning it to repair the computing error. Embodiments of the application enable dynamic idempotency analysis at run time, support adjustable fault-tolerance granularity, and improve the generality and fault-tolerance performance of fault-tolerant processing.

Description

Fault-tolerant processing method and device for application program and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a fault tolerance processing method and apparatus for an application program, and a storage medium.
Background
As semiconductor process nodes continue to advance, the transistor density of processors (e.g., central processing units (CPUs), graphics processing units (GPUs), and neural-network processing units (NPUs)) continues to increase, while their reliability continues to decrease. For example, large computing clusters and data centers typically execute a large number of computing tasks on such processors, which occasionally experience transient errors during execution (e.g., wrong computation results, or crashes of the application performing the computing task).
In fields where computational accuracy is critical (such as scientific computing, financial computing, and database services), computing errors are usually repaired by a rerun mechanism, provided that an effective error-detection mechanism is available. The rerun mechanism assumes that the computing error is caused by an occasional transient error in a processor component, so that rerunning the erroneous computing task yields the correct result with high probability.
Current rerun mechanisms are typically based either on checkpoints or on idempotent regions determined at compile time. The checkpoint-based mechanism applies to most computing scenarios, but establishing a checkpoint and restoring a computing task from it usually involve a large number of memory copy operations, so performance is poor. The mechanism based on compile-time idempotent regions depends on the compiler's idempotency analysis of the application program and cannot be adjusted at run time, which is a significant limitation.
Disclosure of Invention
In view of this, a fault-tolerant processing method, device and storage medium for application programs are provided.
In a first aspect, embodiments of the present application provide an application fault tolerance processing method, where the method includes: running an application program, wherein the application program is used for completing a preset calculation task; in the running process of the application program, marking and tracking an abstract function and interface parameters of the abstract function, wherein the abstract function is used for indicating a function module with a preset level in the application program, the interface parameters of the abstract function comprise input parameters and output parameters, the input parameters are memory areas for storing input data of the function module, and the output parameters are memory areas for storing output data of the function module; under the condition that the computing error of the application program is detected, determining a target abstract function with the computing error from the abstract functions; judging whether the target abstract function has idempotency or not according to the interface parameters of the target abstract function; and re-running the target abstract function to repair the computing error under the condition that the target abstract function has idempotency.
According to this embodiment, abstract functions and their interface parameters can be marked and tracked at a preset level while the application program runs. When a computing error is detected, the target abstract function in which the error occurred is determined, its idempotency is judged from its interface parameters, and, if it is idempotent, it is rerun to repair the computing error. Dynamic idempotency analysis can thus be performed at run time without relying on a specially designed compiler or hardware, and the fault-tolerance granularity is adjustable, which improves the generality and fault-tolerance performance of the application fault-tolerant processing method.
In a first possible implementation manner of the application fault tolerance processing method according to the first aspect, the interface parameter has a version number, the version number is used to indicate a data storage state of the interface parameter, and the marking and tracking the abstract function and the interface parameter of the abstract function includes: during the running process of the application program, identifying the functional modules in the application program according to the preset level, and marking the identified functional modules as abstract functions; determining interface parameters of the abstract function and version numbers of the interface parameters; and updating a parameter global version table and a parameter local version table according to the interface parameters of the abstract functions and the version numbers of the interface parameters, wherein the parameter global version table is used for recording the latest version numbers of the interface parameters of all abstract functions of the application program, and the parameter local version table is used for recording the version numbers of the interface parameters when all abstract functions of the application program run.
According to this embodiment, functional modules in the application program can be identified at the preset level while the application program runs and marked as abstract functions. The interface parameters of each abstract function and their version numbers are then determined, and the parameter global version table and the parameter local version table are updated accordingly. In this way, abstract functions and their interface parameters are marked at run time, and the version changes (i.e., the read-write history) of the memory areas indicated by the interface parameters (input parameters and output parameters) are tracked through simple data structures (the parameter global version table PGVMap and the parameter local version table PLVMap). Marking and tracking therefore have low complexity and low overhead, which improves processing efficiency.
In a second possible implementation manner of the fault-tolerant processing method of an application according to the first possible implementation manner of the first aspect, updating the global version table and the local version table of the parameter according to the interface parameter of the abstract function and the version number of the interface parameter includes: and adding the identifier of the abstract function, the identifier and the version number of the interface parameter of the abstract function into a parameter local version table.
According to the embodiment of the application, when the parameter local version table is updated, the identifier of the abstract function and the identifier and version number of the interface parameter of the abstract function can be added into the parameter local version table, so that dynamic updating and maintenance of the parameter local version table can be realized, and the accuracy of the parameter local version table is improved.
In a third possible implementation manner of the application fault tolerance processing method according to the first possible implementation manner of the first aspect or the second possible implementation manner of the first aspect, updating the parameter global version table and the parameter local version table according to the interface parameter of the abstract function and the version number of the interface parameter includes: updating the version number of the interface parameter of the abstract function in the parameter global version table under the condition that the interface parameter of the abstract function exists in the parameter global version table; or under the condition that the interface parameters of the abstract function do not exist in the parameter global version table, adding the identifier and the version number of the interface parameters of the abstract function into the parameter global version table.
In the embodiment of the application, when the parameter global version table is updated, under the condition that the interface parameters of the abstract function exist in the parameter global version table, the version numbers of the interface parameters of the abstract function in the parameter global version table are updated; under the condition that the interface parameters of the abstract function do not exist in the parameter global version table, the identifiers and version numbers of the interface parameters of the abstract function are added into the parameter global version table, so that the dynamic updating and maintenance of the parameter global version table can be realized, and the accuracy of the parameter global version table is improved.
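The table updates described above can be sketched roughly as follows. This is a minimal Python illustration, not the patent's implementation: the class name `VersionTables`, the method `record_call`, the dictionary layout, and the rule that a read keeps the current version while a write increments it are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class VersionTables:
    # PGVMap (assumed layout): parameter id -> latest version number.
    pgv: dict = field(default_factory=dict)
    # PLVMap (assumed layout): one record per abstract-function run.
    plv: list = field(default_factory=list)

    def record_call(self, func_id, inputs, outputs):
        """Record one run of an abstract function in both tables."""
        entry = {"func": func_id, "in": {}, "out": {}}
        for p in inputs:
            # An input is read at the current version; unseen parameters
            # are added to the global table at version 0 (assumption).
            entry["in"][p] = self.pgv.setdefault(p, 0)
        for p in outputs:
            # A write creates a new version of the memory area (assumption).
            self.pgv[p] = self.pgv.get(p, 0) + 1
            entry["out"][p] = self.pgv[p]
        self.plv.append(entry)
        return entry


tables = VersionTables()
tables.record_call("f1", inputs=["a"], outputs=["b"])
tables.record_call("f2", inputs=["b"], outputs=["c"])
tables.record_call("f3", inputs=["c"], outputs=["b"])
```

In this sketch `b` ends at version 2 in the global table, while the local table still records that `f2` read it at version 1.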
In a fourth possible implementation manner of the fault tolerant processing method for an application according to any one of the first possible implementation manner to the third possible implementation manner of the first aspect, updating a global version table and a local version table of parameters according to the interface parameters of the abstract function and the version numbers of the interface parameters includes: updating the version number of the interface parameter in the parameter global version table under the condition that the interface parameter is modified by other operations outside the abstract function of the application program; and marking the interface parameters with version numbers lower than the version numbers in the parameter global version table in the parameter local version table as invalid.
According to the embodiment of the application, under the condition that the interface parameters of the abstract function are modified by other operations outside the abstract function of the application program, the version number of the interface parameters in the parameter global version table is updated, and the interface parameters with the version numbers lower than the version numbers in the parameter global version table are marked as invalid in the parameter local version table, so that the modification of the interface parameters of the abstract function by other operations outside the abstract function can be tracked, and the accuracy of the parameter global version table and the parameter local version table is improved.
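A hedged sketch of this invalidation step is shown below. The function name `note_external_write` and the table layout (plain dictionaries, with an `invalid` set on each local record) are hypothetical. The sketch also assumes that only the externally modified parameter is checked for invalidation, which is one reading of the text.

```python
def note_external_write(pgv, plv, param):
    """Record a write to `param` made outside any abstract function."""
    # Assumption: an external write creates a new version, like any write.
    pgv[param] = pgv.get(param, 0) + 1
    for entry in plv:
        for role in ("in", "out"):
            version = entry[role].get(param)
            # A local record that saw an older version of this parameter
            # is marked invalid, since no tracked function can recreate it.
            if version is not None and version < pgv[param]:
                entry.setdefault("invalid", set()).add(param)


pgv = {"a": 0, "b": 1}
plv = [{"func": "f1", "in": {"a": 0}, "out": {"b": 1}}]
note_external_write(pgv, plv, "a")
```

After the call, `a` is at version 1 globally and the record for `f1` carries `a` in its `invalid` set, so a later rerun set search that includes `f1` can detect the problem.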
In a fifth possible implementation manner of the application fault-tolerant processing method according to any one of the first possible implementation manner to the fourth possible implementation manner of the first aspect, judging whether the target abstract function is idempotent according to the interface parameters of the target abstract function includes: judging whether any identifier is repeated among the identifiers of the interface parameters of the target abstract function in the parameter local version table; and, if no identifier is repeated, determining that the target abstract function is idempotent.
According to this embodiment, idempotency is judged by checking whether any identifier is repeated among the identifiers of the interface parameters of the target abstract function in the parameter local version table; if none is repeated, the target abstract function is determined to be idempotent. Run-time idempotency analysis is thus based on the parameter local version table, which is simple and fast and therefore improves processing efficiency.
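The check above amounts to asking whether any memory area is both an input and an output of the same abstract function. A minimal sketch, assuming each local-table record stores its input and output identifiers in two dictionaries (the record layout and the name `is_idempotent` are illustrative only):

```python
def is_idempotent(entry):
    """Return True if no parameter identifier is both read and written."""
    return not (set(entry["in"]) & set(entry["out"]))


# Reads `a`, writes `b`: rerunning it sees the same input again.
copy_like = {"func": "f1", "in": {"a": 0}, "out": {"b": 1}}
# Reads and writes `b`: the first run overwrote its own input.
in_place = {"func": "f2", "in": {"b": 1}, "out": {"b": 2}}
```

Here `is_idempotent(copy_like)` holds, while `is_idempotent(in_place)` does not, so only the first function could be rerun directly.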
In a sixth possible implementation manner of the application fault-tolerant processing method according to any one of the first possible implementation manner to the fifth possible implementation manner of the first aspect, the method further includes: if the target abstract function is not idempotent, searching the abstract functions for a rerun set, the rerun set indicating the minimal set of abstract functions that must be rerun to repair the computing error; and, if the rerun set search succeeds, running the rerun set to repair the computing error.
According to the embodiment of the application, under the condition that the target abstract function does not have idempotent property, the rerun set is searched in the abstract function, and under the condition that the rerun set is successfully searched, the rerun set is run to repair the computing error, so that the computing error can be repaired through the rerun set comprising a plurality of abstract functions.
In a seventh possible implementation manner of the fault tolerant processing method of the application according to the sixth possible implementation manner of the first aspect, the searching the rerun set in the abstract function includes: searching a rerun set in the abstract function according to the parameter global version table and the parameter local version table.
According to the embodiment of the application, the rerun set can be searched from the abstract function according to the parameter local version table and the parameter global version table. Because the parameter local version table and the parameter global version table are simple data structures, the rerun set is searched from the abstract function according to the parameter local version table and the parameter global version table, the method is simple and quick, and the searching efficiency of the rerun set can be improved.
In an eighth possible implementation manner of the application fault-tolerant processing method according to the seventh possible implementation manner of the first aspect, searching the abstract functions for the rerun set according to the parameter global version table and the parameter local version table includes: adding the target abstract function to the rerun set; judging whether, in the parameter local version table, any interface parameter of the abstract functions in the rerun set is marked as invalid (a first parameter); if no first parameter exists, judging whether any input parameter of the abstract functions in the rerun set has a version number lower than its version number in the parameter global version table (a second parameter); if a second parameter exists, judging, according to the parameter local version table, whether the abstract functions include a first abstract function whose output parameters include the second parameter; if the first abstract function exists, judging whether the rerun set includes it; if the rerun set does not include the first abstract function, adding it to the rerun set; and repeating from the step of judging whether a first parameter marked as invalid exists among the interface parameters of the abstract functions in the rerun set, until the rerun set includes the first abstract function.
The search process in this embodiment may be referred to as a forward search. The forward search determines which other abstract functions must be run first to recreate the original input (i.e., the correct input data) of the target abstract function so that the target abstract function can be rerun.
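The forward search can be sketched as a small fixed-point loop. The code below is an illustration under stated assumptions rather than the claimed procedure: the name `forward_search`, the dictionary-based tables, unique function identifiers, and matching a producer by exact `(parameter, version)` are all choices made for the example, and the backward search of the ninth implementation manner is not included.

```python
def forward_search(target, plv, pgv):
    """Return the ids of functions to rerun, or None if the search fails."""
    by_id = {e["func"]: e for e in plv}  # assumes each id ran once
    rerun = [target]
    changed = True
    while changed:
        changed = False
        entries = [by_id[f] for f in rerun]
        # "First parameter": something was modified outside any function.
        if any(e.get("invalid") for e in entries):
            return None
        for e in entries:
            for p, v in e["in"].items():
                if v >= pgv[p]:
                    continue  # the input still holds the version that was read
                # "Second parameter": find the function that produced it.
                producer = next(
                    (x["func"] for x in plv if x["out"].get(p) == v), None
                )
                if producer is None:
                    return None  # no recorded function can recreate the input
                if producer not in rerun:
                    rerun.append(producer)
                    changed = True
    return rerun


plv = [
    {"func": "f1", "in": {"a": 0}, "out": {"b": 1}},
    {"func": "f2", "in": {"b": 1}, "out": {"b": 2}},  # updates b in place
]
pgv = {"a": 0, "b": 2}
result = forward_search("f2", plv, pgv)
```

In this example `f2` overwrote its own input `b`, so the search adds `f1`, which can regenerate version 1 of `b` from the still-current `a`.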
In a ninth possible implementation manner of the application fault-tolerant processing method according to the eighth possible implementation manner of the first aspect, searching the abstract functions for the rerun set according to the parameter global version table and the parameter local version table includes: if the second parameter does not exist or the rerun set includes the first abstract function, judging whether any output parameter of the abstract functions in the rerun set has a version number lower than its version number in the parameter global version table (a third parameter); if a third parameter exists, judging, according to the parameter local version table, whether the abstract functions include a second abstract function whose input parameters include the third parameter; if the second abstract function exists, judging whether the rerun set includes it; if the rerun set does not include the second abstract function, adding it to the rerun set; and repeating from the step of judging whether a first parameter marked as invalid exists among the interface parameters of the abstract functions in the rerun set, until the rerun set includes the second abstract function.
The search process in this embodiment may be referred to as a backward search. The backward search determines which abstract functions must be rerun so that the current memory context is not destroyed; that is, it finds the abstract functions that restore the memory context to the state recorded in the current parameter local version table PLVMap.
In a tenth possible implementation manner of the fault tolerant processing method of the application according to the ninth possible implementation manner of the first aspect, the searching the rerun set in the abstract function according to the parameter global version table and the parameter local version table includes: the rerun set search is successful in the absence of the third parameter or the rerun set includes the second abstract function.
According to this embodiment, the rerun set search is determined to be successful when the third parameter does not exist or the rerun set includes the second abstract function. This is simple and fast and improves the search efficiency of the rerun set.
In an eleventh possible implementation manner of the application fault-tolerant processing method according to the ninth possible implementation manner of the first aspect, searching the abstract functions for the rerun set according to the parameter global version table and the parameter local version table includes any one of the following: if the first parameter exists, the rerun set search fails; if the first abstract function does not exist among the abstract functions, the rerun set search fails; if the second abstract function does not exist among the abstract functions, the rerun set search fails.
According to this embodiment, the rerun set search is determined to have failed when the first parameter exists, or when the first abstract function or the second abstract function does not exist among the abstract functions. This is simple and fast and improves the search efficiency of the rerun set.
In a twelfth possible implementation manner of the application fault-tolerant processing method according to the sixth possible implementation manner of the first aspect, running the rerun set includes: generating a local control flow graph corresponding to the rerun set according to the parameter local version table; and running the rerun set according to the local control flow graph.
According to this embodiment, a local control flow graph corresponding to the rerun set can be generated from the parameter local version table, and the rerun set is run according to that graph. The local control flow graph is thus generated dynamically from the parameter local version table while the application program runs and does not depend on compile-time analysis. This is simple and fast, speeds up generation of the local control flow graph, and improves the efficiency of running the rerun set.
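The text does not spell out the structure of the local control flow graph. One plausible reading, sketched below purely as an assumption, is a producer-to-consumer dependency graph built from the versions recorded in the local table, which is then executed in topological order. The name `rerun_order` and the record layout are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter


def rerun_order(rerun, plv):
    """Order the rerun set so that producers run before their consumers."""
    entries = [e for e in plv if e["func"] in rerun]
    # Map each function to the set of functions it depends on.
    graph = {e["func"]: set() for e in entries}
    for consumer in entries:
        for p, v in consumer["in"].items():
            for producer in entries:
                # Edge when `producer` wrote exactly the version `consumer` read.
                if producer is not consumer and producer["out"].get(p) == v:
                    graph[consumer["func"]].add(producer["func"])
    return list(TopologicalSorter(graph).static_order())


plv = [
    {"func": "f1", "in": {"a": 0}, "out": {"b": 1}},
    {"func": "f2", "in": {"b": 1}, "out": {"b": 2}},
]
order = rerun_order(["f2", "f1"], plv)
```

For this rerun set the resulting order is `f1` followed by `f2`, so the input of `f2` is recreated before `f2` itself is rerun.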
In a thirteenth possible implementation manner of the application fault tolerant processing method according to any one of the sixth possible implementation manner to the twelfth possible implementation manner of the first aspect, the method further includes: in the running process of the application program, at least one check point is established according to a preset check point establishing rule; under the condition that the rerun set search fails, determining a target check point nearest to the target abstract function from the at least one check point; starting from the target checkpoint, re-running the application to fix the computing error.
According to this embodiment, at least one checkpoint can be established according to a preset checkpoint establishment rule while the application program runs. When the rerun set search fails, the target checkpoint nearest to the target abstract function is determined from the at least one checkpoint, and the application program is rerun from the target checkpoint to repair the computing error.
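Selecting the target checkpoint can be sketched as a simple lookup. The example assumes that checkpoints and abstract-function runs are identified by their position in the execution order, and that "nearest" means the latest checkpoint that does not come after the target abstract function, since execution can only resume from an earlier state. The name `nearest_checkpoint` is hypothetical.

```python
from bisect import bisect_right


def nearest_checkpoint(checkpoints, target_pos):
    """Return the latest checkpoint position <= target_pos, or None.

    `checkpoints` must be sorted in ascending order.
    """
    i = bisect_right(checkpoints, target_pos)
    return checkpoints[i - 1] if i else None


# Checkpoints taken before the 0th, 10th and 20th abstract-function runs.
checkpoints = [0, 10, 20]
restart_from = nearest_checkpoint(checkpoints, 14)
```

With these values the application would restart from the checkpoint at position 10 and replay the runs up to and including position 14.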
In a fourteenth possible implementation form of the application fault tolerant processing method according to the first aspect as such or any one of the first possible implementation form of the first aspect to the thirteenth possible implementation form of the first aspect, the application is run on a heterogeneous computing platform comprising a processor and an accelerator, the marking and tracking abstract functions and interface parameters of the abstract functions comprises: during the running of the application, the kernel function that the processor transmits to the accelerator is marked as an abstract function.
According to this embodiment, when the application program runs on a heterogeneous computing platform that includes a processor and an accelerator, the kernel functions that the processor transmits to the accelerator are marked as abstract functions while the application program runs. The marking of abstract functions is thus closely tied to the platform on which the application program runs, which makes it convenient to wrap the memory management interfaces of the heterogeneous computing platform in order to mark and track abstract functions and their interface parameters at run time.
In a second aspect, embodiments of the present application provide an application fault tolerant processing apparatus, the apparatus including: the first operation module is used for operating an application program, and the application program is used for completing a preset calculation task; the marking and tracking module is used for marking and tracking an abstract function and interface parameters of the abstract function in the running process of the application program, wherein the abstract function is used for indicating a function module with a preset level in the application program, the interface parameters of the abstract function comprise input parameters and output parameters, the input parameters are memory areas for storing input data of the function module, and the output parameters are memory areas for storing output data of the function module; the target abstract function determining module is used for determining a target abstract function with the calculation error from the abstract functions under the condition that the calculation error of the application program is detected; the idempotent judging module is used for judging whether the target abstract function has idempotent according to the interface parameters of the target abstract function; and the second operation module is used for rerun the target abstract function to repair the calculation error under the condition that the target abstract function has idempotency.
According to this embodiment, abstract functions and their interface parameters can be marked and tracked at a preset level while the application program runs. When a computing error is detected, the target abstract function in which the error occurred is determined, its idempotency is judged from its interface parameters, and, if it is idempotent, it is rerun to repair the computing error. Dynamic idempotency analysis can thus be performed at run time without relying on a specially designed compiler or hardware, and the fault-tolerance granularity is adjustable, which improves the generality and fault-tolerance performance of the application fault-tolerant processing method.
In a first possible implementation manner of the application fault tolerant processing device according to the second aspect, the interface parameter has a version number, the version number is used to indicate a data storage state of the interface parameter, and the marking and tracking module is configured to: during the running process of the application program, identifying the functional modules in the application program according to the preset level, and marking the identified functional modules as abstract functions; determining interface parameters of the abstract function and version numbers of the interface parameters; and updating a parameter global version table and a parameter local version table according to the interface parameters of the abstract functions and the version numbers of the interface parameters, wherein the parameter global version table is used for recording the latest version numbers of the interface parameters of all abstract functions of the application program, and the parameter local version table is used for recording the version numbers of the interface parameters when all abstract functions of the application program run.
According to the embodiments of the present application, while the application program is running, the functional modules in the application program can be identified according to the preset level and marked as abstract functions; the interface parameters of each abstract function and their version numbers are then determined, and the parameter global version table and the parameter local version table are updated accordingly. Abstract functions and their interface parameters can thus be marked at run time, and the version changes (that is, the read-write history) of the memory areas indicated by the interface parameters (input parameters and output parameters) of the abstract functions are tracked through simple data structures (the parameter global version table PGVMap and the parameter local version table PLVMap). Marking and tracking in this way has low complexity and low overhead, which improves processing efficiency.
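The two tables can be pictured with a minimal sketch. The class name `VersionTracker`, the record layout, and the rule that every write to an output parameter advances its version by one are assumptions made for illustration; the text above only states that the global table holds the latest version of each parameter and the local table holds the versions seen by each abstract function when it ran.

```python
from dataclasses import dataclass, field


@dataclass
class VersionTracker:
    # PGVMap: parameter identifier -> latest version number.
    pgv: dict = field(default_factory=dict)
    # PLVMap: one record per abstract-function run, in execution order.
    plv: list = field(default_factory=list)

    def record_call(self, func_id, inputs, outputs):
        """Record one run of an abstract function (hypothetical helper)."""
        # Inputs are read at their current version; unseen parameters start at 0.
        in_versions = {p: self.pgv.setdefault(p, 0) for p in inputs}
        # Assumption: each write to an output parameter bumps its version by one.
        out_versions = {}
        for p in outputs:
            self.pgv[p] = self.pgv.get(p, 0) + 1
            out_versions[p] = self.pgv[p]
        self.plv.append(
            {"func": func_id, "in": in_versions, "out": out_versions, "invalid": set()}
        )
```

A parameter that appears in both `inputs` and `outputs` (an in-place update) is recorded with the version it had when read and the new version it has after the write.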
In a second possible implementation manner of the application fault tolerant processing device according to the first possible implementation manner of the second aspect, updating the global version table and the local version table of the parameters according to the interface parameters of the abstract function and the version numbers of the interface parameters includes: and adding the identifier of the abstract function, the identifier and the version number of the interface parameter of the abstract function into a parameter local version table.
According to the embodiment of the application, when the parameter local version table is updated, the identifier of the abstract function and the identifier and version number of the interface parameter of the abstract function can be added into the parameter local version table, so that dynamic updating and maintenance of the parameter local version table can be realized, and the accuracy of the parameter local version table is improved.
In a third possible implementation manner of the application fault tolerant processing device according to the first possible implementation manner of the second aspect or the second possible implementation manner of the second aspect, updating the parameter global version table and the parameter local version table according to the interface parameter of the abstract function and the version number of the interface parameter includes: updating the version number of the interface parameter of the abstract function in the parameter global version table under the condition that the interface parameter of the abstract function exists in the parameter global version table; or under the condition that the interface parameters of the abstract function do not exist in the parameter global version table, adding the identifier and the version number of the interface parameters of the abstract function into the parameter global version table.
In the embodiment of the application, when the parameter global version table is updated, under the condition that the interface parameters of the abstract function exist in the parameter global version table, the version numbers of the interface parameters of the abstract function in the parameter global version table are updated; under the condition that the interface parameters of the abstract function do not exist in the parameter global version table, the identifiers and version numbers of the interface parameters of the abstract function are added into the parameter global version table, so that the dynamic updating and maintenance of the parameter global version table can be realized, and the accuracy of the parameter global version table is improved.
In a fourth possible implementation manner of the application fault tolerant processing device according to any one of the first possible implementation manner to the third possible implementation manner of the second aspect, the updating the parameter global version table and the parameter local version table according to the interface parameter of the abstract function and the version number of the interface parameter includes: updating the version number of the interface parameter in the parameter global version table under the condition that the interface parameter is modified by other operations outside the abstract function of the application program; and marking the interface parameters with version numbers lower than the version numbers in the parameter global version table in the parameter local version table as invalid.
According to the embodiment of the application, under the condition that the interface parameters of the abstract function are modified by other operations outside the abstract function of the application program, the version number of the interface parameters in the parameter global version table is updated, and the interface parameters with the version numbers lower than the version numbers in the parameter global version table are marked as invalid in the parameter local version table, so that the modification of the interface parameters of the abstract function by other operations outside the abstract function can be tracked, and the accuracy of the parameter global version table and the parameter local version table is improved.
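As a hedged illustration of this update, the sketch below assumes the global table is a dict from parameter identifier to latest version, and the local table is a list of records of the form `{"func", "in", "out", "invalid"}`, where `in` and `out` map parameter identifiers to the versions recorded at run time. The function name `record_external_write` and this layout are illustrative, not taken from the application.

```python
def record_external_write(pgv, plv, param):
    """Track a write to `param` made outside any abstract function."""
    # The external write creates a new data storage state for the parameter.
    pgv[param] = pgv.get(param, 0) + 1
    # Every local entry that still refers to an older version is now invalid,
    # because no abstract function can recreate the externally written data.
    for record in plv:
        for versions in (record["in"], record["out"]):
            if param in versions and versions[param] < pgv[param]:
                record["invalid"].add(param)
```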
In a fifth possible implementation manner of the application fault tolerant processing device according to any one of the first possible implementation manner to the fourth possible implementation manner of the second aspect, the idempotent judging module is configured to: judging whether the same identifier exists in the identifiers of the interface parameters of the target abstract function in the parameter local version table; in the absence of the same identifier, the target abstract function is determined to have idempotency.
According to the embodiments of the present application, it is judged whether the same identifier appears more than once among the identifiers of the interface parameters of the target abstract function in the parameter local version table, and the target abstract function is determined to be idempotent when no identifier is repeated. Run-time idempotency analysis is thereby performed directly on the parameter local version table, which is simple and fast and therefore improves processing efficiency.
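One way to read this check is sketched below. It assumes a local-table record stores input and output parameters separately (keys `in` and `out`, both hypothetical), so that a repeated identifier means the same memory area is both read and written by the function.

```python
def is_idempotent(record):
    """Return True if no parameter identifier is both an input and an output."""
    # A shared identifier means the function overwrites one of its own inputs,
    # so a rerun in the current memory context would see different input data.
    return not (set(record["in"]) & set(record["out"]))
```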
In a sixth possible implementation manner of the application fault tolerant processing device according to any one of the first possible implementation manner to the fifth possible implementation manner of the second aspect, the device further includes: a rerun set search module configured to search, among the abstract functions, for a rerun set when the target abstract function is not idempotent, the rerun set indicating the minimum set of abstract functions that must be rerun to repair the computing error; and a third running module configured to run the rerun set to repair the computing error when the rerun set search succeeds.
According to the embodiments of the present application, when the target abstract function is not idempotent, a rerun set is searched for among the abstract functions, and when the search succeeds, the rerun set is run to repair the computing error, so that the computing error can be repaired through a rerun set comprising a plurality of abstract functions.
In a seventh possible implementation manner of the application fault tolerant processing apparatus according to the sixth possible implementation manner of the second aspect, the rerun collection search module is configured to: searching a rerun set in the abstract function according to the parameter global version table and the parameter local version table.
According to the embodiment of the application, the rerun set can be searched from the abstract function according to the parameter local version table and the parameter global version table. Because the parameter local version table and the parameter global version table are simple data structures, the rerun set is searched from the abstract function according to the parameter local version table and the parameter global version table, the method is simple and quick, and the searching efficiency of the rerun set can be improved.
In an eighth possible implementation manner of the fault tolerant processing apparatus according to the second aspect, the searching for a rerun set among the abstract functions includes: adding the target abstract function to the rerun set; judging whether, in the parameter local version table, any first parameter marked as invalid exists among the interface parameters of all abstract functions in the rerun set; when no first parameter exists, judging whether any second parameter whose version number is lower than the version number in the parameter global version table exists among the input parameters of all abstract functions in the rerun set; when a second parameter exists, judging, according to the parameter local version table, whether a first abstract function whose output parameters include the second parameter exists among the abstract functions; when the first abstract function exists among the abstract functions, judging whether the rerun set includes the first abstract function; when the rerun set does not include the first abstract function, adding the first abstract function to the rerun set; and re-executing from the step of judging whether any first parameter marked as invalid exists among the interface parameters of all abstract functions in the rerun set, until the rerun set includes the first abstract function.
The process of searching the rerun collection in embodiments of the present application may be referred to as a forward search, by which it may be determined which other abstract functions must be first run to recreate the original input (i.e., correct input data) of the target abstract function in order to rerun the target abstract function.
In a ninth possible implementation manner of the fault tolerant processing apparatus according to the second aspect, the searching for a rerun set among the abstract functions includes: when no second parameter exists or the rerun set includes the first abstract function, judging whether any third parameter whose version number is lower than the version number in the parameter global version table exists among the output parameters of all abstract functions in the rerun set; when a third parameter exists, judging, according to the parameter local version table, whether a second abstract function whose input parameters include the third parameter exists among the abstract functions; when the second abstract function exists among the abstract functions, judging whether the rerun set includes the second abstract function; when the rerun set does not include the second abstract function, adding the second abstract function to the rerun set; and re-executing from the step of judging whether any first parameter marked as invalid exists among the interface parameters of all abstract functions in the rerun set, until the rerun set includes the second abstract function.
In the embodiment of the present application, the process of searching the rerun set may be referred to as a backward search, and it may be determined by the backward search which abstract functions must be rerun in order not to destroy the current memory context, that is, the abstract functions that restore the memory context to the state recorded by the current parameter local version table PLVMap may be found by the backward search.
In a tenth possible implementation manner of the fault tolerant processing apparatus according to the second aspect, the searching the abstract function for a rerun set includes: the rerun set search is successful in the absence of the third parameter or the rerun set includes the second abstract function.
According to the embodiment of the application, under the condition that the third parameter does not exist or the rerun set comprises the second abstract function, the rerun set is determined to be successfully searched, the method and the device are simple and quick, and the searching efficiency of the rerun set can be improved.
According to a ninth possible implementation manner of the second aspect, in an eleventh possible implementation manner of the application fault tolerant processing device, the searching the rerun set in the abstract function according to the parameter global version table and the parameter local version table includes any one of the following: if the first parameter exists, the rerun set search fails; the rerun set search fails in the absence of the first abstract function in the abstract function; the rerun collection search fails in the absence of the second abstract function in the abstract function.
According to the embodiment of the application, under the condition that the first parameter exists or the condition that the first abstract function or the second abstract function does not exist in the abstract function, the search failure of the rerun set is determined, the method and the device are simple and quick, and the search efficiency of the rerun set can be improved.
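The forward search, backward search, and the success and failure conditions described above can be combined into one loop. The following is a simplified sketch under stated assumptions: `pgv` is a dict from parameter identifier to latest version; `plv` is a list of run records `{"func", "in", "out", "invalid"}` in which `in` and `out` map identifiers to recorded versions; abstract-function runs are identified by their index in `plv`; and each pass handles every stale parameter rather than one at a time. None of these names or layout details come from the application itself.

```python
def search_rerun_set(pgv, plv, target):
    """Return sorted record indices to rerun, or None if the search fails."""
    rerun = {target}
    while True:
        # Failure: some parameter was modified outside any abstract function.
        if any(plv[i]["invalid"] for i in rerun):
            return None
        added = False
        # Forward search: a stale input must be recreated by the run that
        # produced exactly that version of the parameter.
        for i in sorted(rerun):
            for param, version in plv[i]["in"].items():
                if version < pgv[param]:
                    producers = [j for j, r in enumerate(plv)
                                 if r["out"].get(param) == version]
                    if not producers:
                        return None  # no abstract function can recreate the input
                    if producers[0] not in rerun:
                        rerun.add(producers[0])
                        added = True
        if added:
            continue  # re-check the enlarged set from the invalid-parameter step
        # Backward search: a stale output must be brought forward again by a
        # run that consumed that version (the first match is used here).
        for i in sorted(rerun):
            for param, version in plv[i]["out"].items():
                if version < pgv[param]:
                    consumers = [j for j, r in enumerate(plv)
                                 if r["in"].get(param) == version]
                    if not consumers:
                        return None  # memory context cannot be restored
                    if consumers[0] not in rerun:
                        rerun.add(consumers[0])
                        added = True
        if not added:
            return sorted(rerun)
```

The loop terminates because the rerun set only grows and is bounded by the number of records in the local table.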
In a twelfth possible implementation manner of the application fault tolerant processing device according to the sixth possible implementation manner of the second aspect, the third running module is configured to: generating a local control flow graph corresponding to the rerun set according to the parameter local version table; and operating the rerun set according to the local control flow graph.
According to the embodiments of the present application, the local control flow graph corresponding to the rerun set can be generated according to the parameter local version table, and the rerun set is run according to the local control flow graph. The local control flow graph of the rerun set is thus generated dynamically from the parameter local version table while the application program is running and does not depend on compile-time analysis, which is simple and fast, speeds up generation of the local control flow graph, and improves the efficiency of running the rerun set.
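A minimal sketch of this step follows, assuming the same hypothetical record layout (`in` and `out` dicts mapping parameter identifiers to versions) and assuming that an edge runs from the record that produced a given parameter version to each record that consumed it. The rerun order is then a topological order of that graph, obtained here with the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def local_cfg(plv, rerun):
    """Map each record index in `rerun` to the indices it depends on."""
    deps = {i: set() for i in rerun}
    for i in rerun:
        for param, version in plv[i]["in"].items():
            for j in rerun:
                # j must run before i if j wrote the exact version that i read.
                if j != i and plv[j]["out"].get(param) == version:
                    deps[i].add(j)
    return deps


def rerun_order(plv, rerun):
    """Return the record indices in an order that respects all dependencies."""
    return list(TopologicalSorter(local_cfg(plv, rerun)).static_order())
```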
In a thirteenth possible implementation manner of the application fault tolerant processing device according to any one of the sixth possible implementation manner of the second aspect to the twelfth possible implementation manner of the second aspect, the device further includes: the checkpointing module is used for setting up at least one checkpoint according to preset checkpointing rules in the running process of the application program; the target check point determining module is used for determining a target check point closest to the target abstract function from the at least one check point under the condition that the rerun set search fails; and the fourth running module is used for starting from the target check point and re-running the application program so as to repair the computing error.
According to the embodiments of the present application, at least one checkpoint can be established according to the preset checkpoint establishment rule while the application program is running. When the rerun set search fails, the target checkpoint closest to the target abstract function is determined from the at least one checkpoint, and the application program is rerun from the target checkpoint, so that the calculation error can still be repaired when the rerun set search fails.
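Selecting the target checkpoint can be sketched as a lookup over checkpoint positions. The sketch assumes that each checkpoint is tagged with the position in the execution sequence at which it was taken, that the list is sorted in ascending order, and that "closest" means the latest checkpoint at or before the target abstract function; the text does not fix these details, and the function name is hypothetical.

```python
from bisect import bisect_right


def nearest_checkpoint(checkpoint_positions, target_position):
    """Return the latest checkpoint position <= target_position, or None."""
    # bisect_right gives the number of checkpoints taken at or before the target.
    count = bisect_right(checkpoint_positions, target_position)
    return checkpoint_positions[count - 1] if count else None
```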
In a fourteenth possible implementation form of the application fault tolerant processing device according to the second aspect as such or any of the first possible implementation form of the second aspect to the thirteenth possible implementation form of the second aspect, the application is run on a heterogeneous computing platform comprising a processor and an accelerator, the marking and tracking module is configured to: during the running of the application, the kernel function that the processor transmits to the accelerator is marked as an abstract function.
According to the embodiments of the present application, when the application program runs on a heterogeneous computing platform comprising a processor and an accelerator, the kernel functions that the processor transmits to the accelerator are marked as abstract functions while the application program is running. The marking of abstract functions is thereby tightly coupled to the platform on which the application program runs, and the memory management interfaces of the heterogeneous computing platform (comprising the processor and the accelerator) can conveniently be wrapped, so that abstract functions and their interface parameters can be marked and tracked at run time.
In a third aspect, embodiments of the present application provide an application fault tolerance processing apparatus, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described first aspect or one or more of the possible implementations of the first aspect when executing the instructions.
According to the embodiments of the present application, abstract functions and their interface parameters can be marked and tracked at a preset level while the application program is running. When a calculation error is detected, the target abstract function in which the error occurred is determined, the idempotency of the target abstract function is judged according to its interface parameters, and, if the target abstract function is idempotent, it is rerun to repair the calculation error. Idempotency analysis can thus be performed dynamically at run time, no longer depends on a specially designed compiler or hardware, and supports adjustable fault-tolerance granularity, which improves the universality and fault-tolerance performance of the application fault-tolerant processing method of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described first aspect or one or more of the possible implementations of the first aspect.
According to the embodiments of the present application, abstract functions and their interface parameters can be marked and tracked at a preset level while the application program is running. When a calculation error is detected, the target abstract function in which the error occurred is determined, the idempotency of the target abstract function is judged according to its interface parameters, and, if the target abstract function is idempotent, it is rerun to repair the calculation error. Idempotency analysis can thus be performed dynamically at run time, no longer depends on a specially designed compiler or hardware, and supports adjustable fault-tolerance granularity, which improves the universality and fault-tolerance performance of the application fault-tolerant processing method of the embodiments of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in an electronic device, causes a processor in the electronic device to perform the application fault tolerant processing method of the first aspect or one or more of the possible implementations of the first aspect.
According to the embodiments of the present application, abstract functions and their interface parameters can be marked and tracked at a preset level while the application program is running. When a calculation error is detected, the target abstract function in which the error occurred is determined, the idempotency of the target abstract function is judged according to its interface parameters, and, if the target abstract function is idempotent, it is rerun to repair the calculation error. Idempotency analysis can thus be performed dynamically at run time, no longer depends on a specially designed compiler or hardware, and supports adjustable fault-tolerance granularity, which improves the universality and fault-tolerance performance of the application fault-tolerant processing method of the embodiments of the present application.
These and other aspects of the application will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present application and together with the description, serve to explain the principles of the present application.
Fig. 1a shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application.
Fig. 1b shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application.
Fig. 1c shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application.
Fig. 1d shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application.
Fig. 1e shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application.
FIG. 2 illustrates a flow chart of an application fault tolerance processing method according to an embodiment of the present application.
FIG. 3 illustrates a schematic diagram of abstract functions in an application fault tolerance processing method according to an embodiment of the application.
Fig. 4 is a schematic diagram of a parameter global version table in an application fault tolerance processing method according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a parameter local version table in an application fault tolerance processing method according to an embodiment of the present application.
FIG. 6 illustrates a schematic diagram of marking and tracking abstract functions and their interface parameters according to one embodiment of the application.
FIG. 7 illustrates a flow chart of an application fault tolerance processing method according to an embodiment of the present application.
FIG. 8 shows a schematic diagram of a process of searching for a rerun collection according to an embodiment of the present application.
Fig. 9 shows a schematic diagram of a local control flow graph according to an embodiment of the present application.
Fig. 10 is a schematic diagram illustrating a processing procedure of an application fault tolerance processing method according to an embodiment of the present application.
Fig. 11a shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application.
FIG. 11b is a schematic diagram illustrating a processing procedure of an application fault tolerance processing method according to an embodiment of the present application.
Fig. 12a shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application.
Fig. 12b is a schematic diagram illustrating a processing procedure of the application fault tolerance processing method according to an embodiment of the present application.
FIG. 13 illustrates a block diagram of an application fault tolerance processing device according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits have not been described in detail as not to unnecessarily obscure the present application.
In the related art, in the process of executing a computing task, a processor usually repairs a computing error by a re-running mechanism in the case that the computing error is detected. The re-run mechanism assumes that the computational error is caused by an occasional transient error in the processor component, and re-running the errant computational task can yield the correct result with a high probability.
In some implementations, the computational errors are repaired by a checkpoint (checkpoint) based re-run mechanism. Since the intermediate calculation result or the final calculation result may be written into the memory area or the register temporarily storing the input parameters in the calculation process, the initial input parameters are destroyed, and in order to correctly rerun the erroneous calculation task, the correct initial input parameters must be recovered. The checkpoint-based re-run mechanism backs up all input parameters to a specified memory region before the processor begins executing the computing task, a process that may be referred to as checkpointing, and then begins executing the computing task. In the process of executing the computing task, if the computing error is detected, returning to the process of starting the computing task, restoring the initial input parameters of the computing task by using the input parameters backed up by the check points, and then re-executing the computing task.
For example, a Checkpoint/re-run (Checkpoint/re-run) library may be created that the processor uses to backup the context (including but not limited to global variables, stacks, register values, etc.) that the computing task needs to use into a Checkpoint File (Checkpoint File) when initiating a parallel computing task, and then execute the computing task. In the process of executing the computing task, if a computing error is detected, a scene before the computing task is run is reconstructed by using the backed-up checkpoint file, and then the computing task is re-executed to repair the computing error.
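The checkpoint-based rerun mechanism described above can be sketched in a few lines. This is an illustrative sketch only: the function name `run_with_checkpoint`, the use of a dict as the task's context, the `detect_error` callback, and the retry limit are all assumptions, and `copy.deepcopy` stands in for the memory copy that backs up the inputs.

```python
import copy


def run_with_checkpoint(task, state, detect_error, max_retries=3):
    """Run `task(state)`; on a detected error, restore the inputs and rerun."""
    # Checkpointing: back up every input before the task starts.
    checkpoint = copy.deepcopy(state)
    for _ in range(max_retries + 1):
        result = task(state)
        if not detect_error(result):
            return result
        # The task may have overwritten its inputs, so restore them in place
        # from the backup before executing the task again.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("calculation error persisted after all reruns")
```

The two deep copies per rerun illustrate where the memory-copy cost of this mechanism comes from.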
The checkpoint-based rerun mechanism is suitable for repairing computing errors in most computing scenarios and is highly general. However, establishing a checkpoint and recovering a computing task from a checkpoint both involve a large number of memory copy operations, so performance is poor. This is especially true in computing scenarios that use a GPU, an NPU, or single instruction multiple data (SIMD) instructions: because a small number of instructions can process a large amount of data, the memory copy operations involved in establishing and restoring checkpoints consume a large amount of time, which may even exceed the time required to execute the computing task itself. In other words, the performance loss caused by applying the fault tolerance mechanism may exceed the loss caused by the computing errors, which makes the fault tolerance mechanism pointless.
In other implementations, an idempotent region (Idempotent Region) of the application is typically determined during the compilation phase and the computational error is repaired by a re-run mechanism based on the idempotent region, i.e., the computational error is repaired using the re-run mechanism based on the idempotent region determined during the compilation phase.
An idempotent region is a sequence of computer instructions that satisfies idempotency: the sequence does not change its original inputs while it runs, so running it repeatedly yields the same result. In an idempotent region, every accessed memory region and register is subject only to read-only (RO), write-only (WO), and read-after-write (RAW) operations, and never to write-after-read (WAR) operations.
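The access-pattern rule can be illustrated with a small check over a trace of reads and writes. The trace format (pairs of `"r"` or `"w"` and a location name) and the function name are hypothetical. The sketch also assumes a refinement not spelled out in the text: a write counts as a WAR violation only when the location was read before it was ever written within the region, since only then does the write destroy an original input.

```python
def is_idempotent_trace(accesses):
    """Return True if the access trace contains no write-after-read of an input."""
    read_as_input = set()  # locations read before any write in this region
    written = set()        # locations already written in this region
    for op, location in accesses:
        if op == "r":
            if location not in written:
                read_as_input.add(location)
        elif op == "w":
            if location in read_as_input:
                return False  # WAR: the write overwrites an original input
            written.add(location)
    return True
```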
When a rerun mechanism based on idempotent regions determined in the compiling stage is used, the idempotency of the application program must be analyzed by a compiler in the compiling stage. Specifically, in the compiling stage, the compiler analyzes the idempotency of each segment of the source code of the application program and divides the application program into a plurality of idempotent regions according to whether each segment is idempotent. While the application program is running (that is, while the processor is executing a computing task), if a computing error is detected and it occurred in an idempotent region, the computing error can be repaired simply by re-executing that idempotent region in the current memory context.
For example, the processor may be modified so that it can use idempotent regions for error repair. Because the modified processor performs speculative execution only within idempotent regions, even if a mis-speculation occurs, the program counter (PC) register is simply restored to the start position of the current idempotent region, without restoring the register context, and the idempotent region is re-executed. The GPU architecture may likewise be modified so that the GPU can use idempotent regions for error repair.
However, the above-described manner of enabling a processor to utilize idempotent regions for error repair by modifying hardware (e.g., processor, GPU architecture) involves not only hardware modification, but also relies on the compiler's idempotent analysis of the application program, and requires the processor to provide instructions that specifically mark the beginning and end of the idempotent regions. Therefore, the method can only be applied to specially designed processors, and the use scene is limited greatly.
Alternatively, the instruction sequence of the application program can be analyzed in the compiling stage and divided into a plurality of single-entry multiple-exit (SEME) regions, each containing tens to hundreds of instructions, after which the idempotency of each SEME region is analyzed. While the application is running, a computing error that occurs in an idempotent SEME region can be repaired by rolling back to the start position of that SEME region and re-executing it.
Although this approach involves no hardware changes, it still relies on the compiler's analysis of the application's idempotency. For example, the compiler still needs to analyze the source code of the application at the compilation level to obtain the control flow graph (CFG) of the application, divide the instruction sequence of the application into a sequence of SEME regions, and analyze the idempotency of each SEME region.
Therefore, a rerun mechanism based on idempotent regions determined at the compiling stage depends on the compiler's idempotency analysis of the application program (i.e., the analysis is performed in advance and cannot be adjusted at run time) and also on specific hardware or software characteristics. It suits only some computing scenarios and has poor universality.
In other embodiments, an asymmetric fault tolerance method based on CPU-accelerator (GPU, NPU, etc.) fusion is used to repair computing errors. The method divides a computer with a CPU-accelerator architecture into a strong fault-tolerant domain (strong resilient domain) and a weak fault-tolerant domain (weak resilient domain), wherein the strong fault-tolerant domain consists of the CPU and the memory accessed by the CPU, and the weak fault-tolerant domain consists of the accelerator and the memory accessed by the accelerator. The strong fault-tolerant domain is responsible for fault-tolerant operations such as checkpointing and re-executing computing tasks. The weak fault-tolerant domain is responsible for detecting errors and for preventing them from propagating to the strong fault-tolerant domain.
The asymmetric fault-tolerant method is applicable to computing scenarios with a CPU-accelerator architecture and to other computing scenarios that can be divided into strong and weak fault-tolerant domains. However, it still requires the compiler to analyze the idempotency of the application program in advance and cannot be adjusted at run time; it is suitable only for scenarios in which the input of the computing task is completely determined and external variables almost never change the execution order, which is highly limiting. Furthermore, the method is difficult to apply to computing architectures in which fault-tolerant domains cannot be strictly divided, such as database applications and applications that access shared memory and whose computation heavily involves the CPU.
To solve the above technical problems, the present application provides an application fault tolerance processing method. During the running of an application program (which is used to complete a preset computing task), the method marks and tracks abstract functions and their interface parameters. An abstract function indicates a functional module of a preset level in the application program; its interface parameters comprise input parameters and output parameters, where an input parameter is a memory area storing input data of the functional module and an output parameter is a memory area storing output data of the functional module. When a computing error of the application program is detected, the method determines, from the abstract functions, the target abstract function in which the computing error occurred and judges, according to the interface parameters of the target abstract function, whether the target abstract function is idempotent. If the target abstract function is idempotent, it is rerun to repair the computing error.
In this way, abstract functions and their interface parameters can be marked and tracked at a preset level while the application program runs. When a computing error is detected, the target abstract function in which the error occurred is determined, its idempotency is judged according to its interface parameters, and, if it is idempotent, it is rerun to repair the computing error. This enables dynamic idempotency analysis at run time, so that idempotency analysis of the application program no longer depends on a specially designed compiler or hardware, and it supports adjustable fault-tolerance granularity, thereby improving the universality and fault-tolerance performance of the application fault tolerance processing method of the embodiments of the present application.
The application fault tolerance processing method of the embodiments of the present application can be applied to general-purpose and special-purpose computing scenarios. When a hardware error of the processor causes a computing error in a running application program, the method repairs the computing error (i.e., restores the correct computing process) at as low a cost as possible so as to obtain a correct computing result. That is, the method may be used for hardware computing fault tolerance in general-purpose and special-purpose computing scenarios.
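The judge-then-rerun step can be sketched in Python as follows. This passage does not spell out the exact idempotency criterion, so the sketch assumes a simple one derived from the definition of idempotency given earlier: an abstract function is treated as idempotent when none of its input parameters is also one of its output parameters. The class AbstractFunction, the function handle_error, and the parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class AbstractFunction:
    name: str
    inputs: Tuple[str, ...]   # identifiers of input parameters
    outputs: Tuple[str, ...]  # identifiers of output parameters
    body: Callable[[Dict[str, int]], None]


def is_idempotent(af: AbstractFunction) -> bool:
    # Assumed criterion: no parameter is both read as input and written as output.
    return not set(af.inputs) & set(af.outputs)


def handle_error(af: AbstractFunction, memory: Dict[str, int]) -> bool:
    """Rerun the target abstract function if it is judged idempotent."""
    if not is_idempotent(af):
        return False  # rerunning would read overwritten input; not handled here
    af.body(memory)
    return True


def body_b(mem):
    mem["para_c"] = mem["para_a"] + mem["para_b"]


def body_c(mem):
    mem["para_d"] = mem["para_d"] + mem["para_e"]


af_b = AbstractFunction("af_b", ("para_a", "para_b"), ("para_c",), body_b)
af_c = AbstractFunction("af_c", ("para_d", "para_e"), ("para_d",), body_c)

# para_c holds a corrupted result; para_d was already overwritten by af_c.
memory = {"para_a": 100, "para_b": 200, "para_c": 999, "para_d": 300, "para_e": 200}
repaired_b = handle_error(af_b, memory)
repaired_c = handle_error(af_c, memory)
```

Here af_b is rerun and its output is recomputed from intact inputs, whereas af_c is left alone because its input para_d has been overwritten by its own output.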
General purpose computing scenarios include scenarios in which computing tasks are performed on a CPU-centric computer architecture. For example, general purpose computing tasks are performed on a server with a CPU of the X86 architecture or ARM architecture as a core. Computing tasks include processing, converting, and storing input data, such as database applications, data stream processing applications, and the like.
Special purpose computing scenarios include scenarios in which specific computing tasks are performed on an asymmetric architecture based on a CPU and a special purpose computing accelerator. For example, scientific computing tasks or deep neural network computing programs are executed on a computer of a CPU-GPU architecture.
Fig. 1a shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application. The application scenario shown in fig. 1a is a general-purpose computing scenario, where a computer architecture of the general-purpose computing scenario includes a CPU 110, a memory 120, and an input/output (I/O) device 130, where the CPU 110, the memory 120, and the I/O device 130 may be connected by means of a bus (bus), a network on chip (NoC), or the like, which is not limited in this application.
The memory 120 may be a random access memory (RAM) or other memory, and the specific implementation of the memory 120 is not limited in this application.
The application scenario shown in fig. 1a can be regarded as: the application fault tolerance processing method of the embodiment of the application is applied to a computer based on a single CPU.
Fig. 1b shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application. The application scenario shown in fig. 1b is a general-purpose computing scenario, where a computer architecture of the general-purpose computing scenario includes a plurality of CPUs 110, a shared memory 140 and I/O devices 130, where the CPUs 110, the shared memory 140 and the I/O devices 130 may be connected by a bus (bus), a NoC, or the like, which is not limited in this application.
The shared memory 140 may be a RAM or other storage, and the specific implementation of the shared memory 140 is not limited in this application.
The application scenario shown in fig. 1b can be regarded as: the application fault tolerance processing method is applied to a symmetrical architecture computer based on a plurality of CPUs with the same architecture.
Fig. 1c shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application. The application scenario shown in fig. 1c is a special-purpose computing scenario, where a computer architecture of the special-purpose computing scenario includes a plurality of CPUs 110, a plurality of accelerators (accelerators) 150, a shared memory 140, and an I/O device 130, where the CPUs 110, the accelerators 150, the shared memory 140, and the I/O device 130 may be connected by a bus (bus), a NoC, or the like, which is not limited in this application.
The accelerator 150, i.e. a computing accelerator, may be, for example, a graphics processor GPU, a neural network processor NPU, etc., and the specific type of the accelerator 150 is not limited in the present application; the specific number of CPUs 110 and accelerators 150 in the application scenario is also not limited by the present application.
Fig. 1d shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application. The application scenario shown in fig. 1d is a special-purpose computing scenario, where a computer architecture of the special-purpose computing scenario includes a plurality of CPUs 110, a plurality of accelerators 150 and I/O devices 130, where the CPUs 110, the accelerators 150 and the I/O devices 130 may be connected by a bus (bus), a NoC, or the like, which is not limited in this application. The specific number of CPUs 110 and accelerators 150 in the application scenario is also not limited by the present application.
Each CPU 110 is physically connected to its own dedicated memory, such as memory 160 in fig. 1d, and each CPU 110 has independent memory access control over the memory 160 physically connected to itself. Each accelerator 150 is physically coupled to its own dedicated memory, such as memory 170 in fig. 1d, and each accelerator 150 has independent memory access control over the memory 170 to which it is physically coupled. The memory 160, 170 may be RAM or other memory, and the specific implementation of the memory 160, 170 is not limited in this application.
Fig. 1e shows a schematic diagram of an application scenario of an application fault tolerance processing method according to an embodiment of the present application. The application scenario shown in fig. 1e is a special-purpose computing scenario, where a computer architecture of the special-purpose computing scenario includes a plurality of CPUs 110, a plurality of accelerators 150, a shared memory 140 and I/O devices 130, where the CPUs 110, the accelerators 150, the shared memory 140 and the I/O devices 130 may be connected by a bus (bus), a NoC, or the like, which is not limited in this application. The specific number of CPUs 110 and accelerators 150 in the application scenario is also not limited by the present application.
Each CPU 110 is physically connected to its own dedicated memory, such as memory 160 in fig. 1e, and each CPU 110 has independent memory access control over the memory 160 physically connected to itself. Each accelerator 150 is physically coupled to its own dedicated memory, such as memory 170 in fig. 1e, and each accelerator 150 has independent memory access control over the memory 170 to which it is physically coupled.
The shared memory 140, the memory 160, and the memory 170 may be RAM or other memories, and the specific implementation of the shared memory 140, the memory 160, and the memory 170 is not limited in this application.
In contrast to fig. 1d, in the application scenario shown in fig. 1e, the CPU 110 and the accelerator 150 can access the shared memory 140 in addition to the memory physically connected to themselves.
The application scenarios shown in fig. 1c, 1d, and 1e above can be considered as: the application fault tolerance processing method is applied to an asymmetric architecture computer based on a CPU-accelerator fusion architecture.
It should be noted that, although the application scenarios of the application fault-tolerant processing method of the embodiment of the present application are described by taking fig. 1a, 1b, 1c, 1d and 1e as examples, the application fault-tolerant processing method of the embodiment of the present application may also be applied to other computing scenarios, for example, scenarios in which computing tasks are performed on an asymmetric architecture computer based on a CPU of a different architecture. The application program fault tolerance processing method is not limited to specific application scenes.
The application fault tolerance processing method of the embodiments of the present application supports running on various computer architectures. The computer architecture includes one or more processors (e.g., CPUs, GPUs, NPUs, or processors of other architectures) and memory (e.g., RAM). Computer architectures include, but are not limited to, a computer based on a single CPU, a symmetrical architecture computer based on multiple CPUs of the same architecture, an asymmetrical architecture computer based on multiple CPUs of different architectures, an asymmetrical architecture computer based on a CPU and accelerators, and the like.
Each processor in the computer architecture may physically connect to its own dedicated memory and each processor has independent memory address control over the dedicated memory that it physically connects to (as shown in fig. 1 d); multiple processors in a computer architecture may also share the same memory (as shown in fig. 1b, 1 c) by physically connecting external memory controllers; each processor in the computer architecture may be connected to its own dedicated memory, while multiple processors may also share another memory and its address space (as shown in fig. 1 e).
That is, in the computer architecture, each processor may be physically connected to its own dedicated memory, have its own independent memory address space, or may share a memory with other processors by physically connecting to a memory controller, and share a single memory address space with other processors.
It should be noted that, the specific computer architecture to which the application program fault tolerance processing method is applied is not limited, and the connection manner of the processor and the memory in the computer architecture is not limited.
FIG. 2 illustrates a flow chart of an application fault tolerance processing method according to an embodiment of the present application. As shown in fig. 2, the fault tolerance processing method for the application program includes:
Step S210, running an application program, wherein the application program is used for completing a preset computing task.
The application program can be used to complete preset computing tasks in fields such as financial computing, scientific computing, and database services. The present application does not limit the specific type of the computing task, the specific field to which the computing task belongs, and the like.
The application may be run by the processor to complete a preset computing task.
Step S220, marking and tracking abstract functions and interface parameters of the abstract functions in the running process of the application program.
An abstract function (abstract function, AF) may be used to indicate a preset level of functional modules in an application. The functional module may be a specific code segment, a functional function, a kernel function, a class method, a module formed by a plurality of functions or class methods, a thread formed by a plurality of functions or class methods, a computing task with preset granularity, and the like in the application program.
A kernel function (also called an operator) is a predefined program executed by a computing accelerator core (e.g., a processor core of a computing accelerator such as a GPU or NPU). A kernel function may be used to perform a specific computation that the computing accelerator can accelerate, such as matrix multiplication. A kernel function usually has a well-defined computing function and well-defined input and output data types. When a computing accelerator uses a deep neural network model to compute on one data sample, it typically needs to execute multiple operators.
The level of the functional module may be an instruction level, an operator level, a task level, a thread level, or other level. The level of abstract functions may also be instruction level, operator level, task level, thread level, or other levels, corresponding to the functional modules. For example, the level of the function module is an operator level, and correspondingly, the level of the abstract function is also an operator level; the level of the functional module is a thread level, and correspondingly, the level of the abstract function is also a thread level. The level of the functional modules and the abstract functions can be set by those skilled in the art according to the actual situation, and the application is not particularly limited.
In this way, the application fault tolerance processing method of the embodiments of the present application supports adjustable fault-tolerance granularity. Compared with the prior art, which is limited to a single fixed granularity (for example, supporting only instruction-level fault tolerance), the method can perform fault-tolerant processing under low-cost constraints, improve fault-tolerance performance, and improve the universality of fault-tolerant processing as far as possible.
In one possible implementation, the interface parameters of the abstract function include input parameters and output parameters. The input parameters of the abstract function are memory areas for storing input data of the function module indicated by the abstract function, and the output parameters of the abstract function are memory areas for storing output data of the function module indicated by the abstract function.
An abstract function may have one or more input parameters: each contiguous memory area that stores input data of the functional module indicated by the abstract function may be regarded as one input parameter of the abstract function. Similarly, an abstract function may have one or more output parameters: each contiguous memory area that stores output data of the functional module indicated by the abstract function may be regarded as one output parameter of the abstract function.
The abstract function can process the data in the input parameters to obtain a processing result, and the processing result is written into the output parameters. The input parameters (i.e., the memory area storing the input data) and the output parameters (i.e., the memory area storing the output data) of the same abstract function may be completely independent, partially overlapping, or completely overlapping, which is not limited in this application. The input parameters of one abstract function may be the output parameters of another abstract function; the output parameters of one abstract function may be input parameters of another abstract function.
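The three possible relationships between an input parameter and an output parameter can be illustrated with a short Python sketch. It assumes that each parameter is a contiguous memory area described by a start address and a size in bytes, and that "completely overlapping" means the two areas are identical; the names Region and classify_overlap and the 4-byte sizes are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int  # address of the first byte
    size: int   # length in bytes

    @property
    def end(self) -> int:
        return self.start + self.size  # exclusive end address


def classify_overlap(inp: Region, out: Region) -> str:
    """Classify how an input memory area relates to an output memory area."""
    lo = max(inp.start, out.start)
    hi = min(inp.end, out.end)
    if lo >= hi:
        return "independent"  # no byte in common
    if inp == out:
        return "complete"     # same memory area (input/output parameter)
    return "partial"          # some, but not all, bytes in common


independent = classify_overlap(Region(0x000A000, 4), Region(0x000A008, 4))
complete = classify_overlap(Region(0x000B000, 4), Region(0x000B000, 4))
partial = classify_overlap(Region(0x000B000, 8), Region(0x000B004, 8))
```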
In one possible implementation, an identifier may be defined for an interface parameter of the abstract function, i.e., an identifier may be defined for an input parameter and an output parameter of the abstract function. The identifier is a unique identifier of the input parameter and the output parameter, and the address of the memory area corresponding to the input parameter and the output parameter can be used as the identifier thereof, or the identifiers of the input parameter and the output parameter can be determined in other manners, which is not limited in the application.
In one possible implementation, the interface parameters of the abstract function may be rewritten during the running of the application. In order to distinguish between the state before the overwriting of the interface parameter and the state after the overwriting, a version number may be defined for the state when the interface parameter stores a certain data, and the version number may be represented by an integer (e.g., 0, 1, 2, 3, … …).
That is, the interface parameters of the abstract function have a version number that may be used to indicate the data-holding state of the interface parameters of the abstract function. Specifically, the version number of the input parameter of the abstract function may be used to indicate the data storage state of the memory area storing the input data of the functional module indicated by the abstract function, and the version number of the output parameter of the abstract function may be used to indicate the data storage state of the memory area storing the output data of the functional module indicated by the abstract function.
For example, it is assumed that the version number of the interface parameter of the abstract function is represented by an integer greater than or equal to 0, the initial value is 0, the version number of the interface parameter is set to be 0 after the initialization assignment is performed on the interface parameter of the abstract function, and the version number is increased by 1 after the interface parameter is rewritten.
In one possible implementation, the abstract function reads data from the input parameters when running, and the version number of the input parameters is unchanged; the abstract function writes the calculation result to the output parameter, i.e. the output parameter is rewritten, the version number of the output parameter changes, for example, the version number of the output parameter increases by 1.
When the version number of the interface parameter of the abstract function changes, the corresponding memory area can be considered to be written with data (i.e. rewritten). Regardless of whether the written data is identical to the original data (i.e., the data before writing), the version number is updated whenever there is a write operation. The updated version number is higher than the version number before the update.
The version number of the interface parameter may be expressed by other means, so long as the state before writing and the state after writing can be distinguished, and the specific expression of the version number of the interface parameter is not limited in the present application.
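The integer versioning scheme described above can be sketched in Python as follows. The class name VersionedParam and its methods are hypothetical; the sketch follows the example rules given in the text: the version is 0 after the initial assignment, reads leave it unchanged, and every write increases it by 1 even when the written data equals the old data.

```python
class VersionedParam:
    """An interface parameter (memory area) with an identifier and a version."""

    def __init__(self, identifier, value):
        self.identifier = identifier
        self.value = value
        self.version = 0  # version number after the initial assignment

    def read(self):
        return self.value  # reading never changes the version number

    def write(self, value):
        self.value = value
        self.version += 1  # every write counts, even if the data is unchanged

    def label(self):
        return f"{self.identifier}{self.version}"


a = VersionedParam("a", 5)
b = VersionedParam("b", 7)
labels_before = (a.label(), b.label())  # versions right after initialization

a.write(a.read() + b.read())            # a is both read and rewritten
label_after_write = a.label()

a.write(a.read())                       # same data written again
label_after_same_write = a.label()
```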
FIG. 3 illustrates a schematic diagram of abstract functions in an application fault tolerance processing method according to an embodiment of the application. As shown in fig. 3, the input parameters of the abstract function af_a are a and b, the version number of a is 1, the version number of b is also 1, and the input parameters of the abstract function af_a may be marked as a1 (a is an identifier of the input parameter, 1 is a version number of the input parameter a) and b1 (b is an identifier of the input parameter, 1 is a version number of the input parameter b); the output parameter of the abstract function af_a is also a, that is, the output parameter a and the input parameter a of the abstract function af_a are the same memory area, the abstract function af_a writes the operation result into the output parameter a, and since the writing operation occurs, the version number of a is changed from 1 to 2, the output parameter of the abstract function af_a can be marked as a2 (a is an identifier of the output parameter, and 2 is the version number of the output parameter a).
In one possible implementation, because an abstract function specifically indicates an instruction sequence, a code segment, a kernel function, a function (or method), a thread, or the like in the application program, the life cycle of the abstract function may run from the first executed instruction to the last executed instruction among all the computer instructions it involves.
In one possible implementation, the life cycle of the interface parameters (i.e., the input parameters and the output parameters) of the abstract function is independent of the life cycle of the abstract function. The life cycle of an interface parameter begins when the corresponding memory area has been allocated by the operating system and first stores valid data, and ends when the memory area is released and returned to the operating system. The life cycle of an interface parameter may span the life cycles of one or more abstract functions.
In one possible implementation, if the life cycle of the data in a memory area read by the abstract function exceeds the life cycle of the abstract function, the memory area needs to be listed as an interface parameter of the abstract function, that is, as an input parameter and/or an output parameter of the abstract function. In other words, data in memory areas that are not interface parameters of the abstract function does not outlive the abstract function.
In one possible implementation, the application runtime may be considered to be composed and described by multiple abstract functions, such that the abstract functions and their interface parameters (i.e., input parameters and output parameters) may be marked and tracked during the application runtime.
In one possible implementation manner, when the abstract function and the interface parameters thereof are marked and tracked, the function modules in the application program can be identified according to the level of the preset function modules in the running process of the application program, and the identified function modules are marked as the abstract function; then determining interface parameters and version numbers (namely input parameters and version numbers thereof, output parameters and version numbers thereof) of the abstract function; and then updating the parameter global version table and the parameter local version table according to the interface parameters of the abstract function and the version numbers thereof.
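The marking and tracking sequence described above can be sketched in Python. This is a minimal illustration, not the method of the application itself: both tables are modeled as plain dictionaries, parameters are assumed to be registered in the global table before the function runs, and each write to an output parameter is assumed to raise its version by 1. The function run_tracked is hypothetical, and the starting version 2 of para_c is an assumption.

```python
def run_tracked(af_id, func, inputs, outputs, memory, pgvmap, plvmap):
    """Run one functional module as an abstract function and track its parameters."""
    # Versions of the input parameters as seen when the function starts.
    in_versions = {p: pgvmap[p] for p in inputs}
    func(memory)
    # Each output parameter was written, so its version moves up by one.
    out_versions = {p: pgvmap[p] + 1 for p in outputs}
    pgvmap.update(out_versions)  # global table: latest versions
    plvmap[af_id] = {"inputs": in_versions, "outputs": out_versions}  # local table


def af_b(mem):
    mem["para_c"] = mem["para_a"] + mem["para_b"]


def af_c(mem):
    mem["para_d"] = mem["para_d"] + mem["para_e"]


memory = {"para_a": 100, "para_b": 200, "para_c": 0, "para_d": 100, "para_e": 200}
# The starting version of para_c (2) is an assumption made for this sketch.
pgvmap = {"para_a": 2, "para_b": 4, "para_c": 2, "para_d": 0, "para_e": 0}
plvmap = {}

run_tracked("af_b", af_b, ["para_a", "para_b"], ["para_c"], memory, pgvmap, plvmap)
run_tracked("af_c", af_c, ["para_d", "para_e"], ["para_d"], memory, pgvmap, plvmap)
```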
The parameter global version table (parameter global version map, PGVMap) is used to record the latest version numbers of interface parameters (including input parameters and output parameters) of all abstract functions of the application program. The parameter global version table PGVMap may include identifiers and latest version numbers of interface parameters of all abstract functions of the application. The parameter global version table PGVMap may also include a mapping of identifiers of interface parameters to their latest version numbers. The parameter global version table PGVMap exists independently of all abstract functions.
In one possible implementation, when the parameter global version table PGVMap is updated according to the interface parameters of an abstract function and their version numbers, the identifier of each interface parameter may be used to check whether that interface parameter already exists in the parameter global version table PGVMap. If the interface parameter does not exist in the parameter global version table PGVMap, its identifier and version number are added to the parameter global version table PGVMap.
If the interface parameter already exists in the parameter global version table PGVMap, its version number in the parameter global version table is updated. Specifically, the version number of the interface parameter of the abstract function is compared with the version number recorded for that interface parameter in the parameter global version table PGVMap; if the two differ, the version number recorded in the parameter global version table PGVMap is replaced with the version number of the interface parameter of the abstract function.
In this way, the parameter global version table can be dynamically updated and maintained, which improves its accuracy.
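The add-or-update rule for the parameter global version table can be sketched in Python as follows. The sketch assumes that the table is a dictionary from identifier to latest version number and that the caller passes the version numbers of an abstract function's interface parameters after the function has run; the function name update_pgvmap is hypothetical.

```python
def update_pgvmap(pgvmap, param_versions):
    """Merge an abstract function's parameter versions into the global table.

    param_versions maps each interface parameter identifier to its version
    number after the abstract function has run.
    """
    for identifier, version in param_versions.items():
        if identifier not in pgvmap:
            pgvmap[identifier] = version  # unknown parameter: add it
        elif pgvmap[identifier] != version:
            pgvmap[identifier] = version  # known parameter, new version: update it
    return pgvmap


pgvmap = {"para_a": 2, "para_b": 4, "para_d": 0, "para_e": 0}

# After af_b: inputs unchanged, output para_c not yet in the table.
update_pgvmap(pgvmap, {"para_a": 2, "para_b": 4, "para_c": 3})
# After af_c: para_d was rewritten (0 -> 1), para_e was only read.
update_pgvmap(pgvmap, {"para_d": 1, "para_e": 0})
```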
Fig. 4 is a schematic diagram of a parameter global version table in an application fault tolerance processing method according to an embodiment of the present application. As shown in fig. 4, the abstract function af_b implements a function of "para_c=para_a+para_b", and the interface parameters thereof include input parameters para_a, para_b, and output parameters para_c. Wherein, para_a, para_b, para_c are identifiers of interface parameters.
When the abstract function af_b runs, its input parameters are para_a (address 0x000a000, value 100) with version number 2 and para_b (address 0x000a004, value 200) with version number 4, and its output parameter is para_c (address 0x000a008). After the abstract function af_b has run, the output result is 300, which may be stored in the memory area indicated by the output parameter para_c, that is, the memory area at address 0x000a008.
Because of the writing operation, the version number of the output parameter para_c changes, the changed version number is 3, and then the version number of para_c in the parameter global version table PGVMap is updated to 3, i.e. the latest version number of para_c is 3. Since the abstract function af_b only reads the input parameters para_a and para_b and does not perform a write operation, the version numbers of para_a and para_b are unchanged, and the version numbers of para_a and para_b in the parameter global version table PGVMap remain unchanged.
The abstract function af_c realizes the functions of "para_d=para_d+para_e", and the interface parameters include input parameters para_d, para_e and output parameters para_d. Wherein, para_d and para_e are identifiers of interface parameters.
When the abstract function af_c runs, its input parameters are para_d (address 0x000b000, value 100) with version number 0 and para_e (address 0x000b004, value 200) with version number 0, and its output parameter is para_d (address 0x000b000); that is, para_d is both an input parameter and an output parameter. After the abstract function af_c has run, the output result 300 is obtained and may be stored in the memory area indicated by the output parameter para_d, that is, the memory area at address 0x000b000.
Because para_d is written as the output parameter, its version number increases by 1, from 0 to 1. The version number of para_d as the output parameter of the abstract function af_c (1) is therefore higher than the version number of para_d recorded in the parameter global version table PGVMap (0), so the version number of para_d in the parameter global version table is updated from 0 to 1 (0 -> 1). After the update, the latest version number of para_d recorded in the parameter global version table PGVMap is 1. Because the abstract function af_c only reads the input parameter para_e and does not write to it, the version number of para_e does not change, and the version number of para_e in the parameter global version table remains unchanged.
Also shown in fig. 4 is the mapping of the identifiers of the interface parameters to their version numbers in the parameter global version table PGVMap. For example, the latest version number of the interface parameter para_a is 2, the latest version number of para_b is 4, the latest version number of para_c is 3, the latest version number of para_d is 1, and the latest version number of para_e is 0.
In one possible implementation, the parameter global version table PGVMap may be described by a pseudo code. The parametric global version table PGVMap shown in fig. 4 can be described by the following pseudo code:
Figure BDA0003443872210000181
it should be noted that, the parameter global version table PGVMap may also be represented by a matrix, an array, etc., and those skilled in the art may set a specific representation mode of the parameter global version table PGVMap according to actual situations, which is not limited in this application.
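As an illustration only, the parameter global version table of fig. 4 could be held in a plain dictionary that maps each parameter identifier to its latest version number. The sketch below is an assumption about one possible representation, not the pseudo code of the original figure; the helper name record_write is hypothetical. It starts from the state before af_c runs (para_d at version 0) and applies the write to para_d.

```python
# Hypothetical sketch: PGVMap as a dict from parameter identifier to latest version.
# State before af_c runs; para_a, para_b and para_c already carry the versions of fig. 4.
pgvmap = {"para_a": 2, "para_b": 4, "para_c": 3, "para_d": 0, "para_e": 0}


def record_write(pgvmap, ident):
    """Raise the version of a written parameter by 1 and return the new version."""
    pgvmap[ident] += 1
    return pgvmap[ident]


# af_c computes para_d = para_d + para_e: para_e is only read, para_d is written.
out_version = record_write(pgvmap, "para_d")
print(pgvmap)
```

Only the written parameter changes version; parameters that are merely read keep their entries untouched.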
In one possible implementation, a parameter local version table (Parameter Local Version Map, PLVMap) may be used to record the version numbers of the interface parameters (including the version number of the input parameters and the version number of the output parameters) at the time the various abstract functions of the application are run. That is, the parameter local version table PLVMap may be used to record the relationship of each abstract function of an application program with a specific version of input parameters and output parameters at runtime. The parameter local version table PLVMap exists in dependence of the abstract function.
In one possible implementation, the parameter local version table PLVMap may include parameter version sub-tables corresponding to respective abstract functions of the application, each parameter version sub-table being operable to record an identifier of the corresponding abstract function and an identifier and version number of an interface parameter of the abstract function. The identifier and version number of the interface parameter of each abstract function of the application program in running can be obtained from the parameter local version table PLVMap, and the identifier and version number comprise the identifier and version number of the input parameter and the identifier and version number of the output parameter.
In one possible implementation, when updating the parameter local version table PLVMap according to the interface parameter and its version number of the abstract function, the identifier of the abstract function and the identifier and version number of the interface parameter of the abstract function may be added to the parameter local version table PLVMap. Wherein the identifier of the abstract function is a unique identifier of the abstract function. The memory address of the functional module indicated by the abstract function may be determined as an identifier of the abstract function. The identifier of the abstract function may also be determined by other means, which is not limiting in this application.
For example, when updating the parameter local version table PLVMap according to the interface parameter and its version number of the abstract function, a parameter version sub-table corresponding to the abstract function may be added to the parameter local version table PLVMap, and the identifier of the abstract function, the identifier and version number of the input parameter, and the identifier and version number of the output parameter may be written into the parameter version sub-table, thereby completing the updating of the parameter local version table PLVMap.
By the method, the dynamic updating and maintenance of the parameter local version table can be realized, so that the accuracy of the parameter local version table is improved.
Fig. 5 is a schematic diagram of a parameter local version table in an application fault tolerance processing method according to an embodiment of the present application. As shown in fig. 5, the abstract function af_c has a function of "para_d=para_d+para_e", and the interface parameters thereof include an input parameter para_d, para_e, and an output parameter para_d. Wherein, para_d and para_e are identifiers of interface parameters.
When the abstract function AF_C runs, the input parameters are para_d with the version number of 0 and para_e with the version number of 0, and the output parameters are para_d, namely, the para_d is the input and output parameters. After the abstract function af_c is operated, an output result is obtained, and the output result can be stored in a memory area indicated by the output parameter para_d. Since para_d is written as an output parameter, the version number of para_d changes, and the version number increases by 1 and changes from 0 to 1.
When the parameter global version table PGVMap is updated, the abstract function AF_C only reads the input parameter para_e and does not perform writing operation, so that the version number of the para_e is unchanged, and the version number of the para_e in the parameter global version table is kept unchanged; since the version number of para_d in the parameter global version table PGVMap is 0, and the version number of the output parameter para_d of the abstract function af_c is 1, which is higher than the version number 0 of para_d in the parameter global version table PGVMap, the version number of para_d in the parameter global version table PGVMap is updated from 0 to 1 (i.e., 0- > 1). After updating, the latest version number of para_d recorded in the parameter global version table PGVMap is 1.
When updating the parameter local version table PLVMap, a parameter version sub-table corresponding to the abstract function af_c may be added to the parameter local version table PLVMap, and the identifier of the abstract function af_c (i.e. af_c), the identifiers and version numbers of its input parameters, and the identifier and version number of its output parameter may be written into the parameter version sub-table:
when the abstract function AF_C runs, the version numbers of the input parameters para_d and para_e are both 0, so the input parameter para_d with version number 0 and the input parameter para_e with version number 0 can be recorded at the input (input) position in the parameter version sub-table corresponding to the abstract function AF_C;
when the abstract function AF_C runs, the version number of the output parameter para_d is 1, so the output parameter para_d with version number 1 can be recorded at the output (output) position in the parameter version sub-table corresponding to the abstract function AF_C.
Unlike the parameter global version table PGVMap, which records only the latest version number of the parameter para_d, the parameter local version table PLVMap has recorded therein the version number 0 of para_d as the input parameter of the abstract function af_c and the version number 1 of para_d as the output parameter of the abstract function af_c. Thus, the parameter local version table PLVMap may record the relationship of the abstract function at run-time with a particular version of the input parameters and output parameters.
As can be seen from fig. 5, when one parameter (for example, para_d) is both an input parameter and an output parameter of an abstract function (for example, the abstract function af_c), that is, when the parameter is an input/output parameter, updating the parameter global version table PGVMap records only the latest version number of the parameter (for example, version number 1 of para_d). Updating the parameter local version table PLVMap records the version number of the parameter as an input (for example, version number 0 of para_d) at the input position in the parameter version sub-table corresponding to the abstract function, and the version number of the parameter as an output (for example, version number 1 of para_d) at the output position in the same sub-table.
Also shown in fig. 5 is the relationship between the parameter global version table PGVMap and the parameter local version table PLVMap. The version number of the input parameter para_e is unchanged, so the version number of para_e recorded in both the parameter global version table PGVMap and the parameter local version table PLVMap is 0. The version number of the input/output parameter para_d changes from 0 to 1 (0->1): in the parameter local version table PLVMap, version number 0 of para_d as an input parameter (i.e. the version number before the update) is recorded at the input (input) position in the parameter version sub-table corresponding to the abstract function, and version number 1 of para_d as an output parameter (i.e. the version number after the update) is recorded at the output (output) position; in the parameter global version table PGVMap, only the latest version number 1 of para_d is recorded.
In addition, fig. 5 also shows the mapping relationship between the identifiers of the interface parameters in the parameter local version table PLVMap and their version numbers. For example, when the abstract function af_c is running, the version number of the input parameter para_d is 0, the version number of the input parameter para_e is 0, and the version number of the output parameter para_d is 1.
It should be noted that, in the embodiment shown in fig. 5, the parameter local version table PLVMap only records the interface parameters of one abstract function af_c. In practical applications, the parameter local version table PLVMap may record the interface parameters of a plurality of abstract functions, and the number of abstract functions recorded by the parameter local version table PLVMap is not limited in this application.
In one possible implementation, the parameter local version table PLVMap may be described by a pseudo code. The parameter local version table PLVMap shown in fig. 5 can be described by the following pseudo code:
Figure BDA0003443872210000201
it should be noted that, the parameter local version table PLVMap may be expressed in other manners, and those skilled in the art may select a specific expression manner of the parameter local version table PLVMap according to actual situations, which is not limited in this application.
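As an illustration only, the tables of fig. 5 could be sketched with nested dictionaries. The layout below (one sub-table per abstract function with "input" and "output" positions) and the helper versions_at_run are assumptions made for this sketch, not the pseudo code of the original figure.

```python
# Hypothetical sketch of fig. 5 after AF_C has run.
# PGVMap keeps only the latest version of each parameter.
pgvmap = {"para_d": 1, "para_e": 0}

# PLVMap keeps one parameter version sub-table per abstract function, with the
# versions seen at the input position and produced at the output position.
plvmap = {
    "AF_C": {
        "input": {"para_d": 0, "para_e": 0},
        "output": {"para_d": 1},
    }
}


def versions_at_run(plvmap, af_id, ident):
    """Return (input version, output version) of a parameter for one abstract function."""
    entry = plvmap[af_id]
    return entry["input"].get(ident), entry["output"].get(ident)


print(versions_at_run(plvmap, "AF_C", "para_d"))  # input/output parameter
print(versions_at_run(plvmap, "AF_C", "para_e"))  # read-only parameter
```

The sketch shows the contrast described above: for para_d the local table retains both version 0 (as input) and version 1 (as output), while the global table holds only version 1.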
In one possible implementation, in the case that the interface parameter of the abstract function is modified by an operation other than the abstract function of the application program, the version number of the interface parameter in the parameter global version table PGVMap may be updated, for example, the version number of the interface parameter in the parameter global version table PGVMap is increased by 1, and then the interface parameter in the parameter local version table PLVMap, whose version number is lower than the version number in the parameter global version table PGVMap, is marked as invalid.
For example, the version number of the interface parameter para_f is 5, and in the case that the interface parameter para_f is modified by an operation other than the abstract function of the application program, the version number of the interface parameter para_f in the parameter global version table PGVMap may be updated to 6, and then the interface parameter para_f having a version number lower than 6 (the version number of the interface parameter para_f in the parameter global version table PGVMap) in the parameter local version table PLVMap may be marked as invalid.
Under the condition that the interface parameters of the abstract function are modified by other operations outside the abstract function of the application program, the version number of the interface parameters in the parameter global version table is updated, and the interface parameters with the version numbers lower than the version numbers in the parameter global version table are marked as invalid in the parameter local version table, so that the modification of the interface parameters of the abstract function by the operations outside the abstract function can be tracked, and the accuracy of the parameter global version table and the parameter local version table is improved.
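The para_f example can be sketched as follows. This is a minimal illustration under stated assumptions: the nested-dictionary layout, the abstract function name AF_X, its recorded versions, and the representation of invalid marks as a set of (abstract function, position, parameter) tuples are all hypothetical.

```python
def record_external_write(pgvmap, plvmap, invalid, ident):
    """Handle a write to `ident` made outside any abstract function."""
    # Raise the global version of the parameter by 1.
    pgvmap[ident] += 1
    latest = pgvmap[ident]
    # Mark every local record of this parameter whose version is now lower
    # than the global version as invalid.
    for af_id, entry in plvmap.items():
        for role in ("input", "output"):
            version = entry[role].get(ident)
            if version is not None and version < latest:
                invalid.add((af_id, role, ident))


# Hypothetical state: AF_X read para_f at version 4 and wrote it at version 5.
pgvmap = {"para_f": 5}
plvmap = {"AF_X": {"input": {"para_f": 4}, "output": {"para_f": 5}}}
invalid = set()

record_external_write(pgvmap, plvmap, invalid, "para_f")
print(pgvmap, sorted(invalid))
```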
In this way, during the running of the application program, the function modules in the application program can be identified according to the preset level and marked as abstract functions; the interface parameters of the abstract functions and their version numbers are then determined, and the parameter global version table and the parameter local version table are updated accordingly. The abstract functions and their interface parameters can thus be marked while the application program runs, and the version changes (i.e. the read-write history) of the memory areas indicated by the interface parameters (input parameters and output parameters) can be tracked through simple data structures (the parameter global version table PGVMap and the parameter local version table PLVMap). Marking and tracking the abstract functions and their interface parameters in this way has low complexity and low cost, which can improve processing efficiency.
In one possible implementation, the marking and tracking of the abstract functions and their interface parameters during the running of the application program may be implemented by an abstract function parameter read-write tracking module. The abstract function parameter read-write tracking module may provide a functional interface for marking abstract functions and their interface parameters.
For example, the function interface for marking may mark a function module in an application program as an abstract function according to a preset level, where the function module is a specific code segment, a function, a kernel function, a class method, a module formed by multiple functions or class methods, a thread formed by multiple functions or class methods, a computing task with a preset granularity, and the like in the application program. The function interface for marking may also mark a memory area storing input data of the function module indicated by the abstract function as an input parameter of the abstract function, and mark a memory area storing output data of the function module indicated by the abstract function as an output parameter of the abstract function.
In one possible implementation, the functional interface provided by the abstract function parameter read-write tracking module for marking may be described by the following pseudo code:
API_AF_launch(function,params,input_param_index,output_param_index)
In the above functional interface, API_AF_launch represents the interface name of the application programming interface (API) that marks the abstract function and its interface parameters, function represents the identifier or address of the abstract function, params represents the list of identifiers or memory addresses of the interface parameters of the abstract function, input_param_index indicates the input parameters in params, and output_param_index indicates the output parameters in params.
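One way to picture the marking interface is a wrapper that records versions around the call. The sketch below is an assumption-laden illustration, not the module's actual implementation: memory is modeled as a dictionary keyed by parameter identifier, the index arguments are taken to be lists of positions in params, the function name stands in for the abstract function identifier, and api_af_launch is a hypothetical Python stand-in for API_AF_launch.

```python
# Hypothetical state: memory holds the values of fig. 4 for para_d and para_e.
memory = {"para_d": 100, "para_e": 200}
pgvmap = {"para_d": 0, "para_e": 0}
plvmap = {}


def af_c(mem):
    """The function module marked as abstract function AF_C."""
    mem["para_d"] = mem["para_d"] + mem["para_e"]


def api_af_launch(function, params, input_param_index, output_param_index):
    """Mark `function` as an abstract function, run it, and record versions."""
    inputs = [params[i] for i in input_param_index]
    outputs = [params[i] for i in output_param_index]
    # Input versions are the versions current when the function starts.
    entry = {"input": {p: pgvmap[p] for p in inputs}, "output": {}}
    function(memory)
    # Every written parameter gets a new version in both tables.
    for p in outputs:
        pgvmap[p] += 1
        entry["output"][p] = pgvmap[p]
    plvmap[function.__name__] = entry


api_af_launch(af_c, ["para_d", "para_e"], [0, 1], [0])
print(memory["para_d"], pgvmap, plvmap)
```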
In one possible implementation, the abstract function parameter read-write tracking module further provides a functional interface for tracking all read-write operations on the interface parameters of the abstract functions over the application lifecycle, regardless of whether the read-write operations come from an abstract function of the application or from other parts of the application outside the abstract functions.
In one possible implementation, the functional interface provided by the abstract function parameter read-write tracking module for tracking may be described by the following pseudo code:
API_malloc(ident,size)
API_free(ident)
API_memcpy(dest,src,size)
API_write(dest,data,size)
in the above functional interface, API_malloc represents the name of the API for allocating the memory area indicated by an interface parameter; API_free represents the name of the API for releasing the memory area indicated by an interface parameter; API_memcpy represents the name of the API for copying the memory area indicated by an interface parameter; API_write represents the name of the API for writing to the memory area indicated by an interface parameter. API_write may also be implemented through the assignment operator of a class object in a programming language, for example, by overloading the "=" operator for a specific class of objects in languages such as C++, Java, and Python;
ident denotes the identifier, memory address, or memory address pointer of the memory area to be allocated or released; dest denotes the identifier, memory address, or memory address pointer of the memory area to be written; src denotes the identifier, memory address, or memory address pointer of the memory area to be read; data denotes the data to be written; and size denotes the length of the data to be written.
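The tracking interface can be pictured as thin wrappers that update the version table on every write. The sketch below is only an illustration under assumptions: the class name Tracker, the use of bytearray buffers keyed by identifier, version 0 on allocation, and dropping the version entry on release are all hypothetical choices that the text does not specify.

```python
class Tracker:
    """Hypothetical read-write tracker keyed by parameter identifier."""

    def __init__(self):
        self.memory = {}  # ident -> bytearray standing in for a memory area
        self.pgvmap = {}  # ident -> latest version number

    def api_malloc(self, ident, size):
        # Assumption: a freshly allocated area starts at version 0.
        self.memory[ident] = bytearray(size)
        self.pgvmap[ident] = 0

    def api_free(self, ident):
        # Assumption: releasing an area also drops its version entry.
        del self.memory[ident]
        del self.pgvmap[ident]

    def api_write(self, dest, data, size):
        # Copy at most `size` bytes and count the write as a new version.
        n = min(size, len(data), len(self.memory[dest]))
        self.memory[dest][:n] = data[:n]
        self.pgvmap[dest] += 1

    def api_memcpy(self, dest, src, size):
        # A copy reads src (version unchanged) and writes dest (new version).
        self.api_write(dest, self.memory[src], size)


t = Tracker()
t.api_malloc("a", 4)
t.api_malloc("b", 4)
t.api_write("a", b"\x01\x02\x03\x04", 4)
t.api_memcpy("b", "a", 4)
print(t.pgvmap, bytes(t.memory["b"]))
```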
In practical applications, the abstract function parameter read-write tracking module is generally implemented by wrapping the memory management interfaces of the software architecture and the compute function call interfaces of the specific software or hardware computing processor.
FIG. 6 illustrates a schematic diagram of marking and tracking abstract functions and their interface parameters according to one embodiment of the application. As shown in fig. 6, the application 611 runs on a processor 610, which may be a CPU, GPU, NPU or other computing accelerator, and the processor 610 includes an abstract function parameter read-write tracking module 612. Data used during the running of the application 611 is stored on the memory 620 with a memory address space of 0x0000000 to 0xFFFFFFFF.
At runtime, the application 611 is described by a number of abstract functions. During the running of the application 611, the abstract function parameter read-write tracking module 612 may mark the abstract functions (for example, abstract function 1, …, abstract function k in fig. 6, where k is a positive integer) and their interface parameters through the functional interface for marking, track all read-write operations on the interface parameters of the abstract functions over the life cycle of the application through the functional interface for tracking, and record the tracking results of the interface parameters in the parameter global version table PGVMap and the parameter local version table PLVMap.
For example, as shown in fig. 6, during the running process of the application 611, the abstract function parameter read-write tracking module 612 marks the interface parameters of the abstract function k, including: 2 input parameters (input parameter m1 and input parameter m 2), 1 input/output parameter (m 3), 2 output parameters (output parameter m4 and output parameter m 5). The abstract function parameter read-write tracking module 612 can track read-write operations of interface parameters of the abstract function k, and record tracking results in a parameter global version table PGVMap and a parameter local version table PLVMap.
In step S230, in the case that the computing error of the application program is detected, a target abstract function in which the computing error occurs is determined from the abstract functions.
During the running of the application program, whether a calculation error has occurred can be detected by, for example, error correction code (ECC) checking, having at least two sets of devices execute the same application program at the same time, or having at least two processor cores execute the same operator. Any existing detection mode may be used to detect the calculation error, and the specific mode of detecting whether a calculation error has occurred in the application program is not limited in this application.
In the case that the computing error of the application program is detected, the error position, for example, the instruction position, the code position, the function, the operator, the thread and the like where the computing error occurs can be determined, and then the target abstract function where the computing error occurs can be determined from the marked abstract functions according to the error position.
Step S240, determining whether the target abstract function has idempotency according to the interface parameters of the target abstract function.
After the target abstract function in which the calculation error occurred is determined, the interface parameters of the target abstract function can be analyzed to judge whether the target abstract function has idempotency. For example, it may be determined whether the memory region indicated by the input parameters of the target abstract function overlaps the memory region indicated by its output parameters: if the two memory regions overlap, the target abstract function does not have idempotency; if they do not overlap, the target abstract function has idempotency.
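The overlap test can be sketched as an interval check. This is a minimal illustration, assuming each parameter's memory area is described by a (start address, size in bytes) pair; the 4-byte sizes and the addresses 0x1000 and 0x2000 are hypothetical, and the function names are not taken from the original.

```python
def regions_overlap(a, b):
    """Return True if two (start, size) memory regions share at least one byte."""
    a_start, a_size = a
    b_start, b_size = b
    return a_start < b_start + b_size and b_start < a_start + a_size


def is_idempotent(input_regions, output_regions):
    """Idempotent when no input region overlaps any output region."""
    return not any(
        regions_overlap(i, o) for i in input_regions for o in output_regions
    )


# af_c reads para_d (0x000b000) and para_e (0x000b004) and writes para_d.
af_c_inputs = [(0x000B000, 4), (0x000B004, 4)]
af_c_outputs = [(0x000B000, 4)]
print(is_idempotent(af_c_inputs, af_c_outputs))

# A hypothetical function that reads one area and writes a disjoint one.
print(is_idempotent([(0x1000, 4)], [(0x2000, 4)]))
```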
In one possible implementation, when determining whether the target abstract function has idempotency, it may be determined whether the same identifier exists in the identifiers of the interface parameters of the target abstract function in the parameter local version table PLVMap. For example, it may be determined whether or not there is the same identifier as the identifier of the input parameter of the target abstract function among the identifiers of the output parameters of the target abstract function in the parameter local version table PLVMap; alternatively, it may be determined whether or not there is the same identifier as the identifier of the output parameter of the target abstract function among the identifiers of the input parameters of the target abstract function in the parameter local version table PLVMap.
If the same identifier appears more than once among the identifiers of the interface parameters of the target abstract function in the parameter local version table PLVMap, it is determined that the target abstract function does not have idempotency; if no identifier is repeated, it is determined that the target abstract function has idempotency.
For example, in the parameter local version table PLVMap described in the pseudo code below, the identifier para_d appears more than once among the identifiers of the interface parameters of the abstract function af_c, and therefore the abstract function af_c does not have idempotency; no identifier is repeated among the identifiers of the interface parameters of the abstract function af_d, and therefore the abstract function af_d has idempotency; the identifier para_d appears more than once among the identifiers of the interface parameters of the abstract function af_e, and therefore the abstract function af_e does not have idempotency.
Figure BDA0003443872210000231
By the method, the idempotency of the target abstract function can be analyzed based on the parameter local version table PLVMap, the method is simple and quick, and the processing efficiency can be improved.
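The identifier-based check can be sketched as a set intersection over one sub-table. The sketch assumes the nested-dictionary layout used for illustration here; the sub-table contents for AF_D and AF_E (including para_f and all version numbers other than those of AF_C) are hypothetical, since the original figure is not reproduced in the text.

```python
def is_idempotent(plvmap, af_id):
    """Idempotent when no identifier is both an input and an output parameter."""
    entry = plvmap[af_id]
    return not (set(entry["input"]) & set(entry["output"]))


# Hypothetical PLVMap; only the AF_C sub-table follows fig. 5.
plvmap = {
    "AF_C": {"input": {"para_d": 0, "para_e": 0}, "output": {"para_d": 1}},
    "AF_D": {"input": {"para_d": 1}, "output": {"para_f": 1}},
    "AF_E": {"input": {"para_d": 1}, "output": {"para_d": 2}},
}

for af_id in plvmap:
    print(af_id, is_idempotent(plvmap, af_id))
```

The check only touches the sub-table of the target abstract function, so its cost depends on the number of interface parameters of that function rather than on the size of the whole table.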
Step S250, re-running the target abstract function to repair the computing error, in case the target abstract function has idempotent properties.
Under the condition that the target abstract function has idempotent property, the target abstract function is an idempotent function, namely, the memory area indicated by the input parameters of the target abstract function is not rewritten or destroyed, and under the condition that the input is not re-specified, the repeated operation of the target abstract function can obtain identical output, so that the calculation error can be repaired by re-operating the target abstract function, and the application program can restore the correct calculation process.
FIG. 7 illustrates a flow chart of an application fault tolerance processing method according to an embodiment of the present application. As shown in fig. 7, the application fault tolerance processing method further includes:
step S260, searching for a rerun set among the abstract functions, in the case that the target abstract function does not have idempotency.
Wherein a rerun set (re-execution set) may be used to indicate the minimum set of abstract functions that must be rerun to repair a computing error of the application.
Because the target abstract function does not have idempotency, the memory area indicated by its input parameters may have been rewritten or destroyed, and directly re-running the target abstract function without re-specifying the input data may produce unexpected results. The data in the memory area indicated by the input parameters of the target abstract function therefore needs to be recovered first, and only then can the target abstract function be re-executed. This may require rolling back across multiple abstract functions, i.e., other abstract functions may need to be run before the target abstract function in order to recreate its original input (i.e., correct input data).
The minimum set of abstract functions, including the target abstract function, that must be executed so that re-running the target abstract function yields the same result no matter how many times it is repeated may be referred to as a rerun set. The rerun set is an idempotent region.
In one possible implementation, the rerun set may be searched from the abstract functions according to the parameter local version table PLVMap and the parameter global version table PGVMap. Because the parameter local version table PLVMap and the parameter global version table PGVMap are simple data structures, searching for the rerun set according to them is simple and quick, which can improve the search efficiency of the rerun set.
In one possible implementation manner, when searching the rerun set from the abstract functions according to the parameter local version table PLVMap and the parameter global version table PGVMap, the target abstract function may be added to the rerun set first, and then it is determined whether the first parameter marked as invalid exists in the interface parameters of all abstract functions in the rerun set in the parameter local version table PLVMap.
In the parameter local version table PLVMap, if there is a first parameter (i.e., a parameter marked as invalid, whose indicated memory area may have been destroyed or rewritten) in the interface parameters of all the abstract functions in the rerun set, the rerun set search fails.
In the parameter local version table PLVMap, if the first parameter does not exist in the interface parameters of all the abstract functions in the rerun set, it can be judged whether the second parameter with the version number lower than the version number in the parameter global version table PGVMap exists in the input parameters of all the abstract functions in the rerun set. That is, the second parameter is an input parameter, and the version number of the second parameter in the parameter local version table PLVMap is lower than the version number thereof in the parameter global version table PGVMap.
In the case that second parameters exist in the input parameters of all the abstract functions in the rerun set, the memory areas indicated by input parameters of abstract functions in the rerun set can be considered to have been rewritten or destroyed, and whether first abstract functions whose output parameters include the second parameters exist among the abstract functions can be judged according to the parameter local version table PLVMap. If no first abstract function exists among the abstract functions, the rerun set search fails.
In the case where the first abstract function exists in the abstract functions, it is determined whether the rerun set includes the first abstract function. In the case where there is a first abstract function in the abstract functions and the rerun set does not include the first abstract function, the first abstract function is added to the rerun set.
The process is then re-executed from the step of judging whether a first parameter marked as invalid exists in the interface parameters of all the abstract functions in the rerun set in the parameter local version table PLVMap, until the rerun set includes the first abstract function. For example, assuming that there are 3 second parameters in the input parameters of all the abstract functions in the rerun set, if all the first abstract functions whose output parameters include the respective second parameters are already in the rerun set, the rerun set may be considered to include the first abstract functions.
The above procedure may be referred to as a forward search, by which it can be determined which other abstract functions must be run first to recreate the original input (i.e., correct input data) of the target abstract function in order to re-run the target abstract function.
In the case that no second parameter exists in the input parameters of all the abstract functions in the rerun set, or the rerun set includes the first abstract function, it is judged whether a third parameter, whose version number in the parameter local version table PLVMap is lower than its version number in the parameter global version table PGVMap, exists in the output parameters of all the abstract functions in the rerun set. That is, the third parameter is an output parameter, and the version number of the third parameter in the parameter local version table PLVMap is lower than its version number in the parameter global version table PGVMap.
In the case that a third parameter exists in the output parameters of all the abstract functions in the rerun set, whether a second abstract function whose input parameters include the third parameter exists among the abstract functions is judged according to the parameter local version table PLVMap. If no second abstract function exists among the abstract functions, the rerun set search fails.
In the case where a second abstract function is present in the abstract functions, it is determined whether the rerun set includes the second abstract function. In the event that a second abstract function is present in the abstract functions and the rerun set does not include the second abstract function, the second abstract function is added to the rerun set.
The process is then re-executed from the step of judging whether a first parameter marked as invalid exists in the interface parameters of all the abstract functions in the rerun set in the parameter local version table PLVMap, until the rerun set includes the second abstract function. For example, assuming that there are 2 third parameters in the output parameters of all the abstract functions in the rerun set, if all the second abstract functions whose input parameters include the respective third parameters are already in the rerun set, the rerun set may be considered to include the second abstract functions.
The above procedure may be referred to as a backward search. It determines which abstract functions have to be rerun so that the current memory context is not destroyed, that is, it finds the abstract functions that can restore the memory context to the state recorded in the current parameter local version table PLVMap.
The rerun set search is successful if no third parameter exists among the output parameters of all the abstract functions in the rerun set, or if the rerun set includes a second abstract function.
FIG. 8 shows a schematic diagram of a process of searching for a rerun set according to an embodiment of the present application. As shown in FIG. 8, when searching the abstract functions for the rerun set, step S801 may be performed first to add the target abstract function to the rerun set; step S802 is then performed to determine whether a first parameter marked as invalid in the parameter local version table PLVMap exists among the interface parameters of all the abstract functions in the rerun set;
if the first parameter exists, step S807 is performed, and the rerun set search fails;
if the first parameter does not exist, step S803 is performed to determine whether a second parameter, whose version number in the parameter local version table PLVMap is lower than its version number in the parameter global version table PGVMap, exists among the input parameters of all the abstract functions in the rerun set;
if the second parameter exists, step S804 is performed to determine, according to the parameter local version table PLVMap, whether a first abstract function whose output parameters include the second parameter exists among the abstract functions; if the first abstract function does not exist among the abstract functions, step S807 is performed, and the rerun set search fails; if the first abstract function exists among the abstract functions, step S805 is performed to determine whether the rerun set includes the first abstract function; if the rerun set does not include the first abstract function, step S806 is performed to add the first abstract function to the rerun set, and execution restarts from step S802;
if the second parameter does not exist, or if the rerun set includes the first abstract function, step S808 is performed to determine whether a third parameter, whose version number in the parameter local version table PLVMap is lower than its version number in the parameter global version table PGVMap, exists among the output parameters of all the abstract functions in the rerun set;
if the third parameter does not exist, step S812 is performed, and the rerun set search succeeds;
if the third parameter exists, step S809 is performed to determine, according to the parameter local version table PLVMap, whether a second abstract function whose input parameters include the third parameter exists among the abstract functions; if the second abstract function does not exist among the abstract functions, step S807 is performed, and the rerun set search fails; if the second abstract function exists among the abstract functions, step S810 is performed to determine whether the rerun set includes the second abstract function; if the rerun set does not include the second abstract function, step S811 is performed to add the second abstract function to the rerun set, and execution restarts from step S802;
if the rerun set includes the second abstract function, step S812 is performed, and the rerun set search succeeds.
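The flow of FIG. 8 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not stated in the text: the `Record` class, the function names, and the dictionary layout of PLVMap and PGVMap are hypothetical; each abstract function is assumed to appear once in PLVMap; and a first or second abstract function is assumed to be matched by identical parameter identifier and version number.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical PLVMap entry: parameter versions seen when a function ran."""
    inputs: dict                                 # parameter id -> version read
    outputs: dict                                # parameter id -> version written
    invalid: set = field(default_factory=set)    # parameter ids marked invalid


def _next_function(rerun, plvmap, pgvmap, side, other):
    """Scan stale parameters on `side` ("inputs" or "outputs").

    Returns (ok, func). ok is False when a stale parameter has no matching
    function (S807); func is a function to add, or None if none is missing.
    """
    for f in rerun:
        for pid, ver in getattr(plvmap[f], side).items():
            if ver >= pgvmap[pid]:
                continue                         # parameter is up to date
            # Assumption: match on identical identifier and version number.
            partners = [g for g, rec in plvmap.items()
                        if getattr(rec, other).get(pid) == ver]
            if not partners:
                return False, None
            missing = [g for g in partners if g not in rerun]
            if missing:
                return True, missing[0]
    return True, None


def search_rerun_set(target, plvmap, pgvmap):
    """Return the rerun set as a list, or None when the search fails."""
    rerun = [target]                                              # S801
    while True:
        for f in rerun:                                           # S802
            rec = plvmap[f]
            if rec.invalid & (set(rec.inputs) | set(rec.outputs)):
                return None                                       # S807
        # Forward search (S803-S806), then backward search (S808-S811).
        for side, other in (("inputs", "outputs"), ("outputs", "inputs")):
            ok, func = _next_function(rerun, plvmap, pgvmap, side, other)
            if not ok:
                return None                                       # S807
            if func is not None:
                rerun.append(func)                                # S806 / S811
                break                                             # back to S802
        else:
            return rerun                                          # S812


# Hypothetical example: AF_B reads and overwrites y, so AF_A must rerun first.
plvmap = {
    "AF_A": Record(inputs={"x": 1}, outputs={"y": 1}),
    "AF_B": Record(inputs={"y": 1}, outputs={"y": 2}),
}
pgvmap = {"x": 1, "y": 2}
result = search_rerun_set("AF_B", plvmap, pgvmap)
```

In this hypothetical example the search adds AF_A because the input `y` of AF_B is older than its global version. The loop terminates because every pass either adds a function or returns.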
Step S270, in the case that the rerun set search is successful, running the rerun set to repair the calculation error.
Under the condition that the rerun set search is successful, since the rerun set is an idempotent region, the rerun set can be run to repair the calculation errors, so that the application program can recover the correct calculation process.
In one possible implementation manner, when the rerun set is executed, a local control flow graph corresponding to the rerun set may be generated first according to the parameter local version table PLVMap: according to the identifier of the abstract function and the identifier of the interface parameter of the abstract function recorded by the parameter local version table PLVMap, the input parameter and the output parameter with the same identifier and version number are determined, the abstract function to which the input parameter belongs is determined to run after the abstract function to which the output parameter belongs, and then the local control flow graph corresponding to the re-running set is generated.
For example, the rerun set includes 3 abstract functions: AF_C, AF_D, and AF_E. The interface parameters of the 3 abstract functions recorded at run time in the parameter local version table PLVMap, together with their version numbers, are described by pseudo code as follows:
[Pseudo code reproduced as figures in the original publication: PLVMap records of the abstract functions AF_C, AF_D, and AF_E]
As can be seen from the above pseudo code, the identifier and version number of the output parameter para_d of the abstract function AF_C are the same as the identifier and version number of the input parameter para_d of the abstract function AF_E. The output parameter para_d of AF_C is therefore an input parameter of AF_E, and on the control flow AF_C runs before AF_E. That is, AF_E depends on the running result of AF_C, and AF_C is a precondition for AF_E to run.
Similarly, the identifier and version number of the output parameter para_g of the abstract function AF_D are the same as the identifier and version number of the input parameter para_g of the abstract function AF_E. The output parameter para_g of AF_D is therefore an input parameter of AF_E, and on the control flow AF_D runs before AF_E. That is, AF_E depends on the running result of AF_D, and AF_D is a precondition for AF_E to run.
FIG. 9 shows a schematic diagram of a local control flow graph according to an embodiment of the present application. As shown in FIG. 9, the local control flow graph is the one generated for the above rerun set according to the parameter local version table PLVMap. It includes the abstract functions AF_C, AF_D, and AF_E, together with the identifiers and version numbers of their input parameters and output parameters. On the control flow, the abstract functions AF_C and AF_D run before the abstract function AF_E, and the output parameters para_d1 and para_g1 of AF_C and AF_D serve as the input parameters of AF_E.
In one possible implementation, after obtaining a local control flow graph corresponding to the rerun set, the rerun set may be run according to the local control flow graph.
In this way, the local control flow graph of the rerun set can be generated dynamically from the parameter local version table while the application program is running, so that generating the local control flow graph does not depend on compile-time analysis. The local control flow graph can thus be generated simply and quickly, which in turn improves the running efficiency of the rerun set.
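The derivation of the running order for the AF_C, AF_D, AF_E example can be sketched as follows. The dictionary layout of the PLVMap records and the function name `local_cfg` are assumptions made for illustration, and only the parameters named in the text (para_d and para_g, each at version 1) are included, since the original pseudo code is not reproduced here.

```python
from graphlib import TopologicalSorter

# Hypothetical PLVMap records; other interface parameters are omitted.
plvmap = {
    "AF_C": {"inputs": {}, "outputs": {"para_d": 1}},
    "AF_D": {"inputs": {}, "outputs": {"para_g": 1}},
    "AF_E": {"inputs": {"para_d": 1, "para_g": 1}, "outputs": {}},
}


def local_cfg(rerun_set, plvmap):
    """Map each function to the set of functions that must run before it."""
    preds = {f: set() for f in rerun_set}
    for consumer in rerun_set:
        for pid, ver in plvmap[consumer]["inputs"].items():
            for producer in rerun_set:
                # Same identifier and same version: producer runs first.
                if (producer != consumer
                        and plvmap[producer]["outputs"].get(pid) == ver):
                    preds[consumer].add(producer)
    return preds


preds = local_cfg(["AF_C", "AF_D", "AF_E"], plvmap)
# A topological order of the graph is one valid running order.
order = list(TopologicalSorter(preds).static_order())
```

Here `preds["AF_E"]` contains AF_C and AF_D, so any order produced places AF_E last, which matches the dependency described for FIG. 9.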
In one possible implementation, the application fault-tolerant processing method of the embodiments of the present application may further establish at least one checkpoint during the running of the application program according to a preset checkpointing rule. For example, a checkpoint may be established after the application program has completed initialization and before it starts running; a checkpoint may also be established when the application program runs to a preset location. Those skilled in the art may set the checkpointing rule and the number of checkpoints according to the actual situation, which is not limited in this application. When a checkpoint is established, the memory context required for the application program to run (including but not limited to global variables, stacks, and register values) may be backed up.
In the case of failed rerun set search, a checkpoint closest to the target abstract function may be determined from at least one checkpoint, and the checkpoint may be determined as the target checkpoint, then the memory context corresponding to the target checkpoint (i.e., the memory context backed up when the target checkpoint was established) is used to restore the memory context required for the application to run, and then the application is rerun from the target checkpoint to repair the computing error, so that the application resumes the correct computing process.
In this way, in the event that the rerun collection search fails, the application can be rerun to fix the computational error starting from the checkpoint nearest the target abstract function.
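The checkpoint fallback can be illustrated with a minimal sketch. It assumes that "closest" means the most recent checkpoint established at or before the position of the target abstract function, and that positions are integer step indices established in increasing order. The `CheckpointStore` class and its method names are hypothetical.

```python
import bisect
import copy


class CheckpointStore:
    """Keeps backed-up memory contexts keyed by execution position."""

    def __init__(self):
        self._positions = []   # increasing positions at which checkpoints exist
        self._contexts = []    # backed-up memory context per checkpoint

    def establish(self, position, context):
        # Deep copy so later changes to the live context do not leak in.
        self._positions.append(position)
        self._contexts.append(copy.deepcopy(context))

    def restore_nearest(self, target_position):
        """Return (position, context) of the last checkpoint not after target."""
        i = bisect.bisect_right(self._positions, target_position) - 1
        if i < 0:
            raise LookupError("no checkpoint precedes the target position")
        return self._positions[i], copy.deepcopy(self._contexts[i])


store = CheckpointStore()
live = {"weights": [0]}
store.establish(0, live)       # after initialization
live["weights"][0] = 5
store.establish(5, live)       # at a preset location
live["weights"][0] = 99        # state damaged by a calculation error
pos, ctx = store.restore_nearest(7)
```

With the target abstract function at position 7, the checkpoint at position 5 is chosen and the application would be rerun from there.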
FIG. 10 is a schematic diagram illustrating a processing procedure of the application fault-tolerant processing method according to an embodiment of the present application. As shown in FIG. 10, when the application fault-tolerant processing method of the embodiment of the present application is used, step S1001 is performed first: a checkpoint is established after the application program has completed initialization and before it starts running, where the application program is used to complete a preset computing task; step S1002 is then performed: the application program is run, and during its running the abstract functions and their interface parameters are marked and tracked, and the tracking results may be recorded in the parameter local version table PLVMap and the parameter global version table PGVMap; and in step S1003, it is determined whether a calculation error of the application program is detected;
if no calculation error of the application program is detected, step S1004 is performed, and the calculation result is output after the application program completes the computing task;
if a calculation error of the application program is detected, step S1005 is performed to determine whether the number of calculation errors of the application program is greater than or equal to a preset error count threshold;
if the number of calculation errors of the application program is greater than or equal to the preset error count threshold, step S1006 is performed to determine that the calculation error cannot be repaired and that execution of the computing task fails;
if the number of calculation errors of the application program is smaller than the preset error count threshold, step S1007 is performed to determine, from among the abstract functions, the target abstract function in which the calculation error occurred, and step S1008 is then performed to determine, according to the interface parameters of the target abstract function, whether the target abstract function is idempotent;
if the target abstract function is idempotent, step S1009 is performed to rerun the target abstract function to repair the calculation error, and execution resumes from step S1002 during the rerun;
if the target abstract function is not idempotent, step S1010 is performed to search the abstract functions for a rerun set according to the parameter local version table PLVMap and the parameter global version table PGVMap, and step S1011 determines whether the rerun set search succeeds;
if the rerun set search succeeds, step S1012 is performed to run the rerun set to repair the calculation error, and execution resumes from step S1002 during the run;
if the rerun set search fails, step S1013 is performed to roll back to the checkpoint established in step S1001 and rerun the application program, and execution resumes from step S1002 during the run.
It should be noted that, in fig. 10, only one checkpoint is taken as an example, the processing procedure of the fault-tolerant processing method for an application program in the embodiment of the present application is illustrated, and those skilled in the art may set the number of checkpoints according to the actual situation, which is not limited in this application.
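The decision made after a calculation error is detected (steps S1005 to S1013) can be summarized in a short sketch. The function name, its parameters, and the returned action labels are hypothetical and serve only to make the branching explicit.

```python
def choose_recovery(error_count, error_threshold, target_idempotent, rerun_set):
    """Pick the recovery action for a detected calculation error.

    rerun_set is the result of the rerun set search, or None if it failed.
    It is only consulted when the target abstract function is not idempotent.
    """
    if error_count >= error_threshold:     # S1005 -> S1006
        return "task_failed"
    if target_idempotent:                  # S1008 -> S1009
        return "rerun_target"
    if rerun_set is not None:              # S1011 -> S1012
        return "run_rerun_set"
    return "restore_checkpoint"            # S1013


actions = [
    choose_recovery(3, 3, True, None),
    choose_recovery(1, 3, True, None),
    choose_recovery(1, 3, False, ["AF_B", "AF_A"]),
    choose_recovery(1, 3, False, None),
]
```

The ordering of the checks reflects the cost of each option in the described flow: rerunning one function is tried first, then a rerun set, and a checkpoint rollback only as the last resort.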
Based on the above embodiments, the application fault-tolerant processing method of the embodiments of the present application can reduce the number of checkpoints used during fault-tolerant processing as far as possible by performing idempotency analysis at run time (covering both the target abstract function and the rerun set), thereby improving the performance of fault-tolerant processing.
The application fault-tolerant processing method supports idempotency analysis of single-entry single-exit (SESE) target abstract functions and rerun sets, and also supports idempotency analysis of multiple-entry multiple-exit (MEME) target abstract functions and rerun sets, so that the range of application of fault-tolerant processing based on idempotency analysis can be expanded.
In one possible implementation, the application program of the embodiments of the present application may run on a heterogeneous computing platform including a processor and an accelerator. During the running of the application program, when the abstract functions and their interface parameters are marked and tracked, a kernel function (also referred to as an operator) that the processor launches to the accelerator for execution may be treated as an abstract function.
Specific applications of the application fault tolerance processing method according to the embodiment of the present application will be exemplarily described with reference to fig. 11a, 11b and fig. 12a and 12 b.
FIG. 11a shows a schematic diagram of an application scenario of the application fault-tolerant processing method according to an embodiment of the present application. In the application scenario shown in FIG. 11a, the application fault-tolerant processing method is applied to a CPU-GPU heterogeneous computing architecture based on an Nvidia GPU.
As shown in FIG. 11a, the CPU-GPU heterogeneous computing architecture based on the Nvidia GPU includes a processor CPU 1110, an accelerator GPU 1120, and an I/O device 1130, which are connected by a bus, a network on chip (NoC), or the like. The processor CPU 1110 is physically connected to a main memory 1140, and the accelerator GPU 1120 is physically connected to a device memory 1150, that is, the processor CPU 1110 and the accelerator GPU 1120 each have a separate memory address space.
The computing program run by the GPU is a kernel function based on the compute unified device architecture (CUDA), which may be referred to as a CUDA kernel function or CUDA operator (CUDA kernel). The CUDA kernel function is launched by the CPU to run on the GPU. Before the CUDA kernel function runs, its input data is copied, through the memory copy interface provided by CUDA, to a specified address in the device memory connected to the GPU (such as the device memory 1150 in FIG. 11a). After the CUDA kernel function finishes executing, its output data (that is, its calculation result) is copied to the main memory connected to the CPU (such as the main memory 1140 in FIG. 11a), so that it can take part in other calculation processes on the CPU.
The Nvidia development kit includes the CUDA profiling tools interface (CUPTI). The CUPTI interface can intercept all CUDA kernel function launches and the related input/output data copy operations, and can perform programmable operations such as recording and statistics when intercepting them, thereby enabling evaluation of GPU computing performance.
Therefore, when the application fault-tolerant processing method of the embodiments of the present application is applied to the CPU-GPU heterogeneous computing architecture based on the Nvidia GPU, a CUDA kernel function launched by the CPU to the GPU can be regarded as an abstract function, and the function of the abstract function parameter read-write tracking module can be implemented by encapsulating the CUPTI interface.
FIG. 11b is a schematic diagram showing a processing procedure of the application fault-tolerant processing method according to an embodiment of the present application. As shown in FIG. 11b, when the CPU 1110 runs an application program, a CUDA kernel function that needs to be run by the GPU 1120 may be launched to the GPU 1120, and at the same time the input data of the CUDA kernel function is copied from the main memory 1140 to the device memory 1150 through the memory copy interface provided by CUDA, for use by the CUDA kernel function while it runs. After the GPU 1120 has run the CUDA kernel function, the output data may be stored in the device memory 1150 and copied from the device memory 1150 to the main memory 1140 through the memory copy interface provided by CUDA.
While the CPU 1110 runs the application program, the encapsulated CUPTI interface 1160 (which implements the function of the abstract function parameter read-write tracking module in the embodiments of the present application) may mark each CUDA kernel function launched by the CPU 1110 to run on the GPU 1120 as an abstract function, mark the interface parameters of the abstract function, track the read and write changes of the interface parameters, and record the tracking results in the parameter global version table and the parameter local version table.
When a calculation error occurs while the GPU 1120 runs a CUDA kernel function, a message indicating the calculation error may be sent to the CPU 1110. After receiving the message, the CPU 1110 may determine the CUDA kernel function in which the calculation error occurred (that is, the target abstract function) and analyze its idempotency. If that CUDA kernel function is idempotent, it is launched to the GPU again to run, so as to repair the calculation error.
If the CUDA kernel function in which the calculation error occurred is not idempotent, the CPU 1110 searches the abstract functions for a rerun set according to the parameter global version table and the parameter local version table, using the method described in the above embodiments. If the rerun set search succeeds, the rerun set is run to repair the calculation error. If the rerun set search fails, the calculation error is repaired by means of a checkpoint.
In this example, idempotency determination and rerun set search for kernel functions are implemented on the basis of the CUPTI interface of the Nvidia GPU. Since typically more than 80% of the kernel functions used are idempotent, it is not necessary to checkpoint the input data before an idempotent kernel function executes. The checkpointing granularity can therefore be extended from every kernel function, as is usual, to an entire computing task consisting of multiple kernel functions, for example one complete iteration in deep neural network computation. This reduces checkpointing overhead from above 10% to below 1% without loss of computational fault tolerance.
This example enables dynamic idempotency analysis at run time without static compilation and idempotency analysis in advance, which reduces dependence on source code and development tools. It requires no change to existing hardware, places no additional requirements on the processor instruction set, does not restrict the form of idempotent regions (target abstract functions and rerun sets), and supports analysis of idempotent regions with multiple inputs and multiple outputs, so it has better universality. In addition, this example can further increase the fault tolerance granularity and reduce the number of checkpoints.
FIG. 12a shows a schematic diagram of an application scenario of the application fault-tolerant processing method according to an embodiment of the present application. In the application scenario shown in FIG. 12a, the application fault-tolerant processing method of the embodiments of the present application is applied to a computing platform based on an Ascend processor (such as the Ascend 910/310).
As shown in FIG. 12a, in the computing platform based on the Ascend processor, the Ascend processor includes a plurality of CPU cores 1210 and a plurality of neural network processor cores 1220 (artificial intelligence cores, AI cores). The CPU cores 1210 and the AI cores 1220 share the same set of memory devices connected to the memory controller (that is, the shared memory 1240 shown in FIG. 12a) and share the same physical memory address space. The CPU cores 1210, the AI cores 1220, the shared memory 1240, and the I/O device 1230 are connected by a bus, a network on chip (NoC), or the like.
The computing platform based on the Ascend processor also encapsulates specific computation steps as kernel functions, but on the Ascend platform the kernel functions are called tensor boost engine operators (TBE kernels), abbreviated as TBE operators. A TBE operator has well-defined input and output parameters. A TBE operator may be launched by calling the Ascend computing language (ACL) runtime library.
Since the CPU cores and the AI cores of the computing platform based on the Ascend processor share physical memory, it is no longer necessary to copy input data and output data back and forth between main memory and device memory when a TBE operator is launched. However, to facilitate programming and memory management, the CPU cores, which control the task flow, and the AI cores, which perform the specific neural network computation, operate on different virtual address spaces through different memory management interfaces. For the input data given when a TBE operator is launched, the input data is not actually copied from main memory to device memory; instead, an address mapping is established between the address of the input data in the virtual main memory space and its address in the virtual device memory space. The physical memory space storing the input data and the output data is actually allocated by the ACL runtime library, and can be accessed both through the corresponding address in the virtual main memory space and through the corresponding address in the virtual device memory space. Therefore, the ACL runtime library can in fact observe all calls to TBE operators, the memory address mappings of their input data and output data, and the read and write accesses.
Therefore, when the application fault-tolerant processing method of the embodiments of the present application is applied to the computing platform based on the Ascend processor, a TBE operator launched by a CPU core to an AI core can be regarded as an abstract function, and the function of the abstract function parameter read-write tracking module can be implemented by encapsulating the corresponding interface of the ACL runtime library (hereinafter referred to as the ACL runtime interface).
FIG. 12b is a schematic diagram illustrating a processing procedure of the application fault-tolerant processing method according to an embodiment of the present application. As shown in FIG. 12b, in the computing platform based on the Ascend processor, the Ascend processor includes a plurality of CPU cores 1210 and a plurality of AI cores, where the AI cores include a plurality of AI GPU cores 1221 and a plurality of AI CPU cores 1222.
When the CPU core 1210 runs an application program, a TBE operator that needs to be run by an AI core (for example, the AI GPU core 1221 or the AI CPU core 1222 in FIG. 12b) may be launched to the AI core. At the same time, a mapping between the address of the input data in the virtual main memory space 1250 and the address of the input data in the virtual device memory space 1260 is established through the ACL runtime interface 1270, so that the CPU core 1210 can access the input data in the shared memory 1240 through its address in the virtual main memory space 1250, and the AI core can access the input data in the shared memory 1240 through its address in the virtual device memory space 1260. The shared memory 1240 is a physical memory, such as a double data rate synchronous dynamic random access memory (DDR) or a high bandwidth memory (HBM).
After the AI core has run the TBE operator, memory may be allocated for the output data in the shared memory 1240 through the ACL runtime interface 1270, and the output data may be stored in the allocated memory. At the same time, a mapping between the address of the output data in the virtual main memory space 1250 and the address of the output data in the virtual device memory space 1260 is established, so that the CPU core 1210 can access the output data in the shared memory 1240 through its address in the virtual main memory space 1250, and the AI core can access the output data in the shared memory 1240 through its address in the virtual device memory space 1260.
While the CPU core 1210 runs the application program, the encapsulated ACL runtime interface 1270 (which implements the function of the abstract function parameter read-write tracking module in the embodiments of the present application) may mark each TBE operator launched by the CPU core 1210 to run on an AI core as an abstract function, mark the interface parameters of the abstract function, track the read and write changes of the interface parameters, and record the tracking results in the parameter global version table and the parameter local version table.
When a calculation error occurs while an AI core runs a TBE operator, a message indicating the calculation error may be sent to the CPU core 1210. After receiving the message, the CPU core 1210 may determine the TBE operator in which the calculation error occurred (that is, the target abstract function) and analyze its idempotency. If that TBE operator is idempotent, it is launched to the AI core again to run, so as to repair the calculation error.
If the TBE operator in which the calculation error occurred is not idempotent, the CPU core 1210 searches the abstract functions for a rerun set according to the parameter global version table and the parameter local version table, using the method described in the above embodiments. If the rerun set search succeeds, the rerun set is run to repair the calculation error. If the rerun set search fails, the calculation error is repaired by means of a checkpoint.
In this example, idempotency determination and rerun set search for TBE operators are implemented on the basis of the ACL runtime interface of the computing platform based on the Ascend processor. Since most TBE operators are idempotent, when the application fault-tolerant processing method of the embodiments of the present application is used, the checkpointing granularity can be extended to an entire computing task consisting of multiple TBE operators, so that checkpointing overhead can be reduced from above 10% to below 1% without loss of computational fault tolerance.
Based on the above examples, when the application fault-tolerant processing method of the embodiment of the application is applied to a heterogeneous computing platform with a shared memory address space, the memory data copy can be avoided by establishing address mappings from the virtual main memory space and the virtual device memory space to the shared physical memory space, thereby improving the computing efficiency.
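The copy-free access through two virtual address spaces can be illustrated with a small Python model. This is a conceptual sketch only: the `SharedPhysicalMemory` class, its methods, and the example addresses are hypothetical and do not represent the ACL runtime interface.

```python
class SharedPhysicalMemory:
    """Models one physical buffer reachable from two virtual address spaces."""

    def __init__(self, size):
        self._mem = bytearray(size)
        self._next = 0
        self.host_map = {}     # virtual main memory address -> (offset, size)
        self.device_map = {}   # virtual device memory address -> (offset, size)

    def allocate(self, host_addr, device_addr, size):
        """Reserve a region and map both virtual addresses onto it."""
        if self._next + size > len(self._mem):
            raise MemoryError("shared memory exhausted")
        region = (self._next, size)
        self._next += size
        self.host_map[host_addr] = region
        self.device_map[device_addr] = region

    def _view(self, table, addr):
        offset, size = table[addr]
        return memoryview(self._mem)[offset:offset + size]

    def host_view(self, host_addr):
        return self._view(self.host_map, host_addr)

    def device_view(self, device_addr):
        return self._view(self.device_map, device_addr)


mem = SharedPhysicalMemory(64)
mem.allocate(host_addr=0x1000, device_addr=0x9000, size=4)
mem.host_view(0x1000)[:] = b"abcd"         # CPU core writes input data
seen_by_device = bytes(mem.device_view(0x9000))
```

The data written through the host-side address is visible through the device-side address without any copy, because both mappings resolve to the same physical region.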
The application scenario of the application fault-tolerant processing method of the embodiments of the present application is not limited to any specific processor core architecture, physical memory architecture, or organization of the memory address space, as long as the software architecture of the computing platform provides an interface capable of recording calls to abstract functions (kernel functions or operators) and the addresses or identifiers of their input and output parameters.
The application fault-tolerant processing method of the embodiments of the present application can be applied to computing hardware fault tolerance in most computation-intensive and data-intensive scenarios, such as Atlas series boards and server products among Ascend solution products, cloud computing service products based on Ascend processors, big data storage and processing scenarios, and video data processing scenarios; it can also be applied to computation-intensive or data-intensive application scenarios based on X86 or Nvidia GPU computing platforms.
It should be noted that, a specific application scenario of the application fault tolerance processing method of the embodiment of the present application may be determined by a person skilled in the art according to actual situations, which is not limited in this application.
FIG. 13 illustrates a block diagram of an application fault tolerance processing device according to an embodiment of the present application. As shown in fig. 13, the application fault tolerance processing device includes:
A first operation module 1310, configured to operate an application program, where the application program is used to complete a preset computing task;
a marking and tracking module 1320, configured to mark and track an abstract function and an interface parameter of the abstract function during the running process of the application, where the abstract function is used to indicate a function module at a preset level in the application, the interface parameter of the abstract function includes an input parameter and an output parameter, the input parameter is a memory area storing input data of the function module, and the output parameter is a memory area storing output data of the function module;
the target abstract function determining module 1330 is configured to determine, from among the abstract functions, a target abstract function in which a calculation error occurs, if it is detected that the application program has the calculation error;
an idempotency judging module 1340, configured to judge whether the target abstract function has idempotency according to the interface parameters of the target abstract function;
and a second execution module 1350, configured to re-execute the target abstract function to repair the computing error if the target abstract function has idempotent property.
In one possible implementation, the interface parameter has a version number, where the version number is used to indicate a data storage status of the interface parameter, and the marking and tracking module 1320 is configured to: during the running process of the application program, identifying the functional modules in the application program according to the preset level, and marking the identified functional modules as abstract functions; determining interface parameters of the abstract function and version numbers of the interface parameters; and updating a parameter global version table and a parameter local version table according to the interface parameters of the abstract functions and the version numbers of the interface parameters, wherein the parameter global version table is used for recording the latest version numbers of the interface parameters of all abstract functions of the application program, and the parameter local version table is used for recording the version numbers of the interface parameters when all abstract functions of the application program run.
In one possible implementation manner, the updating the global version table and the local version table according to the interface parameter of the abstract function and the version number of the interface parameter includes: and adding the identifier of the abstract function, the identifier and the version number of the interface parameter of the abstract function into a parameter local version table.
In one possible implementation manner, the updating the global version table and the local version table according to the interface parameter of the abstract function and the version number of the interface parameter includes: updating the version number of the interface parameter of the abstract function in the parameter global version table under the condition that the interface parameter of the abstract function exists in the parameter global version table; or under the condition that the interface parameters of the abstract function do not exist in the parameter global version table, adding the identifier and the version number of the interface parameters of the abstract function into the parameter global version table.
In one possible implementation manner, the updating the global version table and the local version table according to the interface parameter of the abstract function and the version number of the interface parameter includes: updating the version number of the interface parameter in the parameter global version table under the condition that the interface parameter is modified by other operations outside the abstract function of the application program; and marking the interface parameters with version numbers lower than the version numbers in the parameter global version table in the parameter local version table as invalid.
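The table updates described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the class and field names (`VersionTables`, `LocalEntry`, `record_call`, `record_external_write`) are hypothetical, and the rule that every write increments a parameter's version by one is an assumption, since the text only states that the version number indicates the data storage status.

```python
from dataclasses import dataclass, field


@dataclass
class LocalEntry:
    """One row of the parameter local version table (hypothetical layout)."""
    function_id: str
    inputs: dict                      # parameter identifier -> version read
    outputs: dict                     # parameter identifier -> version written
    invalid: set = field(default_factory=set)


class VersionTables:
    def __init__(self):
        self.global_table = {}        # parameter identifier -> latest version
        self.local_table = []         # one LocalEntry per abstract function run

    def record_call(self, function_id, input_ids, output_ids):
        """Track one run of an abstract function in both tables."""
        # Inputs are read at their current version (0 if never seen before).
        inputs = {p: self.global_table.setdefault(p, 0) for p in input_ids}
        outputs = {}
        for p in output_ids:
            # Assumption: writing a parameter produces its next version.
            self.global_table[p] = self.global_table.get(p, -1) + 1
            outputs[p] = self.global_table[p]
        entry = LocalEntry(function_id, inputs, outputs)
        self.local_table.append(entry)
        return entry

    def record_external_write(self, param_id):
        """An operation outside any abstract function modified the parameter."""
        self.global_table[param_id] = self.global_table.get(param_id, -1) + 1
        latest = self.global_table[param_id]
        # Local rows that recorded an older version of it are marked invalid.
        for entry in self.local_table:
            for versions in (entry.inputs, entry.outputs):
                if param_id in versions and versions[param_id] < latest:
                    entry.invalid.add(param_id)
```

In this sketch, a parameter marked invalid can no longer be regenerated by rerunning abstract functions, because its latest value came from code that is not tracked.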
In one possible implementation, the idempotency judging module 1340 is configured to: judge whether the same identifier appears more than once among the identifiers of the interface parameters of the target abstract function in the parameter local version table; and, when no identifier is repeated, determine that the target abstract function has idempotency.
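One reading of this check is that a function is idempotent when no memory area serves as both an input parameter and an output parameter, so a rerun sees the same inputs as the original run. The sketch below follows that reading; the function name `is_idempotent` and the use of plain strings as parameter identifiers are illustrative assumptions.

```python
def is_idempotent(input_ids, output_ids):
    """Return True if no parameter identifier is both read and written."""
    return set(input_ids).isdisjoint(output_ids)


# y = A * x writes a separate buffer, so it can simply be rerun.
separate_output = is_idempotent(["A", "x"], ["y"])
# x = A * x overwrites its own input, so a rerun would read modified data.
in_place_update = is_idempotent(["A", "x"], ["x"])
```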
In one possible implementation, the apparatus further includes: a rerun set search module, configured to search the abstract functions for a rerun set when the target abstract function does not have idempotency, where the rerun set indicates the minimum set of abstract functions that must be rerun to fix the computing error; and a third operation module, configured to run the rerun set to repair the calculation error when the rerun set search is successful.
In one possible implementation, the rerun set search module is configured to: search for the rerun set in the abstract functions according to the parameter global version table and the parameter local version table.
In one possible implementation manner, the searching the rerun set in the abstract function according to the parameter global version table and the parameter local version table includes: adding the target abstract function into a rerun set; judging whether first parameters marked as invalid exist in interface parameters of all abstract functions in the rerun set in the parameter local version table; judging whether second parameters with version numbers lower than those in the parameter global version table exist in the input parameters of all abstract functions in the rerun set under the condition that the first parameters do not exist; judging whether a first abstract function with output parameters including the second parameters exists in the abstract function according to the parameter local version table under the condition that the second parameters exist; judging whether the rerun set comprises the first abstract function or not under the condition that the first abstract function exists in the abstract functions; adding the first abstract function to the rerun set if the rerun set does not include the first abstract function; and starting to re-execute from the step of judging whether the first parameter marked as invalid exists in the interface parameters of all the abstract functions in the re-running set until the re-running set comprises the first abstract function.
In one possible implementation manner, the searching the rerun set in the abstract function according to the parameter global version table and the parameter local version table includes: judging whether third parameters with version numbers lower than those in the parameter global version table exist in the output parameters of all abstract functions in the rerun set under the condition that the second parameters do not exist or the rerun set comprises the first abstract function; judging whether a second abstract function with input parameters including the third parameter exists in the abstract function according to the parameter local version table under the condition that the third parameter exists; judging whether the rerun set comprises the second abstract function or not under the condition that the second abstract function exists in the abstract functions; adding the second abstract function to the rerun set if the rerun set does not include the second abstract function; and starting to re-execute from the step of judging whether the first parameter marked as invalid exists in the interface parameters of all the abstract functions in the re-running set until the re-running set comprises the second abstract function.
In one possible implementation manner, the searching the rerun set in the abstract function according to the parameter global version table and the parameter local version table includes: determining that the rerun set search is successful when the third parameter does not exist or when the rerun set already includes the second abstract function.
In one possible implementation manner, the searching the rerun set in the abstract function according to the parameter global version table and the parameter local version table includes any one of the following: if the first parameter exists, the rerun set search fails; if the first abstract function does not exist among the abstract functions, the rerun set search fails; if the second abstract function does not exist among the abstract functions, the rerun set search fails.
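The search steps above amount to a fixed-point iteration, which can be sketched as follows. The table layout (a list of dictionaries with `id`, `inputs`, `outputs`, and `invalid` keys) and the function names are hypothetical. The sketch also assumes that each tracked run has a unique identifier, that every parameter appears in the global table, and that a producer or consumer is matched on both the parameter identifier and the recorded version; the text only requires that the other function's parameters include the stale parameter.

```python
def _missing_partners(members, local_table, global_table, role, partner_role, rerun):
    """Collect functions tied to stale parameters; None means the search fails."""
    found = set()
    for entry in members:
        for pid, version in entry[role].items():
            if version >= global_table[pid]:
                continue  # the memory area still holds the version this run used
            partners = {other["id"] for other in local_table
                        if other[partner_role].get(pid) == version}
            if not partners:
                return None  # no abstract function can regenerate or consume it
            found |= partners - rerun
    return found


def search_rerun_set(target_id, local_table, global_table):
    """Return the rerun set as a set of function ids, or None on failure."""
    by_id = {entry["id"]: entry for entry in local_table}
    rerun = {target_id}
    while True:
        members = [by_id[fid] for fid in rerun]
        # First parameter: anything marked invalid makes the search fail.
        if any(entry["invalid"] for entry in members):
            return None
        grew = False
        # Stale inputs (second parameter) need their producer (first function);
        # stale outputs (third parameter) need their consumer (second function).
        for role, partner_role in (("inputs", "outputs"), ("outputs", "inputs")):
            found = _missing_partners(members, local_table, global_table,
                                      role, partner_role, rerun)
            if found is None:
                return None
            if found:
                rerun |= found
                grew = True
                break  # restart from the invalid-parameter check
        if not grew:
            return rerun
```

For example, if `g` writes version 1 of `x` and `f` then reads that version and overwrites `x` with version 2, a fault in `f` yields the rerun set `{g, f}`: the input that `f` needs no longer exists in memory and must be regenerated by `g`.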
In one possible implementation manner, the third operation module is configured to: generating a local control flow graph corresponding to the rerun set according to the parameter local version table; and operating the rerun set according to the local control flow graph.
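The text does not specify how the local control flow graph is built. As one hedged possibility, the sketch below derives a producer-to-consumer dependency graph from the local version table rows of the rerun set and orders the reruns topologically with Python's standard `graphlib` module. The row layout and the name `rerun_order` are assumptions, and a dependency graph is a simplification of a full control flow graph.

```python
from graphlib import TopologicalSorter


def rerun_order(rerun_entries):
    """Order the rerun set so each producer runs before its consumers."""
    # Map each function id to the ids it depends on.
    deps = {entry["id"]: set() for entry in rerun_entries}
    for consumer in rerun_entries:
        for pid, version in consumer["inputs"].items():
            for producer in rerun_entries:
                # A producer wrote exactly the version the consumer read.
                if producer is not consumer and producer["outputs"].get(pid) == version:
                    deps[consumer["id"]].add(producer["id"])
    return list(TopologicalSorter(deps).static_order())
```

`TopologicalSorter.static_order()` raises `graphlib.CycleError` if the recorded dependencies are cyclic, which in this sketch would signal an inconsistent version table.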
In one possible implementation, the apparatus further includes: the checkpointing module is used for setting up at least one checkpoint according to preset checkpointing rules in the running process of the application program; the target check point determining module is used for determining a target check point closest to the target abstract function from the at least one check point under the condition that the rerun set search fails; and the fourth running module is used for starting from the target check point and re-running the application program so as to repair the computing error.
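The checkpoint fallback can be illustrated with a small lookup. The sketch assumes that checkpoints and abstract function runs are identified by their position in the execution sequence, and that "closest" means the most recent checkpoint at or before the target abstract function, since restoring a later checkpoint would not re-execute the faulty function. The name `nearest_checkpoint` is hypothetical.

```python
import bisect


def nearest_checkpoint(checkpoint_positions, target_position):
    """Return the latest checkpoint at or before the target, or None.

    checkpoint_positions must be sorted in ascending order.
    """
    index = bisect.bisect_right(checkpoint_positions, target_position)
    return checkpoint_positions[index - 1] if index else None
```

If no checkpoint precedes the target, the sketch returns `None`, which a caller could treat as a signal to restart the application from the beginning.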
In one possible implementation, the application runs on a heterogeneous computing platform including a processor and an accelerator, and the marking and tracking module 1320 is configured to: during the running of the application, the kernel function that the processor transmits to the accelerator is marked as an abstract function.
The embodiment of the application provides an application fault tolerance processing device, which comprises: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions.
Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
Embodiments of the present application provide a computer program product comprising a computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical coding device such as punch cards or in-groove protrusion structures having instructions stored thereon, and any suitable combination of the foregoing.
The computer readable program instructions or code described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (e.g., through the internet using an internet service provider). In some embodiments, aspects of the present application are implemented by using state information of the computer readable program instructions to personalize electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA).
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by hardware (e.g., circuits or ASICs (Application Specific Integrated Circuit, application specific integrated circuits)) which perform the corresponding functions or acts, or combinations of hardware and software, such as firmware, etc.
Although the invention is described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (33)

1. An application fault tolerance processing method, the method comprising:
running an application program, wherein the application program is used for completing a preset calculation task;
in the running process of the application program, marking and tracking an abstract function and interface parameters of the abstract function, wherein the abstract function is used for indicating a function module with a preset level in the application program, the interface parameters of the abstract function comprise input parameters and output parameters, the input parameters are memory areas for storing input data of the function module, and the output parameters are memory areas for storing output data of the function module;
under the condition that the computing error of the application program is detected, determining a target abstract function with the computing error from the abstract functions;
judging whether the target abstract function has idempotency or not according to the interface parameters of the target abstract function;
and re-running the target abstract function to repair the computing error under the condition that the target abstract function has idempotency.
2. The method of claim 1, wherein the interface parameter has a version number indicating a data storage status of the interface parameter,
the marking and tracking an abstract function and interface parameters of the abstract function comprises:
during the running process of the application program, identifying the functional modules in the application program according to the preset level, and marking the identified functional modules as abstract functions;
determining interface parameters of the abstract function and version numbers of the interface parameters;
updating a parameter global version table and a parameter local version table according to the interface parameters of the abstract function and the version numbers of the interface parameters,
wherein the parameter global version table is used for recording the latest version numbers of interface parameters of all abstract functions of the application program,
the parameter local version table is used for recording the version numbers of the interface parameters of the application program when each abstract function runs.
3. The method according to claim 2, wherein updating the global version table and the local version table of parameters according to the interface parameters of the abstract function and the version numbers of the interface parameters comprises:
and adding the identifier of the abstract function, the identifier and the version number of the interface parameter of the abstract function into a parameter local version table.
4. A method according to claim 2 or 3, wherein updating the global version table and the local version table of parameters according to the interface parameters of the abstract function and the version numbers of the interface parameters comprises:
updating the version number of the interface parameter of the abstract function in the parameter global version table under the condition that the interface parameter of the abstract function exists in the parameter global version table; or
adding the identifier and the version number of the interface parameter of the abstract function into the parameter global version table under the condition that the interface parameter of the abstract function does not exist in the parameter global version table.
5. The method according to any one of claims 2-4, wherein updating the global version table and the local version table of parameters according to the interface parameters of the abstract function and the version numbers of the interface parameters comprises:
updating the version number of the interface parameter in the parameter global version table under the condition that the interface parameter is modified by other operations outside the abstract function of the application program;
and marking the interface parameters with version numbers lower than the version numbers in the parameter global version table in the parameter local version table as invalid.
6. The method according to any one of claims 2-5, wherein the determining whether the target abstract function has idempotency according to the interface parameters of the target abstract function comprises:
judging whether the same identifier appears more than once among the identifiers of the interface parameters of the target abstract function in the parameter local version table;
determining that the target abstract function has idempotency when no identifier is repeated.
7. The method according to any one of claims 2-6, further comprising:
searching the abstract functions for a rerun set in the case that the target abstract function does not have idempotency, wherein the rerun set indicates the minimum set of abstract functions that must be rerun to fix the computing error;
and under the condition that the rerun set search is successful, the rerun set is operated to repair the calculation error.
8. The method of claim 7, wherein searching for a rerun set in the abstract function comprises:
searching a rerun set in the abstract function according to the parameter global version table and the parameter local version table.
9. The method of claim 8, wherein searching for a rerun set in the abstract function according to the parametric global version table and the parametric local version table comprises:
adding the target abstract function into a rerun set;
judging whether first parameters marked as invalid exist in interface parameters of all abstract functions in the rerun set in the parameter local version table;
judging whether second parameters with version numbers lower than those in the parameter global version table exist in the input parameters of all abstract functions in the rerun set under the condition that the first parameters do not exist;
judging whether a first abstract function with output parameters including the second parameters exists in the abstract function according to the parameter local version table under the condition that the second parameters exist;
judging whether the rerun set comprises the first abstract function or not under the condition that the first abstract function exists in the abstract functions;
adding the first abstract function to the rerun set if the rerun set does not include the first abstract function;
and starting to re-execute from the step of judging whether the first parameter marked as invalid exists in the interface parameters of all the abstract functions in the re-running set until the re-running set comprises the first abstract function.
10. The method of claim 9, wherein searching for a rerun set in the abstract function according to the parametric global version table and the parametric local version table comprises:
judging whether third parameters with version numbers lower than those in the parameter global version table exist in the output parameters of all abstract functions in the rerun set under the condition that the second parameters do not exist or the rerun set comprises the first abstract function;
judging whether a second abstract function with input parameters including the third parameter exists in the abstract function according to the parameter local version table under the condition that the third parameter exists;
judging whether the rerun set comprises the second abstract function or not under the condition that the second abstract function exists in the abstract functions;
adding the second abstract function to the rerun set if the rerun set does not include the second abstract function;
and starting to re-execute from the step of judging whether the first parameter marked as invalid exists in the interface parameters of all the abstract functions in the re-running set until the re-running set comprises the second abstract function.
11. The method of claim 10, wherein searching for a rerun set in the abstract function according to the parametric global version table and the parametric local version table comprises:
determining that the rerun set search is successful when the third parameter does not exist or when the rerun set already includes the second abstract function.
12. The method according to claim 10, wherein searching the rerun set in the abstract function according to the parameter global version table and the parameter local version table comprises any one of the following:
if the first parameter exists, the rerun set search fails;
if the first abstract function does not exist among the abstract functions, the rerun set search fails;
if the second abstract function does not exist among the abstract functions, the rerun set search fails.
13. The method of claim 7, wherein the running the rerun set comprises:
generating a local control flow graph corresponding to the rerun set according to the parameter local version table;
and running the rerun set according to the local control flow graph.
14. The method according to any one of claims 7-13, characterized in that the method further comprises:
in the running process of the application program, at least one check point is established according to a preset check point establishing rule;
under the condition that the rerun set search fails, determining a target check point nearest to the target abstract function from the at least one check point;
starting from the target checkpoint, re-running the application to fix the computing error.
15. The method of any one of claims 1-14, wherein the application runs on a heterogeneous computing platform comprising a processor and an accelerator,
the marking and tracking abstract functions and interface parameters of the abstract functions comprises:
during the running of the application, the kernel function that the processor transmits to the accelerator is marked as an abstract function.
16. An application fault tolerant processing apparatus, the apparatus comprising:
the first operation module is used for operating an application program, and the application program is used for completing a preset calculation task;
the marking and tracking module is used for marking and tracking an abstract function and interface parameters of the abstract function in the running process of the application program, wherein the abstract function is used for indicating a function module with a preset level in the application program, the interface parameters of the abstract function comprise input parameters and output parameters, the input parameters are memory areas for storing input data of the function module, and the output parameters are memory areas for storing output data of the function module;
the target abstract function determining module is used for determining a target abstract function with the calculation error from the abstract functions under the condition that the calculation error of the application program is detected;
the idempotent judging module is used for judging whether the target abstract function has idempotent according to the interface parameters of the target abstract function;
and the second operation module is used for rerunning the target abstract function to repair the calculation error under the condition that the target abstract function has idempotency.
17. The apparatus of claim 16, wherein the interface parameter has a version number indicating a data storage status of the interface parameter,
the marking and tracking module is configured to:
during the running process of the application program, identifying the functional modules in the application program according to the preset level, and marking the identified functional modules as abstract functions;
determining interface parameters of the abstract function and version numbers of the interface parameters;
updating a parameter global version table and a parameter local version table according to the interface parameters of the abstract function and the version numbers of the interface parameters,
wherein the parameter global version table is used for recording the latest version numbers of interface parameters of all abstract functions of the application program,
the parameter local version table is used for recording the version numbers of the interface parameters of the application program when each abstract function runs.
18. The apparatus of claim 17, wherein updating the global version table and the local version table of parameters based on the interface parameters of the abstract function and the version numbers of the interface parameters comprises:
and adding the identifier of the abstract function, the identifier and the version number of the interface parameter of the abstract function into a parameter local version table.
19. The apparatus according to claim 17 or 18, wherein updating the parameter global version table and the parameter local version table according to the interface parameters of the abstract function and the version numbers of the interface parameters comprises:
updating the version number of the interface parameter of the abstract function in the parameter global version table under the condition that the interface parameter of the abstract function exists in the parameter global version table; or
adding the identifier and the version number of the interface parameter of the abstract function into the parameter global version table under the condition that the interface parameter of the abstract function does not exist in the parameter global version table.
20. The apparatus according to any one of claims 17-19, wherein updating the global version table and the local version table of parameters according to the interface parameters of the abstract function and the version numbers of the interface parameters comprises:
updating the version number of the interface parameter in the parameter global version table under the condition that the interface parameter is modified by other operations outside the abstract function of the application program;
and marking the interface parameters with version numbers lower than the version numbers in the parameter global version table in the parameter local version table as invalid.
21. The apparatus of any of claims 17-20, wherein the idempotency determination module is configured to:
judging whether the same identifier appears more than once among the identifiers of the interface parameters of the target abstract function in the parameter local version table;
determining that the target abstract function has idempotency when no identifier is repeated.
22. The apparatus according to any one of claims 17-21, wherein the apparatus further comprises:
a rerun set search module, configured to search the abstract functions for a rerun set in the event that the target abstract function does not have idempotency, wherein the rerun set indicates the minimum set of abstract functions that must be rerun to fix the computing error;
and the third operation module is used for operating the rerun set to repair the calculation error under the condition that the rerun set search is successful.
23. The apparatus of claim 22, wherein the rerun set search module is configured to:
searching a rerun set in the abstract function according to the parameter global version table and the parameter local version table.
24. The apparatus of claim 23, wherein searching for a rerun set in the abstract function according to the parametric global version table and the parametric local version table comprises:
adding the target abstract function into a rerun set;
judging whether first parameters marked as invalid exist in interface parameters of all abstract functions in the rerun set in the parameter local version table;
judging whether second parameters with version numbers lower than those in the parameter global version table exist in the input parameters of all abstract functions in the rerun set under the condition that the first parameters do not exist;
judging whether a first abstract function with output parameters including the second parameters exists in the abstract function according to the parameter local version table under the condition that the second parameters exist;
judging whether the rerun set comprises the first abstract function or not under the condition that the first abstract function exists in the abstract functions;
adding the first abstract function to the rerun set if the rerun set does not include the first abstract function;
and starting to re-execute from the step of judging whether the first parameter marked as invalid exists in the interface parameters of all the abstract functions in the re-running set until the re-running set comprises the first abstract function.
25. The apparatus of claim 24, wherein searching for a rerun set in the abstract function according to the parameter global version table and the parameter local version table comprises:
in the case where the second parameter does not exist or the rerun set includes the first abstract function, judging whether a third parameter whose version number is lower than its version number in the parameter global version table exists among the output parameters of all abstract functions in the rerun set;
in the case where the third parameter exists, judging, according to the parameter local version table, whether a second abstract function whose input parameters include the third parameter exists among the abstract functions;
in the case where the second abstract function exists among the abstract functions, judging whether the rerun set includes the second abstract function;
in the case where the rerun set does not include the second abstract function, adding the second abstract function to the rerun set;
and re-executing from the step of judging whether a first parameter marked as invalid exists among the interface parameters of all abstract functions in the rerun set, until the rerun set includes the second abstract function.
26. The apparatus of claim 25, wherein searching for a rerun set in the abstract function according to the parameter global version table and the parameter local version table comprises:
determining that the rerun set search is successful in the case where the third parameter does not exist or the rerun set includes the second abstract function.
27. The apparatus of claim 25, wherein searching for a rerun set in the abstract function according to the parameter global version table and the parameter local version table comprises any one of:
in the case where the first parameter exists, the rerun set search fails;
in the case where the first abstract function does not exist among the abstract functions, the rerun set search fails;
in the case where the second abstract function does not exist among the abstract functions, the rerun set search fails.
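The search procedure of claims 24 to 27 can be read as a fixed-point loop, sketched below in Python. This is one interpretation under stated assumptions rather than the claimed implementation: the `Entry` layout and the function name are hypothetical, and the sketch assumes that a producer or consumer is matched on both the parameter identifier and the recorded version number, which the claims do not spell out.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One row of a hypothetical parameter local version table."""
    func: str
    param: str
    role: str          # "in" or "out"
    version: int
    valid: bool = True


def search_rerun_set(target, local_table, global_versions):
    """Return the rerun set as a list, or None when the search fails."""
    rerun = [target]
    while True:
        entries = [e for e in local_table if e.func in rerun]
        if any(not e.valid for e in entries):
            return None                      # a "first parameter" exists
        added = False
        # Stale inputs need their producer; stale outputs need their consumer.
        for role, partner_role in (("in", "out"), ("out", "in")):
            stale = sorted({(e.param, e.version) for e in entries
                            if e.role == role
                            and e.version < global_versions.get(e.param, 0)})
            for param, version in stale:
                # Assumption: partners match on identifier and version.
                partners = [e.func for e in local_table
                            if e.role == partner_role
                            and (e.param, e.version) == (param, version)]
                if not partners:
                    return None              # no first/second abstract function
                missing = [f for f in partners if f not in rerun]
                if missing:
                    rerun.append(missing[0])
                    added = True
                    break
            if added:
                break                        # restart from the validity check
        if not added:
            return rerun                     # search succeeded


# A writes x (v1); B reads x (v1) and overwrites it in place (v2).
table = [Entry("A", "y", "in", 1), Entry("A", "x", "out", 1),
         Entry("B", "x", "in", 1), Entry("B", "x", "out", 2)]
versions = {"y": 1, "x": 2}
print(search_rerun_set("B", table, versions))
```

In this example B alone cannot be rerun because its input was overwritten, so the search pulls in A, the function that produced that input. If no producer were recorded, the function would return None, which corresponds to a failed search.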
28. The apparatus of claim 22, wherein the third operation module is configured to:
generating a local control flow graph corresponding to the rerun set according to the parameter local version table;
and operating the rerun set according to the local control flow graph.
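One way to picture how an execution order might be derived for the rerun set is the Python sketch below. It approximates the local control flow graph with a data-dependence ordering built from the local version table, which is an assumption: the claims do not state how the graph is constructed. The row layout and the name `rerun_order` are hypothetical; `graphlib.TopologicalSorter` is part of the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter


def rerun_order(rerun_set, local_table):
    """Order the functions in `rerun_set` so that each producer runs
    before the functions that consume its output."""
    # Map (parameter, version) to the function in the set that wrote it.
    produced_by = {(p, v): f for f, p, role, v in local_table
                   if role == "out" and f in rerun_set}
    graph = {f: set() for f in rerun_set}    # node -> set of predecessors
    for f, p, role, v in local_table:
        if role == "in" and f in rerun_set:
            producer = produced_by.get((p, v))
            if producer is not None and producer != f:
                graph[f].add(producer)
    return list(TopologicalSorter(graph).static_order())


# Hypothetical rows: (function, parameter identifier, role, version)
local_table = [
    ("A", "y", "in", 1), ("A", "x", "out", 1),
    ("B", "x", "in", 1), ("B", "x", "out", 2),
    ("C", "x", "in", 2), ("C", "z", "out", 1),
]
print(rerun_order({"A", "B", "C"}, local_table))
```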
29. The apparatus according to any one of claims 22-28, wherein the apparatus further comprises:
a checkpointing module configured to set at least one checkpoint according to a preset checkpoint setting rule while the application program is running;
a target checkpoint determining module configured to determine, from the at least one checkpoint, a target checkpoint closest to the target abstract function in the case where the rerun set search fails;
and a fourth running module configured to rerun the application program starting from the target checkpoint so as to repair the computing error.
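The checkpoint fallback can be sketched as a lookup of the nearest earlier checkpoint. The Python example below assumes that checkpoints and abstract functions are identified by their position in the execution sequence, and that "closest" means the latest checkpoint taken at or before the target abstract function; both are assumptions, and the name `nearest_checkpoint` is hypothetical.

```python
import bisect


def nearest_checkpoint(checkpoint_positions, target_position):
    """Return the latest checkpoint at or before `target_position`,
    or None if no such checkpoint exists.

    `checkpoint_positions` must be sorted in ascending order.
    """
    i = bisect.bisect_right(checkpoint_positions, target_position)
    return checkpoint_positions[i - 1] if i > 0 else None


checkpoints = [0, 10, 20]   # positions at which checkpoints were taken
print(nearest_checkpoint(checkpoints, 17))
```

Restarting from an earlier checkpoint than necessary would still repair the error, only at a higher recomputation cost, which is presumably why the closest one is chosen.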
30. The apparatus of any of claims 16-29, wherein the application runs on a heterogeneous computing platform comprising a processor and an accelerator,
the marking and tracking module is configured to:
marking, as an abstract function, a kernel function that the processor transmits to the accelerator while the application is running.
31. An application fault tolerant processing apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1-15 when executing the instructions.
32. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1-15.
33. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in an electronic device, causes a processor in the electronic device to perform the method of any one of claims 1-15.
CN202111641381.8A 2021-12-29 2021-12-29 Fault-tolerant processing method and device for application program and storage medium Pending CN116414586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111641381.8A CN116414586A (en) 2021-12-29 2021-12-29 Fault-tolerant processing method and device for application program and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111641381.8A CN116414586A (en) 2021-12-29 2021-12-29 Fault-tolerant processing method and device for application program and storage medium

Publications (1)

Publication Number Publication Date
CN116414586A true CN116414586A (en) 2023-07-11

Family

ID=87056523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111641381.8A Pending CN116414586A (en) 2021-12-29 2021-12-29 Fault-tolerant processing method and device for application program and storage medium

Country Status (1)

Country Link
CN (1) CN116414586A (en)

Similar Documents

Publication Publication Date Title
US11249879B2 (en) Time-travel debugging with hot code replacement
US9836354B1 (en) Automated error detection and recovery for GPU computations in a service environment
US20110246823A1 (en) Task-oriented node-centric checkpointing (toncc)
US10339015B2 (en) Maintaining system reliability in a CPU with co-processors
CN101308471B (en) Method and device for data restoration
Losada et al. Resilient MPI applications using an application-level checkpointing framework and ULFM
Pourghassemi et al. cudacr: An in-kernel application-level checkpoint/restart scheme for cuda-enabled gpus
EP3652648B1 (en) Replaying time-travel traces relying on processor undefined behavior
Walters et al. Software-based fault tolerance for the Maestro many-core processor
US20130024675A1 (en) Return address optimisation for a dynamic code translator
Siavvas et al. Optimum interval for application-level checkpoints
US11836070B2 (en) Reducing trace recording overheads with targeted recording via partial snapshots
Chajed et al. Certifying a file system using crash hoare logic: correctness in the presence of crashes
Guo et al. Match: An mpi fault tolerance benchmark suite
US11194702B2 (en) History based build cache for program builds
US9287005B2 (en) Detecting missing write to cache/memory operations
Dou et al. ShortCut: accelerating mostly-deterministic code regions
US10733065B2 (en) Recovery of local resource
CN116414586A (en) Fault-tolerant processing method and device for application program and storage medium
CN115840691A (en) Remote repair of crash processes
Parasyris et al. Co-designing multi-level checkpoint restart for mpi applications
Losada et al. A portable and adaptable fault tolerance solution for heterogeneous applications
Huang et al. Selective symbolic type-guided checkpointing and restoration for autonomous vehicle repair
Losada et al. Extending an application-level checkpointing tool to provide fault tolerance support to OpenMP applications
US20240160994A1 (en) Dynamic checkpoint for simulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination