CN114327989A - Fault tolerant method, computer system, apparatus, electronic device and storage medium - Google Patents

Fault tolerant method, computer system, apparatus, electronic device and storage medium

Info

Publication number
CN114327989A
Authority
CN
China
Prior art keywords
error
computer
type
task
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111675346.8A
Other languages
Chinese (zh)
Inventor
汪小益
李伟
刘毅恒
蔡亮
尚璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Qulian Technology Co Ltd
Original Assignee
Hangzhou Qulian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Qulian Technology Co Ltd
Priority to CN202111675346.8A
Publication of CN114327989A
Legal status: Pending

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The embodiments of the application provide a fault tolerance method, a computer system, an apparatus, an electronic device and a storage medium. The computer system comprises a plurality of computer nodes, each configured with a fault-tolerant middleware in which a plurality of processing mechanisms are preset. When an error is detected while a computer node is running a task, the fault-tolerant middleware determines the type of the error, where an error of the first type is an unrecoverable error and an error of the second type is a recoverable error. When the error is of the first type, the task is stopped; when the error is of the second type, the error is processed based on a preset processing mechanism. The method allows the computer system to tolerate errors of the second type and avoids exiting the task directly whenever a computer node encounters an error, which can improve the fault tolerance and the stability of the computer system.

Description

Fault tolerant method, computer system, apparatus, electronic device and storage medium
Technical Field
The present application relates to the field of computer application technologies, and in particular, to a fault tolerance method, a computer system, an apparatus, an electronic device, and a storage medium.
Background
People generate a large amount of data in work, entertainment and other activities, and when the processing power of a single computer cannot meet the computing and storage requirements of the data, a distributed system is considered. The distributed system is composed of a plurality of computers which are coordinated with each other as nodes, one task is divided into a plurality of subtasks, the plurality of subtasks are distributed to the plurality of nodes for processing, and in the task running process, errors can occur in some nodes, such as network errors, so that the running of the whole task is terminated.
At present, a method for solving errors is to set a main node and a standby node, where the main node and the standby node are responsible for processing the same subtask, and when an error occurs in operation of the main node, a node executing the task is switched to the standby node. However, the primary node and the standby node may simultaneously make errors, resulting in the termination of the entire task operation.
Disclosure of Invention
In view of the foregoing technical problems, embodiments of the present application provide a fault tolerance method, a computer system, an apparatus, an electronic device, and a storage medium, which avoid the problem of task operation termination when a master node and a standby node simultaneously generate an error by setting a processing mechanism for processing the error.
In a first aspect, an embodiment of the present application provides a fault tolerance method applied to a computer system, where the computer system includes a plurality of computer nodes, and the method includes:
while a computer node is running a task, determining the type of an error when an error is detected in the operation of the computer node; stopping running the task when the error is of a first type, wherein an error of the first type is an unrecoverable error; and when the error is of a second type, processing the error based on a preset processing mechanism, wherein an error of the second type is a recoverable error.
Specifically, when the error is of the second type, processing the error based on a preset processing mechanism includes: detecting whether first information is received, wherein the first information is used for presetting a processing mechanism; when the first information is received, processing the error according to the first information.
Specifically, the processing mechanisms are divided into a plurality of levels according to the time urgency, and the first information is used to select one level of processing mechanism from the plurality of levels of processing mechanisms as a preset processing mechanism.
In particular, the first information is received when the computer node receives a task processing request.
The first information may be set by the task initiator, and the first information may include a processing mechanism customized by the task initiator.
In particular, the method further comprises: restarting the computer node after the computer node stops running the task.
Specifically, restarting the computer node after the computer node stops running the task includes: obtaining the operation result that the computer node last stored before it stopped running the task; and restarting the computer node according to that last stored operation result.
In a second aspect, the present application provides a computer system for implementing the fault tolerance method according to the first aspect.
The computer system comprises a plurality of computer nodes and a plurality of fault-tolerant middleware, wherein each computer node is provided with the fault-tolerant middleware, a processing mechanism for processing errors is preset in the fault-tolerant middleware, and when an error occurs in a computer node operation task, the fault-tolerant method in the first aspect is realized by using the fault-tolerant middleware.
In a third aspect, an embodiment of the present application provides a fault tolerance apparatus, which is applied to a computer system, where the computer system includes a plurality of computer nodes, and the apparatus includes:
the detection module is used for determining the type of an error when, while the computer node is running a task, an error is detected in the operation of the computer node;
the processing module is used for stopping running the task when the error is a first type of error, and the first type of error is an unrecoverable error; and when the error is of a second type, processing the error based on a preset processing mechanism, wherein the error of the second type is a recoverable error.
In a fourth aspect, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the fault tolerance method according to the first aspect when executing the computer program.
In a fifth aspect, embodiments of the present application further provide a computer-readable storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the fault tolerance method according to the first aspect.
In a sixth aspect, the present application further provides a computer program product, which includes a computer program; when the computer program product runs on a computer, the fault tolerance method according to the first aspect is implemented.
The embodiment of the application provides a fault tolerance method, wherein in the process of running a task by a computer node, when an error in the running of the computer node is detected, the type of the error is judged, wherein the error is divided into a first type and a second type, the error of the first type is an unrecoverable error, and the error of the second type is a recoverable error. When the error is of the first type, the computer node is indicated to be incapable of automatically adjusting the error, and the computer node stops running the task; and when the error is of the second type, processing the error based on a preset processing mechanism, so that the computer node can normally operate and process the task. The method provided by the embodiment of the application enables the computer system to allow the second type of errors to occur, prevents the computer node from directly quitting the task once the error occurs, can improve the fault tolerance of the computer system, and improves the stability of the computer system.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a computer system according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a fault-tolerant middleware provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a fault tolerant logic processing module according to an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a fault tolerance method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an apparatus according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, "a plurality" means two or more, and "at least one", "one or more" means one, two or more, unless otherwise specified.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
People generate a large amount of data in work, entertainment and other activities, and when the processing power of a single computer cannot meet the computing and storage requirements of the data, a distributed system is considered. A distributed system is composed of a plurality of computers that coordinate with each other as nodes: one task is divided into a plurality of subtasks, and the subtasks are distributed to the nodes for processing. While the task is running, some nodes may fail and go offline, so that the whole task is terminated.
At present, one way of handling node errors is to set up a main node and a standby node that are responsible for processing the same subtask; when the main node encounters an error during operation, execution of the task is switched to the standby node. However, the main node and the standby node may both fail and go offline at the same time, so that the whole task is terminated. If the distributed system is running a large-scale, computation-intensive task that takes tens of hours, and the whole task exits because some nodes go offline, task processing efficiency is extremely low.
Therefore, the embodiment of the application provides a computer system, which comprises a plurality of computer nodes, wherein each computer node is provided with a fault-tolerant middleware, and a plurality of error processing mechanisms are preset in the fault-tolerant middleware.
The middleware is a kind of software between the application system and the system software, and it uses the basic service provided by the system software to connect each part of the application system or different applications on the network, so as to achieve the purpose of resource sharing and function sharing. The fault-tolerant middleware provides fault-tolerant infrastructure for the distributed system, combines the advantages of the middleware technology, improves the fault-tolerant capability of the distributed system, simplifies the fault-tolerant development of the distributed system and realizes the fault-tolerant function.
On the basis of the computer system, the application also provides a fault tolerance method. While a computer node is running a task, when an error is detected in the operation of the computer node, the fault-tolerant middleware determines the type of the error. Errors are divided into a first type and a second type: an error of the first type is an unrecoverable error, and an error of the second type is a recoverable error. When the error is of the first type, the computer node is stopped; when the error is of the second type, the error is processed based on a preset processing mechanism, so that the computer node can continue to run and process the task normally.
According to the method provided by the embodiment of the application, when the second type of errors occur in the computer node, the errors are firstly processed instead of directly reporting the faults, so that the second type of errors can be allowed to occur in the computer system, the fault tolerance of the computer system can be improved, and the stability of the computer system is improved.
FIG. 1 shows a computer system 100 according to an embodiment of the present disclosure. The computer system 100 includes a plurality of computer nodes 101, each of which is configured with a fault-tolerant middleware 102. The computer nodes 101 may communicate with each other through the fault-tolerant middleware 102 and cooperatively process a task. For example, one computer node 101 acts as a task initiator and sends a cooperative computing request to the other computer nodes 101, and the plurality of computer nodes 101 then cooperatively process the computing task.
The structure of the fault-tolerant middleware 102 provided in the embodiment of the present application is shown in fig. 2, where the fault-tolerant middleware 102 includes a network module, a computing module, a persistent storage module, a fault-tolerant logic processing module, a state recovery module, and a master/slave node switching module.
The network module stores the communication connection relationships between all computer nodes 101 in the computer system 100, and can capture both the information that the computer node sends to other computer nodes and the information that it receives from other computer nodes.
The communication connection relationship between the computer nodes 101 in the network module may be dynamically changed, and the network module may dynamically update the original communication connection relationship according to the operating states of some computer nodes.
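The behaviour of the network module described above can be illustrated with a minimal sketch. The class name `NetworkModule`, its fields and its methods are hypothetical and chosen only for illustration; the application does not specify an interface, and real message transport is replaced here by simple in-memory lists.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkModule:
    """Hypothetical network module: tracks peers and records traffic."""

    node_id: str
    # Current communication connection relationship of this node.
    peers: set[str] = field(default_factory=set)
    # Captured traffic as (peer, payload) pairs.
    sent: list[tuple[str, str]] = field(default_factory=list)
    received: list[tuple[str, str]] = field(default_factory=list)

    def update_peer(self, peer_id: str, online: bool) -> None:
        # Dynamically update the relationship from a peer's operating state.
        if online:
            self.peers.add(peer_id)
        else:
            self.peers.discard(peer_id)

    def send(self, peer_id: str, payload: str) -> None:
        if peer_id not in self.peers:
            raise ConnectionError(f"{peer_id} is not reachable from {self.node_id}")
        self.sent.append((peer_id, payload))

    def receive(self, peer_id: str, payload: str) -> None:
        self.received.append((peer_id, payload))


net = NetworkModule("node-a", peers={"node-b", "node-c"})
net.send("node-b", "partial-sum=10")
net.update_peer("node-c", online=False)  # node-c went offline
```

In this sketch, a send to a peer that has been marked offline raises `ConnectionError`, which is the kind of network error that the middleware could then capture and classify.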
In the embodiment of the application, the network module can be set by a user, for example, when the task initiator sends a computing task to the computer node, the task initiator also sends a custom network module to the computer node, and when the fault-tolerant middleware of the computer node receives the custom network module, the original network module is replaced by the custom network module.
The computing module provides computing capability for the computer nodes. The computer nodes hold the function logic, and the functions are registered in the computing module of the fault-tolerant middleware, so that when a computer node executes a computing task the functions are invoked through the fault-tolerant middleware. The computing module can therefore determine the input parameters and output parameters of every function, and when a computation error occurs, the fault-tolerant middleware can capture the error in the computing logic and process it. The computing module also passes the input parameters and the output parameters to the persistent storage module, so that the task operation results can be stored.
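A minimal sketch of this registration pattern follows. The names `ComputeModule`, `register` and `call` are hypothetical, and the two lists stand in for the persistent storage module and the fault-tolerant logic processing module, whose interfaces the application does not define.

```python
from typing import Any, Callable


class ComputeModule:
    """Hypothetical computing module: registered functions are called through it."""

    def __init__(self) -> None:
        self.functions: dict[str, Callable[..., Any]] = {}
        # Stand-in for the persistent storage module.
        self.records: list[dict[str, Any]] = []
        # Stand-in for the fault-tolerant logic processing module.
        self.errors: list[tuple[str, Exception]] = []

    def register(self, func: Callable[..., Any]) -> Callable[..., Any]:
        self.functions[func.__name__] = func
        return func

    def call(self, name: str, *args: Any) -> Any:
        try:
            result = self.functions[name](*args)
        except Exception as exc:
            # The error is captured instead of terminating the task outright.
            self.errors.append((name, exc))
            return None
        # Inputs and outputs are known here, so they can be persisted.
        self.records.append({"function": name, "inputs": args, "output": result})
        return result


compute = ComputeModule()


@compute.register
def divide(a: float, b: float) -> float:
    return a / b


ok = compute.call("divide", 6, 3)
failed = compute.call("divide", 1, 0)  # captured, not raised
```

Because every call goes through `call`, the module sees each function's inputs and outputs without the function itself needing any fault-tolerance code.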
The persistent storage module is used for storing all intermediate states during the running of a task, including the input parameters and output parameters of the computing module, as well as the information captured by the network module that the computer node sends to, and receives from, other computer nodes during computation. The persistent storage module supports dynamic termination or suspension of tasks, and the information it stores makes it possible to recover the state of the computer node when the computer node encounters an error during operation.
The fault-tolerant logic processing module provides a set of fault-tolerant processing logic. The computing module and the network module send the errors they capture to the fault-tolerant logic processing module, which performs fault-tolerant processing on errors of the second type according to that logic. The fault-tolerant processing logic may be the logic set for the fault-tolerant middleware when the middleware is configured, or custom fault-tolerant processing logic that a task initiator sends to the computer node together with a computing task.
The state recovery module is used for restarting the computer node when the computer node has stopped running or has stopped the task. The state recovery module obtains from the persistent storage module the input parameters and output parameters that were last stored before the computer node stopped running the computing task, recovers the task state before the error from those parameters, and continues the task from the recovered task state.
According to the embodiment of the application, the input parameters and the output parameters of the function are stored, so that the task state of the computer node can be recovered at the fine granularity of the function level, redundant calculation and network transmission are not needed, and the calculation efficiency can be improved. If the task state of the computer node is restored by using the information sent by the computer node to other computer nodes and the received information of other computer nodes, which is captured by the network module, that is, the restoration granularity is coarse, the computer node needs to calculate again according to the information captured by the network module in the restoration process, and the operation that is completed before the computer node stops running the calculation task will be repeated.
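Function-level recovery can be sketched as follows, under the simplifying assumption that a task is a linear sequence of functions. The helper `run_pipeline`, the `store` dictionary and the `fail_at` parameter are hypothetical and serve only to simulate a node stopping part-way through; `store` stands in for the persistent storage module.

```python
from typing import Callable, Optional

Step = Callable[[int], int]


def run_pipeline(steps: list[Step], start: int,
                 store: dict[int, tuple[int, int]],
                 fail_at: Optional[int] = None) -> int:
    """Run steps in order; store maps step index -> (input, output)."""
    value = start
    for index, step in enumerate(steps):
        if index in store:
            # Recovery path: reuse the stored output instead of recomputing.
            value = store[index][1]
            continue
        if index == fail_at:
            raise RuntimeError(f"node stopped before step {index}")
        output = step(value)
        store[index] = (value, output)
        value = output
    return value


calls: list[str] = []


def add_one(x: int) -> int:
    calls.append("add_one")
    return x + 1


def double(x: int) -> int:
    calls.append("double")
    return x * 2


def square(x: int) -> int:
    calls.append("square")
    return x * x


steps = [add_one, double, square]
store: dict[int, tuple[int, int]] = {}

try:
    run_pipeline(steps, 3, store, fail_at=2)  # simulated stop before square
except RuntimeError:
    pass

result = run_pipeline(steps, 3, store)  # restart from the stored state
```

After the restart, `calls` shows that `add_one` and `double` each ran only once: the completed functions are not repeated, which is the saving that the fine-grained recovery described above is meant to provide.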
The main/standby node switching module is used for switching execution from the main node to the standby node when the main node stops running, that is, when the computer node still cannot resume running after the error of the second type has been processed. The standby node and the main node share one fault-tolerant middleware, and the standby node takes over from the main node and continues the task after the state recovery module has recovered the task information.
When the standby node encounters an error during operation, the error is likewise handled by the fault-tolerant logic processing module. If the standby node still cannot resume running after the error of the second type has been processed, the computer node is removed from the task and the other computer nodes in the computer system continue the cooperative computation, so that the computing task can still proceed when a small number of computer nodes are offline.
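The switching and removal decision can be sketched as below. `Replica`, `select_runner` and the `can_run` flag are hypothetical names; the flag abstracts the outcome "cannot resume running after error processing", and state recovery before takeover is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str
    can_run: bool  # False once the replica cannot resume after error handling


def select_runner(node_id: str, main: Replica, standby: Replica,
                  participants: set[str]) -> Optional[str]:
    """Return the replica that continues the subtask, or None if the node is removed."""
    if main.can_run:
        return main.name
    if standby.can_run:
        # Main node stopped: the standby node takes over.
        return standby.name
    # Both replicas are down: remove the node, the remaining nodes carry on.
    participants.discard(node_id)
    return None


participants = {"node-1", "node-2", "node-3"}
runner = select_runner("node-2", Replica("node-2/main", False),
                       Replica("node-2/standby", True), participants)
dropped = select_runner("node-3", Replica("node-3/main", False),
                        Replica("node-3/standby", False), participants)
```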
Specifically, fig. 3 shows a schematic structural diagram of a fault tolerant logic processing module according to an embodiment of the present application. Wherein, fault tolerant logic processing module includes three levels: user layer, framework layer and network layer.
The network layer is used for obtaining the errors that occur in the computer nodes and determining their types. In the embodiment of the application, errors are divided into a first type and a second type: the first type is an unrecoverable error, such as a read-write error, and the second type is a recoverable error, such as a network error or a network stall. The errors of the second type, i.e. the errors that need fault-tolerant handling, may be preset when the fault-tolerant middleware is packaged. A user may also define the errors of the second type that need fault-tolerant handling when requesting the computer node to process a task and send this definition to the network layer, so that the network layer determines the type of an error according to the user-defined content.
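A minimal sketch of this classification step is given below. The error codes, the `ErrorType` enum and the `classify` function are hypothetical. The sketch also assumes that any error not listed as recoverable is treated as unrecoverable, which the application does not state explicitly.

```python
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    UNRECOVERABLE = 1  # first type
    RECOVERABLE = 2    # second type


# Hypothetical defaults preset when the middleware is packaged.
DEFAULT_RECOVERABLE = frozenset({"network_error", "network_stall"})


def classify(error_code: str,
             user_recoverable: Optional[frozenset[str]] = None) -> ErrorType:
    """Use the user-defined set when one was supplied, else the preset one."""
    recoverable = DEFAULT_RECOVERABLE if user_recoverable is None else user_recoverable
    if error_code in recoverable:
        return ErrorType.RECOVERABLE
    # Assumption: anything not listed is treated as unrecoverable.
    return ErrorType.UNRECOVERABLE


default_result = classify("network_stall")
fatal_result = classify("read_write_error")
custom_result = classify("disk_busy", frozenset({"disk_busy"}))
```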
The framework layer provides different processing modules for different errors: the first module directly terminates the task, the second module performs default fault-tolerant processing, and the third module performs custom error processing. The framework layer uses the second and third modules to perform fault-tolerant processing on errors of the second type. For errors of the first type, the first module is used and the task termination logic is entered directly, that is, the computer node stops running the task.
The second module is the fault-tolerant processing logic (i.e., the processing mechanism) preset in the fault-tolerant middleware when it is packaged. The second type of error covers multiple kinds of errors; different processing logic may be set for different errors, or the same processing logic may be shared by several errors. For example, the processing logic is divided into a plurality of levels according to time urgency, such as level 1, level 2 and level 3. Level 1 may be to delete a computer node when it has an error, with the remaining computer nodes continuing the task; level 2 may be to suspend all computer nodes and wait for the computer node to recover, deleting it if it has not recovered after a preset time period; and level 3 may be to directly delete the computer node.
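The graded logic in this example can be sketched as follows. The function `handle_by_level`, its parameters and its return strings are hypothetical, and suspending the other nodes is reduced to a polling wait. Since the text describes both level 1 and level 3 as deleting the node, the sketch treats them identically.

```python
import time
from typing import Callable


def handle_by_level(level: int, node_id: str, participants: set[str],
                    is_recovered: Callable[[], bool],
                    timeout_s: float = 1.0, poll_s: float = 0.01) -> str:
    """Apply one level of the default processing logic to a failed node."""
    if level == 2:
        # Wait for the node to recover, up to the preset time period.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_recovered():
                return "resumed"
            time.sleep(poll_s)
    elif level not in (1, 3):
        raise ValueError(f"unknown level: {level}")
    # Levels 1 and 3, or level 2 after the timeout: delete the node.
    participants.discard(node_id)
    return "removed"


nodes = {"node-1", "node-2", "node-3"}
first = handle_by_level(1, "node-1", nodes, is_recovered=lambda: False)
second = handle_by_level(2, "node-2", nodes, is_recovered=lambda: True)
third = handle_by_level(2, "node-3", nodes, is_recovered=lambda: False,
                        timeout_s=0.05)
```

A more urgent task would choose a level that deletes the failed node immediately, while a task that can afford to wait would choose the level that gives the node time to recover.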
For the third module, when the user layer transmits the user-defined processing logic and the error of the computer node is the error in the user-defined processing logic, the fault-tolerant logic processing module processes according to the user-defined processing logic.
The user layer provides an interface for the user. Through it, the user can pass in user-defined error types and user-defined processing logic, or can use the fault-tolerant processing logic preset in the fault-tolerant middleware when it is packaged. When the preset fault-tolerant processing logic is divided into levels, the user can specify a fault-tolerant processing level, and when an error occurs in a computer node, the processing logic corresponding to that level is executed preferentially.
For example, a task initiator sends custom fault-tolerant processing logic to a computer node when sending a computing task to the computer node. Before processing an error, the fault-tolerant logic processing module detects whether first information has been received, where the first information comprises user-defined error processing logic input by the user or a fault-tolerant processing level input by the user. When the first information has been received, the fault-tolerant logic processing module processes the error according to the processing logic corresponding to the first information; when the first information has not been received, it processes the error according to the default fault-tolerant processing logic preset when the fault-tolerant middleware was packaged.
Based on the computer system, the fault-tolerant middleware and the fault-tolerant logic processing module, the embodiment of the application provides a fault-tolerant method, which comprises the following steps:
S401: while a computer node is running a task, determining the type of an error when an error is detected in the operation of the computer node.
S402: stopping running the task when the error is a first type of error, wherein the first type of error is an unrecoverable error; and when the error is of a second type, processing the error based on a preset processing mechanism, wherein the error of the second type is a recoverable error.
When the fault-tolerant processing module has received a user-defined error processing type, it determines whether the error is of the first type or of the second type based on the user-defined error processing type. When the fault-tolerant processing module has not received a user-defined error processing type, it determines whether the error is of the first type or of the second type based on the default setting.
When the error is of the first type, the error is processed by the task termination module of the framework layer, and the running task is terminated directly.
When the error is of the second type, it is detected whether user-defined error processing logic input by the user has been received; if so, the error is processed using the user-defined error processing logic.
If no user-defined error processing logic has been received, the error is processed using the default fault-tolerant processing logic. When the default fault-tolerant processing logic is divided into a plurality of levels, it is detected whether a fault-tolerant processing level input by the user has been received; if so, the error is processed according to the processing logic corresponding to that level. If no level input by the user has been received, the error is processed according to the processing logic corresponding to the default processing level.
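The decision order described in the preceding paragraphs can be summarised in a short sketch. `FirstInformation`, `handle_error` and the returned action strings are hypothetical; the sketch only shows which branch is selected, not how each processing mechanism is carried out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FirstInformation:
    """Hypothetical container for what the task initiator may send."""

    custom_handlers: dict[str, Callable[[str], str]] = field(default_factory=dict)
    level: Optional[int] = None


def handle_error(error_code: str, recoverable: set[str],
                 first_info: Optional[FirstInformation] = None,
                 default_level: int = 1) -> str:
    # First type: unrecoverable, so the task is terminated directly.
    if error_code not in recoverable:
        return "terminate_task"
    if first_info is not None:
        # Second type, custom logic received: it takes precedence.
        handler = first_info.custom_handlers.get(error_code)
        if handler is not None:
            return handler(error_code)
        # No custom logic, but a user-selected level was received.
        if first_info.level is not None:
            return f"default_level_{first_info.level}"
    # Nothing received: fall back to the default processing level.
    return f"default_level_{default_level}"


recoverable = {"network_error", "network_stall"}
a = handle_error("read_write_error", recoverable)
b = handle_error("network_error", recoverable)
c = handle_error("network_error", recoverable, FirstInformation(level=2))
d = handle_error(
    "network_stall", recoverable,
    FirstInformation(custom_handlers={"network_stall": lambda code: f"custom:{code}"},
                     level=2),
)
```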
When the error cannot be handled and the computer node has stopped running the computing task, the embodiment of the application can restart the computer node so that it resumes the computing task from the operation result that was last stored before it stopped.
Details which are not mentioned in the embodiments of the present application are referred to the embodiments shown in fig. 1 to 3 described above.
In summary, the embodiment of the present application provides a computer system, which includes a plurality of computer nodes, each configured with a fault-tolerant middleware in which a plurality of processing mechanisms are preset; when an error is detected in the operation of a computer node, the error is processed.
The application also provides a fault tolerance method, wherein when an error is detected while a computer node is running a task, the fault-tolerant middleware determines the type of the error, where an error of the first type is an unrecoverable error and an error of the second type is a recoverable error. When the error is of the first type, the task is stopped; when the error is of the second type, the error is processed based on a preset processing mechanism, so that the computer node can continue to run and process the task normally. According to the method provided by the embodiment of the application, when an error of the second type occurs in a computer node, the error is processed first instead of a fault being reported directly, so that the computer system can tolerate errors of the second type, which can improve the fault tolerance and the stability of the computer system.
In addition, according to the embodiment of the application, after the computer node stops running, the task state of the computer node can be recovered according to the fine granularity of the function level, and redundant computation and network transmission are not needed. The embodiment of the application also supports the switching of the main and standby nodes.
The computer system and the fault tolerance method provided by the present application are described above, and the apparatus and the electronic device provided by the embodiments of the present application are described below.
FIG. 5 shows a fault tolerance apparatus 500 according to an embodiment of the present application, which is applied to a computer system including a plurality of computer nodes. The apparatus 500 includes a detection module 501 and a processing module 502.
The detection module 501 is configured to determine the type of an error when, while a computer node is running a task, an error is detected in the operation of the computer node.
A processing module 502, configured to stop running the task when the error is a first type of error, where the first type of error is an unrecoverable error; and when the error is of a second type, processing the error based on a preset processing mechanism, wherein the error of the second type is a recoverable error.
In particular, the processing module 502 is further configured to detect whether first information is received, where the first information is used to preset a processing mechanism; when the first information is received, processing the error according to the first information.
Specifically, the processing mechanisms are divided into a plurality of levels according to the time urgency, and the first information is used to select one level of processing mechanism from the plurality of levels of processing mechanisms as a preset processing mechanism.
In particular, the first information is received when the computer node receives a task processing request.
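One way to picture the first information is as an optional field carried by the task processing request that names the urgency level to use. The following sketch assumes three levels, the handler functions, the `TaskRequest` type, and a fallback to a default level when no first information is received; none of these details are given in the application, which only states that the levels are divided by time urgency.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


# Hypothetical processing mechanisms, one per urgency level.
def retry_immediately(error: Exception) -> str:
    return f"retry now: {error}"


def retry_later(error: Exception) -> str:
    return f"retry later: {error}"


def log_only(error: Exception) -> str:
    return f"logged: {error}"


MECHANISMS: Dict[str, Callable[[Exception], str]] = {
    "high": retry_immediately,
    "medium": retry_later,
    "low": log_only,
}
DEFAULT_LEVEL = "medium"  # assumed fallback; not stated in the source


@dataclass
class TaskRequest:
    payload: str
    # The "first information": the requested level, if the caller sent one.
    first_information: Optional[str] = None


def select_mechanism(request: TaskRequest) -> Callable[[Exception], str]:
    """Pick the preset processing mechanism for this request."""
    level = request.first_information
    if level is None:  # no first information was received
        level = DEFAULT_LEVEL
    return MECHANISMS[level]
```

Carrying the level in the request lets each task choose how urgently its recoverable errors are handled, without changing the middleware itself.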
In particular, the processing module 502 is also configured to restart the computer node after the computer node stops running the task.
Specifically, the processing module 502 is further configured to obtain the running result most recently stored by the computer node before it stopped running the task, and to restart the computer node according to that result.
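Restarting from the most recently stored running result resembles checkpoint-based recovery. The sketch below is an illustration under stated assumptions: the JSON file format, the function names `save_result`, `load_last_result`, and `run_task`, and the toy task (summing step numbers) are all hypothetical and do not come from the application.

```python
import json
import os


def save_result(path: str, step: int, value: int) -> None:
    """Store the latest running result; write to a temp file, then swap it in."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"step": step, "value": value}, f)
    os.replace(tmp, path)  # replaces the old checkpoint in one step


def load_last_result(path: str) -> dict:
    """Return the last stored result, or an initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "value": 0}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_task(path: str, total_steps: int, fail_at: int = 0) -> int:
    """Resume from the last stored result and run the remaining steps.

    fail_at simulates a first-type error at the given step (0 = never).
    """
    state = load_last_result(path)
    value = state["value"]
    for step in range(state["step"] + 1, total_steps + 1):
        if step == fail_at:
            raise RuntimeError(f"node stopped at step {step}")
        value += step
        save_result(path, step, value)
    return value
```

If the first call fails partway through, a second call to `run_task` with the same path continues from the last stored step instead of recomputing the earlier ones, which matches the stated goal of avoiding redundant computation.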
It should be understood that the apparatus 500 of the embodiment of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The fault-tolerant method shown in Fig. 4 may also be implemented by software, in which case the apparatus 500 and each of its modules may be software modules.
Fig. 6 is a schematic structural diagram of an electronic device 600 according to an embodiment of the present application. As shown in fig. 6, the device 600 includes a processor 601, a memory 602, a communication interface 603, and a bus 604. The processor 601, the memory 602, and the communication interface 603 communicate with each other via the bus 604, or may communicate with each other via other means such as wireless transmission. The memory 602 is used for storing instructions and the processor 601 is used for executing the instructions stored by the memory 602. The memory 602 stores program codes 1021, and the processor 601 can call the program codes 1021 stored in the memory 602 to execute the fault tolerance method shown in fig. 4.
It should be understood that in the embodiments of the present application, the processor 601 may be a CPU, and the processor 601 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 602 may include both read-only memory and random access memory, and provides instructions and data to the processor 601. The memory 602 may also include non-volatile random access memory. The memory 602 may be either volatile memory or non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In addition to a data bus, the bus 604 may include a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are all labeled as bus 604 in Fig. 6.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that contains one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, or magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a Solid State Drive (SSD).
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A fault tolerance method applied to a computer system, the computer system comprising a plurality of computer nodes, the method comprising:
while a computer node is running a task, determining the type of an error when an error in the running of the computer node is detected;
stopping running the task when the error is a first type of error, wherein the first type of error is an unrecoverable error;
and when the error is of a second type, processing the error based on a preset processing mechanism, wherein the error of the second type is a recoverable error.
2. The method according to claim 1, wherein when the error is of the second type, processing the error based on a preset processing mechanism comprises:
detecting whether first information is received, wherein the first information is used for presetting a processing mechanism;
and when first information is received, processing the error according to the first information.
3. The method according to claim 2, wherein the processing mechanisms are divided into a plurality of levels according to the time urgency, and the first information is used to select one level of processing mechanisms from among the plurality of levels of processing mechanisms as a preset processing mechanism.
4. The method of claim 2, wherein the first information is received when the computer node receives a task processing request.
5. The method according to any one of claims 1 to 4, further comprising:
and restarting the computer node after the computer node stops running the task.
6. The method of claim 5, wherein restarting the computer node after the computer node stops running the task comprises:
obtaining the operation result most recently stored before the computer node stopped running the task;
and restarting the computer node according to the operation result stored for the last time before the computer node stops operating the task.
7. A computer system, characterized in that the computer system is adapted to implement the method of any of claims 1 to 6.
8. A fault tolerant apparatus for use in a computer system, said computer system comprising a plurality of computer nodes, said apparatus comprising:
a detection module, configured to determine the type of an error when, while a computer node is running a task, an error in the running of the computer node is detected;
a processing module, configured to stop running the task when the error is a first type of error, wherein the first type of error is an unrecoverable error; and, when the error is of a second type, to process the error based on a preset processing mechanism, wherein the error of the second type is a recoverable error.
9. An electronic device, comprising: a memory storing a computer program and a processor implementing the method of any one of claims 1 to 6 when the processor executes the computer program.
10. A computer-readable storage medium having stored thereon computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-6.
CN202111675346.8A 2021-12-31 2021-12-31 Fault tolerant method, computer system, apparatus, electronic device and storage medium Pending CN114327989A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111675346.8A CN114327989A (en) 2021-12-31 2021-12-31 Fault tolerant method, computer system, apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111675346.8A CN114327989A (en) 2021-12-31 2021-12-31 Fault tolerant method, computer system, apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114327989A true CN114327989A (en) 2022-04-12

Family

ID=81020316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111675346.8A Pending CN114327989A (en) 2021-12-31 2021-12-31 Fault tolerant method, computer system, apparatus, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114327989A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination