CN115328741A

CN115328741A - Exception handling method, device, equipment and storage medium

Info

Publication number: CN115328741A
Application number: CN202211139403.5A
Authority: CN
Inventors: 伍冲斌; 孙嘉葳; 林帅浩; 胡冠杰
Original assignee: Douyin Vision Co Ltd
Current assignee: Douyin Vision Co Ltd
Priority date: 2022-09-19
Filing date: 2022-09-19
Publication date: 2022-11-11

Abstract

Embodiments of the present disclosure provide an exception handling method, apparatus, device, and storage medium. The method comprises: acquiring operation index time sequence information corresponding to each node server in a node pool; detecting, based on the operation index time sequence information, whether each node server meets an abnormal alarm condition, and determining the abnormal node servers that meet the abnormal alarm condition and the abnormal operation indexes corresponding to those abnormal node servers; determining a target abnormal type corresponding to the abnormal node server based on the abnormal operation index, and determining a target repair processing mode matched with the target abnormal type; and performing exception repair on the abnormal node server based on the target repair processing mode. With the technical solution of the embodiments of the present disclosure, abnormal node servers can be identified and repaired automatically, which improves the efficiency and accuracy of exception handling.

Description

Exception handling method, device, equipment and storage medium

Technical Field

The present disclosure relates to computer technologies, and in particular, to an exception handling method, an exception handling apparatus, an exception handling device, and a storage medium.

Background

With the rapid development of computer technology, more and more service platforms are being developed, such as FaaS (Function as a Service) platforms. On a FaaS platform, a program is deployed directly to the platform rather than to a physical machine, a virtual machine, or a container. Execution of a function task is triggered when an event arrives; after execution finishes, the function exits and the occupied resources are released.

In actual operation, a node server in a service platform may run abnormally for various reasons, for example an excessively high Central Processing Unit (CPU) load or excessively high disk usage, so that the node server cannot execute its allocated tasks normally and task execution fails. At present, such abnormal conditions are usually identified and repaired manually, which is time-consuming and labor-intensive and greatly reduces exception handling efficiency.

Disclosure of Invention

The disclosure provides an exception handling method, an exception handling device, an exception handling apparatus and a storage medium, so as to realize automatic identification and automatic repair of an abnormal node server and improve exception handling efficiency and accuracy.

In a first aspect, an embodiment of the present disclosure provides an exception handling method, including:

acquiring operation index time sequence information corresponding to each node server in a node pool;

detecting whether each node server meets an abnormal alarm condition based on the operation index time sequence information, and determining an abnormal node server meeting the abnormal alarm condition and an abnormal operation index corresponding to the abnormal node server;

determining a target abnormal type corresponding to the abnormal node server based on the abnormal operation index, and determining a target repairing processing mode matched with the target abnormal type;

and performing exception repair on the abnormal node server based on the target repair processing mode.

In a second aspect, an embodiment of the present disclosure further provides an exception handling apparatus, including:

the operation index time sequence information acquisition module is used for acquiring operation index time sequence information corresponding to each node server in the node pool;

the abnormal alarm detection module is used for detecting whether each node server meets an abnormal alarm condition based on the operation index time sequence information, and determining an abnormal node server meeting the abnormal alarm condition and an abnormal operation index corresponding to the abnormal node server;

a target repair processing mode determining module, configured to determine, based on the abnormal operation index, a target abnormal type corresponding to the abnormal node server, and determine a target repair processing mode matched with the target abnormal type;

and the exception repair module is used for performing exception repair on the abnormal node server based on the target repair processing mode.

In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:

one or more processors;

a storage device configured to store one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the exception handling method according to any one of the embodiments of the present disclosure.

In a fourth aspect, the embodiments of the present disclosure further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are used for executing the exception handling method according to any one of the embodiments of the present disclosure.

According to the embodiments of the present disclosure, the operation index time sequence information corresponding to each node server in the node pool is obtained, and whether each node server meets the abnormal alarm condition is detected based on that information, so that abnormal node servers meeting the abnormal alarm condition can be identified automatically and found in time. A target abnormal type corresponding to the abnormal node server is then determined according to the abnormal operation index corresponding to the abnormal node server, and the abnormal node server is repaired automatically and in a targeted manner using the target repair processing mode matched with the target abnormal type. This improves the accuracy and efficiency of exception repair and ensures normal operation of the node.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

Fig. 1 is a schematic flow chart of an exception handling method provided in an embodiment of the present disclosure;

Fig. 2 is an exemplary diagram of an exception handling process according to an embodiment of the present disclosure;

Fig. 3 is a schematic flow chart of another exception handling method provided in an embodiment of the present disclosure;

Fig. 4 is a schematic structural diagram of an exception handling apparatus provided in an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.

It is noted that references to "a" or "an" in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will appreciate that references to "one or more" are intended to be exemplary and not limiting unless the context clearly indicates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

Fig. 1 is a schematic flow chart of an exception handling method provided in an embodiment of the present disclosure. This embodiment is applicable to identifying and repairing exceptions in node servers, and in particular to exception handling for node servers in a FaaS platform. The FaaS platform may include a FaaS gateway, a FaaS control server, and a node pool that provides computing resources. The node pool comprises a plurality of node servers. A user or an event-driven trigger submits a function task to the FaaS gateway; after the function task is scheduled by the scheduling layer, the FaaS control server distributes it to a node server in the node pool for execution. When the function finishes running, it exits and the occupied resources are released. The exception handling method may be executed by an exception handling apparatus, which may be implemented in software and/or hardware and, optionally, by an electronic device such as a mobile terminal, a PC terminal, or a server.

As shown in fig. 1, the exception handling method specifically includes the following steps:

S110, obtaining operation index time sequence information corresponding to each node server in the node pool.

A node pool may refer to a group of node servers in a cluster that have the same configuration. The node pool may include one or more node servers, and each node server is treated as a monitoring object for exception monitoring. An operation index may be a parameter that characterizes the operation state of a node server. There may be one or more operation indexes. For example, the operation indexes may include, but are not limited to: disk capacity, CPU load, CPU utilization, and network card IO traffic. The operation index timing information may be the index information of one operation index in one node server at each operation time, sorted by operation time. It can be used to represent how the operation index changes over a period of time. Each operation index has its own operation index timing information.

Specifically, each node server in the node pool may periodically collect the operation index information corresponding to each operation index, and send the collected operation index information, the collection timestamp, and the node identification information to the exception handling apparatus. Based on the received operation index information, collection timestamps, and node identification information, the exception handling apparatus merges the operation index information of the same operation index in the same node server and orders it by time, obtaining the operation index timing information corresponding to each operation index in each node server. It should be noted that the operation index timing information may be updated dynamically as the node servers send operation index information in real time, so that the timing information for the most recent period is stored and the current operation state of each node server is represented more accurately.
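The merging and time-ordering step described above can be sketched as follows. This is only an illustrative sketch: the `Sample` record, the function name `build_series`, the metric names, and the fixed-size window are assumptions introduced here, not details given in the disclosure.

```python
from collections import defaultdict, deque
from typing import NamedTuple


class Sample(NamedTuple):
    """One report from a node server (field names are hypothetical)."""
    node_id: str      # node identification information
    metric: str       # name of the operation index, e.g. "cpu_load"
    timestamp: float  # collection timestamp, in seconds
    value: float      # operation index information (the measured value)


def build_series(samples, window=5):
    """Merge samples per (node, metric) and order each group by timestamp.

    Only the `window` most recent points are kept, which approximates
    "storing the timing information for the most recent period".
    """
    grouped = defaultdict(list)
    for s in samples:
        grouped[(s.node_id, s.metric)].append((s.timestamp, s.value))
    # sorted() orders by timestamp first; deque(maxlen=...) drops the oldest.
    return {key: deque(sorted(points), maxlen=window)
            for key, points in grouped.items()}
```

A bounded `deque` is used here so that old points fall out automatically as new ones arrive; a time-based retention rule would serve equally well.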

S120, detecting whether each node server meets an abnormal alarm condition based on the operation index time sequence information, and determining the abnormal node servers meeting the abnormal alarm condition and the abnormal operation indexes corresponding to the abnormal node servers.

The abnormal alarm condition may be a condition, set in advance based on service requirements, under which an alarm is raised because the index information is abnormal. For example, the abnormal alarm condition may be: the index information is greater than or equal to a preset information threshold throughout a continuous preset time period, that is, the node has remained in an abnormal operation state for the whole of that period. The operation indexes may share the same abnormal alarm condition or have different abnormal alarm conditions. The abnormal node server may refer to a node server that is currently operating abnormally. The abnormal operation index may refer to an operation index that satisfies its abnormal alarm condition, and can be used to represent the reason why the abnormal node server is in an abnormal operation state.

Specifically, for each node server, whether the operation index time sequence information corresponding to each of its operation indexes meets the corresponding abnormal alarm condition can be detected. If any operation index meets its abnormal alarm condition, the node server is determined to be an abnormal node server, and that operation index is determined to be the abnormal operation index corresponding to the abnormal node server. Abnormal nodes can thus be identified automatically and found in time without manual participation, which greatly improves the efficiency of exception identification. Because identification uses the operation index time sequence information rather than a single reading, unnecessary repair operations caused by transient jitter of a node can be avoided, which improves identification accuracy, prevents erroneous repairs, and saves repair resources. By checking the operation index time sequence information in real time, abnormal nodes can be found more quickly and the subsequent repair operation can be triggered promptly. The sensitivity of exception identification can be tuned by adjusting the abnormal alarm conditions according to actual business requirements.
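As a hedged illustration of the example alarm condition (index information at or above a preset threshold throughout a continuous preset time period), a check over one time series might look like the sketch below. The function names, the `(timestamp, value)` representation, and the rule that the series must span at least the full period are assumptions made for this example.

```python
def meets_alarm_condition(series, threshold, duration):
    """Return True if every sample in the trailing `duration` is >= threshold.

    `series` is a list of (timestamp, value) pairs sorted by timestamp.
    A series that does not yet span `duration` never triggers (assumption),
    so a single high reading cannot raise an alarm.
    """
    if not series:
        return False
    latest = series[-1][0]
    if latest - series[0][0] < duration:
        return False  # not enough history to cover the preset period
    recent = [value for ts, value in series if ts >= latest - duration]
    return all(value >= threshold for value in recent)


def find_abnormal_metrics(node_series, conditions):
    """Return the metrics of one node whose series meet their own condition.

    `node_series` maps metric name -> series;
    `conditions` maps metric name -> (threshold, duration).
    """
    return [metric for metric, series in node_series.items()
            if metric in conditions
            and meets_alarm_condition(series, *conditions[metric])]
```

With a 30-second period, a series that stays above the threshold for the whole period triggers the condition, while a series with one brief spike does not, which is the jitter-suppression effect described above.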

It should be noted that if a plurality of abnormal node servers are detected at the same time, a plurality of repair processes may be created so that the subsequent exception repair operations are performed concurrently for each abnormal node server, which improves repair efficiency and reduces the adverse effects of node exceptions.
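A minimal sketch of running repairs concurrently is shown below. It uses a thread pool from the Python standard library as a stand-in for the "repair processes" mentioned above; the function name `repair_all` and the `repair_one` callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def repair_all(abnormal_nodes, repair_one, max_workers=4):
    """Run repair_one(node_id) for every abnormal node concurrently.

    Returns {node_id: result}. `repair_one` is a caller-supplied function;
    what it actually does (restart, clean-up, ...) is outside this sketch.
    """
    if not abnormal_nodes:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input nodes.
        results = pool.map(repair_one, abnormal_nodes)
        return dict(zip(abnormal_nodes, results))
```

Threads suit this sketch because repair operations are expected to be I/O-bound calls to remote nodes; separate processes or an asynchronous task queue would be equally valid choices.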

S130, determining a target abnormal type corresponding to the abnormal node server based on the abnormal operation index, and determining a target repairing processing mode matched with the target abnormal type.

The target exception type may refer to the type of node server operation exception indicated by the abnormal operation index. For example, the target exception types may include: a response exception type, a memory exception type, and other exception types besides these two. The response exception type may refer to the condition in which the node server responds slowly. The memory exception type may refer to the condition in which the storage space of the node server is insufficient. Different operation indexes can correspond to the same exception type or to different exception types; that is, operation indexes and exception types may be in one-to-one or many-to-one correspondence. For example, if the abnormal operation index is CPU load, CPU utilization, or network card IO traffic, the corresponding target exception type is the response exception type. If the abnormal operation index is disk capacity, the corresponding target exception type is the memory exception type. The repair processing mode may refer to an exception repair scheme for the node server. Exception types correspond to repair processing modes one to one, so that different exception types can be repaired in different ways and the repair is more accurate.

Specifically, the abnormal type corresponding to the abnormal operation index may be determined as the target abnormal type corresponding to the abnormal node server based on a preset correspondence between the operation index and the abnormal type. The method may determine, based on a correspondence between the preset exception type and the repair processing manner, the repair processing manner corresponding to the target exception type as a target repair processing manner that matches the target exception type.

For example, "determining a target repair processing mode matched with the target abnormal type" in S130 may include: if the target exception type is the response exception type, determining that the target repair processing mode is to restart the abnormal node server; if the target exception type is the memory exception type, determining that the target repair processing mode is to perform a directory cleaning operation on the abnormal node server; and if the target exception type is another exception type, determining that the target repair processing mode is to take the abnormal node server offline, reinstall it, and then bring it back online.

Specifically, when the target exception type is the response exception type, the exception can be repaired by restarting the abnormal node server. When the target exception type is the memory exception type, the exception can be repaired by performing a directory cleaning operation on a specified directory in the abnormal node server. When the target exception type is another exception type, the exception can be repaired by taking the abnormal node server offline, reinstalling it, and then bringing it back online. A matched target repair processing mode is thus determined specifically for each target exception type, so that the abnormal condition is repaired accurately.
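The two preset correspondences (operation index to exception type, and exception type to repair processing mode) can be expressed as lookup tables. The sketch below is illustrative only: the enum and metric names are hypothetical, and treating any unlisted operation index as the "other" exception type is an assumption of this example.

```python
from enum import Enum


class ExceptionType(Enum):
    RESPONSE = "response"  # node server responds slowly
    MEMORY = "memory"      # storage space is insufficient
    OTHER = "other"        # any other exception


class RepairMode(Enum):
    RESTART = "restart"
    CLEAN_DIRECTORY = "clean_directory"
    OFFLINE_REINSTALL_ONLINE = "offline_reinstall_online"


# Preset correspondence between operation indexes and exception types.
METRIC_TO_TYPE = {
    "cpu_load": ExceptionType.RESPONSE,
    "cpu_utilization": ExceptionType.RESPONSE,
    "nic_io_traffic": ExceptionType.RESPONSE,
    "disk_capacity": ExceptionType.MEMORY,
}

# Preset one-to-one correspondence between exception types and repair modes.
TYPE_TO_REPAIR = {
    ExceptionType.RESPONSE: RepairMode.RESTART,
    ExceptionType.MEMORY: RepairMode.CLEAN_DIRECTORY,
    ExceptionType.OTHER: RepairMode.OFFLINE_REINSTALL_ONLINE,
}


def target_repair_mode(abnormal_metric):
    """Return (target exception type, target repair mode) for one metric."""
    # Assumption: metrics without an explicit entry fall into OTHER.
    exc_type = METRIC_TO_TYPE.get(abnormal_metric, ExceptionType.OTHER)
    return exc_type, TYPE_TO_REPAIR[exc_type]
```

Keeping both correspondences as data rather than branching logic means a new operation index or repair mode can be added by editing a table.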

S140, performing exception repair on the abnormal node server based on the target repair processing mode.

Specifically, the target repair processing mode matched with the target exception type is used to perform a targeted exception repair operation on the abnormal node server. Because different repair operations are executed for different exception types, the repair can be completed more accurately and efficiently, which ensures that the abnormal node server is repaired in time and reduces the adverse effects of abnormal nodes.

According to the technical solution of this embodiment, the operation index time sequence information corresponding to each node server in the node pool is obtained, and whether each node server meets the abnormal alarm condition is detected based on that information, so that abnormal node servers meeting the abnormal alarm condition can be identified automatically and found in time. A target abnormal type corresponding to the abnormal node server is then determined according to the abnormal operation index corresponding to the abnormal node server, and the abnormal node server is repaired automatically and in a targeted manner using the target repair processing mode matched with the target abnormal type. This improves the accuracy and efficiency of exception repair and ensures normal operation of the node.

On the basis of the above technical solution, S110 may include: acquiring operation index information, acquisition timestamps and node identification information corresponding to each operation index acquired by each node server in the node pool through a proxy process; and performing information combination and time sequence processing on the operation index information of the same operation index in the same node server based on the operation index information, the acquisition timestamp and the node identification information corresponding to each operation index, and determining the operation index time sequence information corresponding to each operation index in each node server.

In particular, Fig. 2 provides an exemplary diagram of an exception handling process. As shown in Fig. 2, in the FaaS platform an agent process may be deployed in each node server, and the agent process may periodically collect the operation index information (i.e., the operation index values) corresponding to each operation index. The agent process in each node server sends each piece of collected operation index information, together with the corresponding collection timestamp and the node identification information, to the exception handling apparatus. Based on the received operation index information, collection timestamps, and node identification information, the exception handling apparatus merges the operation index information of the same operation index in the same node server and orders it by time, obtaining the operation index time sequence information corresponding to each operation index in each node server. The exception handling apparatus then performs exception identification based on the operation index time sequence information corresponding to each node server, and can promptly repair a currently abnormal node server through the FaaS control server, thereby avoiding the adverse effects of node exceptions.

On the basis of the foregoing technical solution, the "determining a target abnormal type corresponding to the abnormal node server based on the abnormal operation index" in S130 may include: if at least two abnormal operation indexes exist, determining a candidate abnormal type corresponding to each abnormal operation index based on the corresponding relation between the operation indexes and the abnormal types; if only one candidate abnormal type exists, determining the candidate abnormal type as a target abnormal type corresponding to the abnormal node server; and if at least two candidate exception types exist, determining a target exception type corresponding to the exception node server from the candidate exception types according to the exception processing priority corresponding to each candidate exception type.

The exception handling priority may be a priority that characterizes the order of exception repair. For example, the higher the exception handling priority, the earlier the corresponding exception type is repaired. The exception handling priority may be set based on how severely the exception type affects operation; for example, the response exception type has a higher exception handling priority than the memory exception type. Alternatively, the exception handling priority may be set based on how thorough the repair processing mode corresponding to the exception type is. For example, the exception type whose repair mode is taking the node offline, reinstalling it, and bringing it back online has a higher priority than the exception type whose repair mode is restarting, because the offline-reinstall-online repair also resolves the cause that would otherwise require a restart, so no further repair operation is needed and repair efficiency is improved.

Specifically, the same node server may have one or more abnormal operation indexes. If only one abnormal operation index currently exists, the exception type corresponding to it can be directly determined as the target exception type corresponding to the abnormal node server. If at least two abnormal operation indexes currently exist, the exception type corresponding to each abnormal operation index is determined as a candidate exception type. Since different abnormal operation indexes may correspond to the same exception type, one or more candidate exception types may be obtained. If only one candidate exception type exists, it may be directly determined as the target exception type. If at least two candidate exception types exist, the exception handling priorities corresponding to the candidate exception types can be compared, and the candidate exception type with the highest exception handling priority is determined as the target exception type, so that the exception type with the most serious influence is repaired first.
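The candidate-selection logic above can be sketched as a small function. The type names, the metric names, and the numeric priority values in the usage note are hypothetical; only the rule "pick the candidate with the highest exception handling priority" comes from the description.

```python
def select_target_type(abnormal_metrics, metric_to_type, priority):
    """Pick the target exception type for one abnormal node server.

    `metric_to_type` maps operation index -> exception type (preset).
    `priority` maps exception type -> number; larger means repaired first.
    """
    # Different metrics may map to the same type, so collect into a set.
    candidates = {metric_to_type[m] for m in abnormal_metrics}
    if not candidates:
        raise ValueError("an abnormal node needs at least one abnormal metric")
    if len(candidates) == 1:
        return next(iter(candidates))
    return max(candidates, key=lambda exc_type: priority[exc_type])
```

For instance, with priorities `{"other": 3, "response": 2, "memory": 1}` (illustrative values), a node whose CPU load and disk capacity are both abnormal yields the candidates `response` and `memory`, and `response` is selected.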

Fig. 3 is a schematic flow chart of another exception handling method provided by the embodiment of the present disclosure, and the embodiment of the present disclosure optimizes the entire repair process of an abnormal node server on the basis of the foregoing embodiment. Wherein explanations of the same or corresponding terms as those used in the above-disclosed embodiments are omitted.

As shown in fig. 3, the exception handling method specifically includes the following steps:

S310, obtaining operation index time sequence information corresponding to each node server in the node pool.

S320, detecting whether each node server meets an abnormal alarm condition based on the operation index time sequence information, and determining the abnormal node servers meeting the abnormal alarm condition and the abnormal operation indexes corresponding to the abnormal node servers.

S330, determining a target exception type corresponding to the exception node server based on the exception operation index, and determining a target repair processing mode matched with the target exception type.

S340, generating a node repair instruction corresponding to the abnormal node server, and sending the node repair instruction to the control server, so that the control server stops performing task allocation operations on the abnormal node server based on the node repair instruction.

The control server may be a server responsible for distributing the received task to a specific node server for task processing. A node repair instruction may refer to an instruction to initiate maintenance of a node.

Specifically, after the abnormal node server meeting the abnormal alarm condition is identified, a node repair instruction including the abnormal node identification information corresponding to the abnormal node server may be generated, and the node repair instruction may be sent to the control server. The control server may stop performing task allocation operation on the abnormal node server corresponding to the abnormal node identification information based on the abnormal node identification information in the node repair instruction, thereby avoiding scheduling a new task to the abnormal node server and further ensuring successful execution of the task.

It should be noted that, step S340 may be sequentially executed after step S330, or may be executed before step S330 and after the abnormal node server is determined in step S320, and the execution sequence of step S340 is not limited in this embodiment.

S350, detecting whether the tasks already allocated to the abnormal node server have finished executing.

Specifically, after new tasks stop being allocated to the abnormal node server, whether all tasks already allocated to the abnormal node server have finished executing may be detected. For example, the number of tasks currently remaining in the node server may be checked: if it is zero, all allocated tasks have finished executing; otherwise, the repair is deferred until all tasks allocated to the abnormal node server have finished executing.
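One simple way to realize this wait is to poll the remaining task count, as in the sketch below. The function name, the polling interval, and the optional timeout are assumptions added for illustration; the description itself only requires waiting until the remaining count reaches zero.

```python
import time


def wait_until_drained(get_remaining_tasks, poll_interval=1.0,
                       timeout=None, sleep=time.sleep):
    """Block until the node reports zero remaining tasks.

    `get_remaining_tasks` is a caller-supplied function returning the
    current number of unfinished tasks on the abnormal node server.
    Returns True once drained, or False if the optional `timeout`
    (an assumption, not part of the described method) elapses first.
    """
    waited = 0.0
    while get_remaining_tasks() > 0:
        if timeout is not None and waited >= timeout:
            return False
        sleep(poll_interval)
        waited += poll_interval
    return True
```

The `sleep` parameter is injectable only so the function can be exercised without real delays.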

S360, if the execution is finished, performing exception repair on the abnormal node server based on the target repair processing mode.

Specifically, when all tasks assigned to the abnormal node server have finished executing, the abnormal node server is repaired based on the target repair processing mode, so that the impact of the repair is kept as small as possible. Because function tasks in the FaaS platform are short-lived, waiting for all assigned tasks to finish before repairing allows tasks that can still run normally on the abnormal node server to complete, so the repair happens at a more suitable time and the impact on running tasks is minimized.

S370, detecting whether the abnormal node server is in a normal operation state based on the re-acquired operation index time sequence information corresponding to the abnormal node server.

Specifically, after the abnormal node server is abnormally repaired, whether the repaired abnormal node server recovers to a normal operation state may be detected based on the operation index timing information corresponding to the abnormal node server, which is obtained again, so as to determine whether the node server may be reused.

S380, if the abnormal node server is in a normal operation state, generating a node recovery instruction corresponding to the abnormal node server, and sending the node recovery instruction to the control server, so that the control server continues to perform task allocation operations on the abnormal node server based on the node recovery instruction.

Specifically, if the repaired abnormal node server is in a normal operation state, it indicates that the repair is successful, and at this time, a node recovery instruction including the abnormal node identification information and corresponding to the abnormal node server may be generated, and the node recovery instruction is sent to the control server. The control server can continue to perform task allocation operation on the abnormal node server corresponding to the abnormal node identification information based on the abnormal node identification information in the node recovery instruction, so that the task can be rescheduled to the repaired node server, and the successful execution of the task is further ensured.
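On the control server side, handling the node repair and node recovery instructions amounts to excluding a node from task allocation and later re-admitting it. The following sketch illustrates that behavior; the class name, method names, and the round-robin allocation policy are all assumptions, since the disclosure does not specify how the control server selects a node.

```python
class ControlServerSketch:
    """Toy task allocator that honors repair and recovery instructions."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.under_repair = set()  # nodes excluded from task allocation
        self._counter = 0

    def handle_repair_instruction(self, node_id):
        """Stop allocating new tasks to the identified abnormal node."""
        self.under_repair.add(node_id)

    def handle_recovery_instruction(self, node_id):
        """Resume allocating tasks to the repaired node."""
        self.under_repair.discard(node_id)

    def assign(self, task):
        """Return the node chosen for `task` (round-robin, an assumption)."""
        eligible = [n for n in self.node_ids if n not in self.under_repair]
        if not eligible:
            raise RuntimeError("no node server available for " + str(task))
        node = eligible[self._counter % len(eligible)]
        self._counter += 1
        return node
```

While a node is in `under_repair`, `assign` never returns it; after the recovery instruction it becomes eligible again.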

For example, if the repaired abnormal node server is still in an abnormal operation state, the repair has failed. In this case the abnormal node server may be taken offline for repair: for example, a node offline instruction corresponding to the abnormal node server is sent to the control server, so that the control server takes the abnormal node server offline based on the node offline instruction and no subsequent tasks are distributed to it. After the abnormal node server is offline, it can be repaired manually, and once it has been repaired successfully it is brought back online so that it can process tasks normally, ensuring successful task execution.

According to the technical scheme of the embodiment of the disclosure, after the abnormal node server satisfying the abnormal alarm condition is identified, the node repairing instruction corresponding to the abnormal node server is sent to the control server, so that the control server stops the task allocation operation on the abnormal node server. This prevents new tasks from being dispatched to the abnormal node server before it is repaired. Only when the tasks already distributed to the abnormal node server are detected to have finished executing is the abnormal node server repaired based on the target repair processing mode, so that the repair takes place at a suitable time and its influence on running tasks is reduced as much as possible. When the repaired abnormal node server is detected to be in a normal running state, the node recovery instruction corresponding to the abnormal node server is sent to the control server, so that the control server resumes the task allocation operation on the abnormal node server. Tasks can then be scheduled to the repaired node server again, and their successful execution is guaranteed.
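The flow summarised above (stop allocation, wait for allocated tasks to finish, repair, re-check, then recover or go offline) can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: `ControlServerStub`, the instruction names, the callback parameters and the `max_polls` budget are all hypothetical, and a real poller would sleep between checks.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ControlServerStub:
    """Hypothetical stand-in for the control server; it only records instructions."""
    log: List[str] = field(default_factory=list)

    def send(self, instruction: str, node_id: str) -> None:
        self.log.append(f"{instruction}:{node_id}")


def repair_flow(
    node_id: str,
    control: ControlServerStub,
    pending_tasks: Callable[[str], int],
    repair: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_polls: int = 10,
) -> bool:
    """Return True if the node was repaired and handed back for scheduling."""
    # Node repairing instruction: the control server stops allocating new tasks.
    control.send("node_repair", node_id)

    # Wait until the tasks already allocated to the node have finished.
    for _ in range(max_polls):
        if pending_tasks(node_id) == 0:
            break
    else:
        # Assumed policy: if tasks never drain within the budget, do not repair.
        return False

    # Apply the target repair processing mode chosen for this node.
    repair(node_id)

    # Re-check health; in the described scheme this uses re-acquired metrics.
    if is_healthy(node_id):
        control.send("node_recovery", node_id)
        return True

    # Repair failed: take the node offline so it can be repaired manually.
    control.send("node_offline", node_id)
    return False
```

Passing the task counter, the repair action and the health check as callbacks keeps the ordering logic separate from how each step is actually carried out on a given platform.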

Fig. 4 is a schematic structural diagram of an exception handling apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus specifically includes: an operation index timing information acquisition module 410, an abnormal alarm detection module 420, a target repair processing mode determination module 430 and an abnormal repair module 440.

The operation index timing information acquisition module 410 is configured to acquire operation index timing information corresponding to each node server in the node pool; the abnormal alarm detection module 420 is configured to detect whether each node server satisfies an abnormal alarm condition based on the operation index timing information, and determine an abnormal node server satisfying the abnormal alarm condition and an abnormal operation index corresponding to the abnormal node server; the target repair processing mode determination module 430 is configured to determine, based on the abnormal operation index, a target abnormal type corresponding to the abnormal node server, and determine a target repair processing mode matched with the target abnormal type; and the abnormal repair module 440 is configured to perform abnormal repair on the abnormal node server based on the target repair processing mode.

According to the technical scheme provided by the embodiment of the disclosure, the operation index time sequence information corresponding to each node server in the node pool is obtained, and whether each node server meets the abnormal alarm condition is detected based on the operation index time sequence information, so that the abnormal node server meeting the abnormal alarm condition can be automatically identified, and the abnormal node server can be found in time. And determining a target abnormal type corresponding to the abnormal node server according to the abnormal operation index corresponding to the abnormal node server, and automatically and pertinently performing abnormal repair on the abnormal node server by using a target repair processing mode matched with the target abnormal type, so that the accuracy and efficiency of abnormal repair are improved, and the normal operation of the node is ensured.

On the basis of the above technical scheme, the device further comprises:

and the node repairing instruction sending module is used for generating a node repairing instruction corresponding to the abnormal node server before the abnormal node server is subjected to abnormal repairing based on the target repairing processing mode, and sending the node repairing instruction to the control server so that the control server stops performing task allocation operation on the abnormal node server based on the node repairing instruction.

On the basis of the above technical solutions, the abnormal repair module 440 is specifically configured to:

detecting whether the execution of the distributed tasks in the abnormal node server is finished or not; and if the execution is finished, performing exception recovery on the abnormal node server based on the target recovery processing mode.

On the basis of the above technical solutions, the apparatus further includes:

the running state detection module is used for detecting whether the abnormal node server is in a normal running state or not based on the corresponding re-acquired running index time sequence information of the abnormal node server after the abnormal node server is abnormally repaired based on the target repairing processing mode;

and the node recovery instruction sending module is used for generating a node recovery instruction corresponding to the abnormal node server if the abnormal node server is in a normal running state, and sending the node recovery instruction to the control server so that the control server continues to perform task allocation operation on the abnormal node server based on the node recovery instruction.

On the basis of the above technical solutions, the operation index timing information acquisition module 410 is specifically configured to:

acquiring operation index information, acquisition timestamps and node identification information corresponding to each operation index acquired by each node server in the node pool through a proxy process; and performing information combination and time sequence processing on the operation index information of the same operation index in the same node server based on the operation index information, the acquisition timestamp and the node identification information corresponding to each operation index, and determining the operation index time sequence information corresponding to each operation index in each node server.
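The information combination and time sequence processing described above can be illustrated with a small sketch. The `Sample` tuple layout and the function name `build_time_series` are assumptions made for illustration; the disclosure does not specify a data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout: (node_id, metric_name, collection_timestamp, value),
# as reported for one operation index by a node's proxy process.
Sample = Tuple[str, str, float, float]
SeriesKey = Tuple[str, str]


def build_time_series(samples: Iterable[Sample]) -> Dict[SeriesKey, List[Tuple[float, float]]]:
    """Merge samples of the same metric on the same node and order them by time."""
    series: Dict[SeriesKey, List[Tuple[float, float]]] = defaultdict(list)
    for node_id, metric, timestamp, value in samples:
        # Samples sharing a node identifier and a metric belong to one series.
        series[(node_id, metric)].append((timestamp, value))
    for points in series.values():
        # Time sequence processing: sort each series by collection timestamp.
        points.sort(key=lambda point: point[0])
    return dict(series)
```

Keying each series by the pair of node identifier and metric name means that samples arriving out of order or interleaved across nodes still end up in the correct per-node, per-metric sequence.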

On the basis of the above technical solutions, the target repair processing manner determining module 430 includes a target exception type determining unit, which is specifically configured to:

if at least two abnormal operation indexes exist, determining a candidate abnormal type corresponding to each abnormal operation index based on the corresponding relation between the operation indexes and the abnormal types;

if only one candidate abnormal type exists, determining the candidate abnormal type as a target abnormal type corresponding to the abnormal node server;

and if at least two candidate exception types exist, determining a target exception type corresponding to the exception node server from the candidate exception types according to the exception handling priority corresponding to each candidate exception type.
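The selection rule above can be sketched as a lookup followed by a priority comparison. The metric names, the type labels, the priority values and the convention that a smaller number means a higher handling priority are all hypothetical; the disclosure only states that a correspondence and a priority exist.

```python
from typing import Dict, List

# Hypothetical correspondence between operation indexes and exception types.
METRIC_TO_TYPE: Dict[str, str] = {
    "heartbeat_latency": "response",
    "request_timeout_rate": "response",
    "memory_usage": "memory",
    "kernel_error_count": "other",
}

# Hypothetical handling priorities; a smaller number is handled first.
TYPE_PRIORITY: Dict[str, int] = {"other": 0, "response": 1, "memory": 2}


def target_exception_type(abnormal_metrics: List[str]) -> str:
    """Pick the exception type to handle for a node with the given abnormal metrics."""
    if not abnormal_metrics:
        raise ValueError("at least one abnormal metric is required")
    # Map every abnormal metric to its candidate type; duplicates collapse in the set.
    candidates = {METRIC_TO_TYPE[metric] for metric in abnormal_metrics}
    if len(candidates) == 1:
        return next(iter(candidates))
    # Several candidate types: choose the one with the highest handling priority.
    return min(candidates, key=lambda exception_type: TYPE_PRIORITY[exception_type])
```

For instance, with these assumed tables, a node whose abnormal metrics are `memory_usage` and `heartbeat_latency` has two candidate types and is handled as a response exception.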

On the basis of the above technical solutions, the target repair processing manner determining module 430 further includes a target repair processing mode determining unit, which is specifically configured to:

if the target exception type is a response exception type, determining that a target repair processing mode is to restart the abnormal node server;

if the target exception type is the memory exception type, determining that a target repair processing mode is a directory cleaning operation on the abnormal node server;

and if the target exception type is any other exception type, determining that the target repair processing mode is to take the abnormal node server offline, reinstall it, and then bring it back online.
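The three-way mapping from exception type to repair mode can be written as a small dispatch function. The enum and its member names are hypothetical labels; the fallback follows the wording of claim 7, where the remaining types are handled by offline reinstallation followed by bringing the node back online.

```python
from enum import Enum


class RepairMode(Enum):
    """Hypothetical labels for the three repair processing modes."""
    RESTART = "restart"
    CLEAN_DIRECTORIES = "clean_directories"
    OFFLINE_REINSTALL_ONLINE = "offline_reinstall_online"


def choose_repair_mode(exception_type: str) -> RepairMode:
    """Map a target exception type to its matching repair processing mode."""
    if exception_type == "response":
        # Response exceptions are handled by restarting the node server.
        return RepairMode.RESTART
    if exception_type == "memory":
        # Memory exceptions are handled by a directory cleaning operation.
        return RepairMode.CLEAN_DIRECTORIES
    # Any other type: take the node offline, reinstall, then bring it online.
    return RepairMode.OFFLINE_REINSTALL_ONLINE
```

Treating the heaviest repair as the default branch means an exception type that is not explicitly listed still receives a defined repair mode.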

The exception handling device provided by the embodiment of the disclosure can execute the exception handling method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects for executing the exception handling method.

It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the embodiments of the present disclosure.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now to fig. 5, a schematic diagram of an electronic device (e.g., the terminal device or the server in fig. 5) 500 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.

The electronic device provided by the embodiment of the present disclosure and the exception handling method provided by the above embodiment belong to the same inventive concept, and technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment and the above embodiment have the same beneficial effects.

The disclosed embodiments provide a computer storage medium on which a computer program is stored, which when executed by a processor implements the exception handling method provided by the above-described embodiments.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire operation index time sequence information corresponding to each node server in a node pool; detect whether each node server meets an abnormal alarm condition based on the operation index time sequence information, and determine an abnormal node server meeting the abnormal alarm condition and an abnormal operation index corresponding to the abnormal node server; determine a target abnormal type corresponding to the abnormal node server based on the abnormal operation index, and determine a target repair processing mode matched with the target abnormal type; and perform exception repair on the abnormal node server based on the target repair processing mode.

Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation of the unit itself; for example, the first obtaining unit may also be described as a "unit obtaining at least two internet protocol addresses".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure [ example one ] there is provided an exception handling method comprising:

acquiring running index time sequence information corresponding to each node server in a node pool;

According to one or more embodiments of the present disclosure [ example two ] there is provided an exception handling method, further comprising:

optionally, before performing the exception recovery on the abnormal node server based on the target recovery processing manner, the method further includes:

and generating a node repairing instruction corresponding to the abnormal node server, and sending the node repairing instruction to a control server so that the control server stops performing task allocation operation on the abnormal node server based on the node repairing instruction.

According to one or more embodiments of the present disclosure, [ example three ] there is provided an exception handling method, further comprising:

optionally, the performing, on the basis of the target repair processing manner, the abnormal node server for abnormal repair includes:

detecting whether the execution of the distributed tasks in the abnormal node server is finished;

and if the execution is finished, performing exception recovery on the abnormal node server based on the target recovery processing mode.

According to one or more embodiments of the present disclosure, [ example four ] there is provided an exception handling method, further comprising:

optionally, after performing exception recovery on the abnormal node server based on the target recovery processing manner, the method further includes:

detecting whether the abnormal node server is in a normal operation state or not based on the re-acquired operation index time sequence information corresponding to the abnormal node server;

and if the abnormal node server is in a normal running state, generating a node recovery instruction corresponding to the abnormal node server, and sending the node recovery instruction to the control server so that the control server continues to perform task allocation operation on the abnormal node server based on the node recovery instruction.

According to one or more embodiments of the present disclosure [ example five ] there is provided an exception handling method, further comprising:

optionally, the obtaining of the operation index timing information corresponding to each node server in the node pool includes:

acquiring operation index information, acquisition timestamps and node identification information corresponding to each operation index acquired by each node server in the node pool through a proxy process;

and performing information combination and time sequence processing on the operation index information of the same operation index in the same node server based on the operation index information, the acquisition timestamp and the node identification information corresponding to each operation index, and determining the operation index time sequence information corresponding to each operation index in each node server.

According to one or more embodiments of the present disclosure, [ example six ] there is provided an exception handling method, further comprising:

optionally, the determining, based on the abnormal operation index, a target abnormal type corresponding to the abnormal node server includes:

if only one candidate exception type exists, determining the candidate exception type as a target exception type corresponding to the exception node server;

According to one or more embodiments of the present disclosure [ example seven ] there is provided an exception handling method, further comprising:

optionally, the determining a target repair processing manner matched with the target exception type includes:

if the target exception type is a response exception type, determining that a target repair processing mode is to restart the exception node server;

if the target exception type is a memory exception type, determining that a target repair processing mode is to perform directory cleaning operation on the abnormal node server;

According to one or more embodiments of the present disclosure [ example eight ] there is provided an exception handling apparatus comprising:

the target repair processing mode determining module is used for determining a target abnormal type corresponding to the abnormal node server based on the abnormal operation index and determining a target repair processing mode matched with the target abnormal type;

and the abnormal repair module is used for performing abnormal repair on the abnormal node server based on the target repair processing mode.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other combinations of the features described above or their equivalents without departing from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. An exception handling method, comprising:

2. The exception handling method according to claim 1, further comprising, before performing the exception repair on the abnormal node server based on the target repair handling manner:

3. The exception handling method according to claim 2, wherein said performing exception repair on the abnormal node server based on the target repair handling manner includes:

4. The exception handling method according to claim 2, further comprising, after performing the exception repair on the abnormal node server based on the target repair handling manner:

and if the abnormal node server is in a normal operation state, generating a node recovery instruction corresponding to the abnormal node server, and sending the node recovery instruction to the control server so that the control server continues to perform task allocation operation on the abnormal node server based on the node recovery instruction.

5. The exception handling method according to claim 1, wherein obtaining the operation index timing information corresponding to each node server in the node pool includes:

6. The exception handling method according to claim 1, wherein the determining, based on the abnormal operation index, a target exception type corresponding to the abnormal node server includes:

7. The exception handling method according to any one of claims 1 to 6, wherein said determining a target repair handling style matching the target exception type comprises:

and if the target exception type is other exception types, determining that the target repair processing mode is to perform offline reinstallation and then online operation on the exception node server.

8. An exception handling apparatus, comprising:

an abnormal alarm detection module, configured to detect whether each node server satisfies an abnormal alarm condition based on the operation index timing information, and determine an abnormal node server satisfying the abnormal alarm condition and an abnormal operation index corresponding to the abnormal node server;

9. An electronic device, characterized in that the electronic device comprises:

one or more processors;

a storage device configured to store one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the exception handling method of any one of claims 1 to 7.

10. A storage medium containing computer-executable instructions for performing the exception handling method of any one of claims 1 to 7 when executed by a computer processor.