WO2023103344A1

WO2023103344A1 - Data processing method and apparatus, device, and storage medium

Info

Publication number: WO2023103344A1
Application number: PCT/CN2022/100708
Authority: WO
Inventors: 陈鉴镔; 杨军; 卢道和; 陈刚; 程志峰; 朱嘉伟; 罗海湾; 李勋棋; 汪晓雪; 周琪; 郭英亚; 李兴龙; 胡仲臣; 周佳振; 文玉茹; 何勇彬
Original assignee: 深圳前海微众银行股份有限公司
Priority date: 2021-12-08
Filing date: 2022-06-23
Publication date: 2023-06-15
Also published as: CN114157553A

Abstract

The present application discloses a data processing method and apparatus, a device, and a storage medium. The method comprises: obtaining alarm information of at least two node devices comprised in the data processing system; determining at least one anomaly subsystem on the basis of the at least two node devices; performing first processing on each anomaly subsystem in the at least one anomaly subsystem to obtain alarm information of the at least one anomaly subsystem, the first processing comprising: converging the alarm information of at least one node device comprised in the anomaly subsystem as alarm information of the anomaly subsystem; and correspondingly displaying the alarm information of the anomaly subsystem for each anomaly subsystem in the at least one anomaly subsystem. According to the solution, the anomaly subsystem and the corresponding alarm information can be clearly displayed, so that fault positioning efficiency is improved.

Description

A data processing method, device, equipment and storage medium

Cross References to Related Applications

This application is based on a Chinese patent application with application number 202111491784.9 and a filing date of December 8, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated into this application by reference.

Technical Field

This application relates to the technical field of data processing, involving but not limited to data processing methods, devices, equipment and storage media.

Background

With the rapid development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). However, the security and real-time requirements of the financial industry place ever higher demands on these technologies. With the continuous development of network technology, there are more and more devices in a network system (also called a data processing system), and the relationships between devices are becoming more and more complex. When a device in a data processing system fails, the failure may cause other related devices to fail as well, so locating the fault is particularly important in a complex data processing system.

In the related art, distributed monitoring (Zabbix) and open-source monitoring (open-falcon) are generally used to monitor a system. Specifically, an agent process is deployed on each monitored device, and the monitored device collects alarm information through the agent and reports the alarm information to a proxy through Zabbix or open-falcon. The proxy then reports the alarm information to a monitoring device (server) for aggregation, after which the alarm information can be displayed on the monitoring device in the instance dimension.

However, if the alarm information is listed only per monitored device, a large number of alarms may cause an alarm storm, making it difficult to locate the fault.

Summary

The present application provides a data processing method, device, equipment, and storage medium. The solution can clearly display abnormal subsystems and their corresponding alarm information, thereby improving the efficiency of fault location.

The technical solution of the present application is implemented as follows:

The present application provides a data processing method, the method is applied to a control device in a data processing system, the data processing system further includes a node device, and the method includes:

Acquiring alarm information of at least two node devices included in the data processing system;

determining at least one abnormal subsystem based on the at least two node devices;

Executing first processing for each abnormal subsystem in the at least one abnormal subsystem, so as to obtain the alarm information of the at least one abnormal subsystem; the first processing includes: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem;

For each abnormal subsystem in the at least one abnormal subsystem, correspondingly display the alarm information of the abnormal subsystem.

The present application provides a data processing device, the device is deployed in a control device in a data processing system, the data processing system further includes a node device, and the device includes:

an obtaining unit configured to obtain alarm information of at least two node devices included in the data processing system;

a determining unit configured to determine at least one abnormal subsystem based on the at least two node devices;

A processing unit configured to execute first processing for each abnormal subsystem in the at least one abnormal subsystem, so as to obtain alarm information of the at least one abnormal subsystem; the first processing includes: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem;

The display unit is configured to correspondingly display the alarm information of the abnormal subsystem for each abnormal subsystem in the at least one abnormal subsystem.

The present application also provides an electronic device, including: a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the above data processing method when executing the program.

The present application also provides a storage medium on which a computer program is stored, and when the computer program is executed by a processor, the above data processing method is realized.

The data processing method, device, equipment, and storage medium provided in the present application include: acquiring alarm information of at least two node devices included in the data processing system; determining at least one abnormal subsystem based on the at least two node devices; executing first processing for each abnormal subsystem in the at least one abnormal subsystem, so as to obtain the alarm information of the at least one abnormal subsystem, the first processing including: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem; and, for each abnormal subsystem in the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem. This solution converges alarm information in the device dimension into alarm information in the subsystem dimension and displays it in the subsystem dimension. In this way, the displayed content clearly shows which subsystems are abnormal and which are normal, improving the efficiency of locating faulty subsystems.

Brief Description of the Drawings

FIG. 1 is an optional structural schematic diagram of a data processing system provided in an embodiment of the present application;

FIG. 2 is an optional schematic flowchart of a data processing method provided in an embodiment of the present application;

FIG. 3 is an optional schematic flowchart of a data processing method provided in an embodiment of the present application;

FIG. 4 is an optional schematic flowchart of a data processing method provided in an embodiment of the present application;

FIG. 5 is a schematic flowchart of an optional data processing method provided in the embodiment of the present application;

FIG. 6 is a schematic flowchart of an optional data processing method provided in the embodiment of the present application;

FIG. 7 is a schematic flowchart of an optional data processing method provided in the embodiment of the present application;

Fig. 8 is an optional structural schematic diagram of the actuator provided by the embodiment of the present application;

FIG. 9 is an optional structural schematic diagram of a data processing device provided in an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an optional electronic device provided in an embodiment of the present application.

Detailed Description

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the specific technical solutions of the application will be further described in detail below in conjunction with the drawings in the embodiments of the present application. The following examples are used to illustrate the present application, but not to limit the scope of the present application.

In the following description, references to "some embodiments" describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or a different subset of all possible embodiments, and can be combined with each other without conflict.

In the following description, the terms "first\second\third" are used only to distinguish different objects; they do not represent a specific order of the objects and do not limit their sequence. It can be understood that "first\second\third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described here can be implemented in an order other than that illustrated or described here.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application, and are not intended to limit the present application.

Embodiments of the present application may provide a data processing method, device, equipment, and storage medium. In practical applications, the data processing method can be implemented by a data processing device, and the functional entities in the data processing device can be implemented cooperatively by the hardware resources of the equipment, such as computing resources (for example, processors) and communication resources (for example, resources supporting various modes of communication).

The data processing method provided in the embodiment of the present application is applied to a data processing system, and the data processing system includes a control device and a node device. The control device executes: acquiring alarm information of at least two node devices included in the data processing system; determining at least one abnormal subsystem based on the at least two node devices; executing first processing for each abnormal subsystem in the at least one abnormal subsystem to obtain the alarm information of the at least one abnormal subsystem, the first processing including: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem; and, for each abnormal subsystem in the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem.

As an example, the structure of the data processing system 10 may be as shown in FIG. 1, including: a control device 101, a node device 102, and a network 103. The control device 101 can communicate with the node device 102 through the network 103.

The control device 101 is configured to: acquire alarm information of at least two node devices included in the data processing system; determine at least one abnormal subsystem based on the at least two node devices; execute first processing for each abnormal subsystem in the at least one abnormal subsystem to obtain the alarm information of the at least one abnormal subsystem, the first processing including: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem; and, for each abnormal subsystem in the at least one abnormal subsystem, correspondingly display the alarm information of the abnormal subsystem.

The node device 102 may be a hardware device such as a server or a router, or may also be a virtual device such as a virtual machine or a container.

The network 103 is used for communication between the control device 101 and the node device 102. The network 103 may include a wired network, a wireless network, and the like.

It should be noted that the embodiment of the present application does not specifically limit the number of node devices 102 and the number of control devices 101 in the data processing system, which can be configured according to actual needs. In an example, there may be one control device 101 and multiple node devices 102, where the multiple node devices 102 belong to different subsystems.

Below, with reference to the schematic diagram of the data processing system shown in FIG. 1, various embodiments of the data processing method, device, equipment, and storage medium provided by the embodiments of the present application will be described.

In a first aspect, the embodiment of the present application provides a data processing method, which is applied to a data processing device; the data processing device can be deployed in the control device 101 in FIG. 1. Next, the data processing process provided by the embodiment of the present application will be described.

FIG. 2 shows a schematic flowchart of an optional data processing method. The data processing method provided in the embodiment of the present application may include, but is not limited to, S201 to S204 shown in FIG. 2.

S201. The control device acquires alarm information of at least two node devices included in the data processing system.

The data processing system includes a plurality of node devices, at least two of the plurality of node devices are abnormal, and the control device acquires alarm information of the at least two abnormal node devices.

It should be noted that the at least two abnormal node devices may be part or all of multiple node devices included in the data processing system. The embodiment of this application does not specifically limit the specific number of node devices that have alarm information, and can be configured according to actual needs.

The alarm information of the node device is used to indicate that the node device is abnormal. The embodiment of the present application does not limit the specific content of the alarm information of the node device, which can be configured according to actual needs.

In a possible implementation manner, the alarm information of the node device may include at least one first alarm level and a first alarm quantity corresponding to the at least one first alarm level. For example, the alarm information of the node device may include: a critical (Critical) alarm with a corresponding quantity of 1; a major (Major) alarm with a corresponding quantity of 2; and a minor (Minor) alarm with a corresponding quantity of 3. Here, the quantity 1 corresponding to the critical alarm indicates that there is 1 critical alarm; the quantity 2 corresponding to the major alarm indicates that there are 2 major alarms; and the quantity 3 corresponding to the minor alarm indicates that there are 3 minor alarms.

In a possible implementation manner, the alarm information of the node device may further include: an alarm type, an alarm time, and an alarm log, each corresponding to the at least one first alarm level, and so on.

Implementation of S201 may include: the control device receives the alarm logs reported by at least two node devices and, for each of the at least two node devices, processes the alarm log of that node device to obtain the alarm information of that node device.

S202. The control device determines at least one abnormal subsystem based on the at least two node devices.

The control device determines the subsystems to which the at least two node devices with alarm information belong as the at least one abnormal subsystem. The at least two node devices may belong to a single abnormal subsystem or to multiple abnormal subsystems.
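As an illustration of S202, the following is a minimal sketch that groups the node devices having alarm information by the subsystem they belong to. The mapping `NODE_TO_SUBSYSTEM`, the function `find_abnormal_subsystems`, and all identifiers are hypothetical and are not specified by the present application.

```python
from collections import defaultdict

# Hypothetical mapping from node device identifier to owning subsystem.
NODE_TO_SUBSYSTEM = {
    "node-1": "subsystem-1",
    "node-2": "subsystem-1",
    "node-3": "subsystem-2",
    "node-4": "subsystem-3",
}


def find_abnormal_subsystems(alarming_nodes, node_to_subsystem):
    """Group node devices that have alarm information by their subsystem.

    Every subsystem that owns at least one alarming node device is
    treated as an abnormal subsystem.
    """
    abnormal = defaultdict(list)
    for node in alarming_nodes:
        abnormal[node_to_subsystem[node]].append(node)
    return dict(abnormal)


# Three node devices report alarms; node-4 (subsystem-3) does not.
abnormal_subsystems = find_abnormal_subsystems(
    ["node-1", "node-2", "node-3"], NODE_TO_SUBSYSTEM
)
```

In this sketch, a subsystem with no alarming node device never appears in the result, which matches the idea that only abnormal subsystems are carried forward to S203.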

S203. The control device executes first processing for each of the abnormal subsystems in the at least one abnormal subsystem, so as to obtain alarm information of the at least one abnormal subsystem.

Wherein, the first processing includes: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem.

The alarm information of the subsystem is used to indicate the abnormality of the subsystem. The embodiment of the present application does not limit the specific content of the alarm information of the subsystem, which can be configured according to actual requirements.

In a possible implementation manner, the alarm information of the subsystem may include at least one second alarm level and a second alarm quantity corresponding to the at least one second alarm level. For example, the alarm information of the subsystem may include: a critical (Critical) alarm with a corresponding quantity of 4; a major (Major) alarm with a corresponding quantity of 5; and a minor (Minor) alarm with a corresponding quantity of 6.

In a possible implementation manner, the alarm information of the subsystem may further include: an alarm type, an alarm time, and a node device, each corresponding one-to-one to the at least one second alarm level, and so on.

The control device executes the first processing for each abnormal subsystem in the at least one abnormal subsystem, converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem, thereby obtaining the alarm information of the at least one abnormal subsystem.

S204. For each abnormal subsystem in the at least one abnormal subsystem, the control device correspondingly displays the alarm information of the abnormal subsystem.

The embodiment of the present application does not specifically limit the manner of correspondingly displaying the abnormal subsystem and the alarm information of the abnormal subsystem, which may be configured according to actual requirements.

In a possible implementation manner, the alarm information may be displayed in the form of a diagram. Correspondingly, the implementation of S204 may include: displaying at least one abnormal subsystem in the data processing system in the diagram, and displaying the alarm information of each abnormal subsystem at a position corresponding to that abnormal subsystem. In this implementation manner, the display may be performed on the basis of an original system diagram (such as a flow chart or a structural diagram), or by creating a new diagram.

In another possible implementation manner, it may be displayed in a textual manner. Correspondingly, the implementation of S204 may include: correspondingly displaying at least one abnormal subsystem in the data processing system and alarm information of the abnormal subsystem in a textual manner.

The data processing scheme provided by the embodiment of the present application includes: obtaining the alarm information of at least two node devices included in the data processing system; determining at least one abnormal subsystem based on the at least two node devices; executing first processing for each abnormal subsystem in the at least one abnormal subsystem to obtain the alarm information of the at least one abnormal subsystem, the first processing including: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem; and, for each abnormal subsystem in the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem. This solution converges alarm information in the device dimension into alarm information in the subsystem dimension and displays it in the subsystem dimension. In this way, the displayed content clearly shows which subsystems are abnormal and which are normal, which improves the efficiency of locating the faulty subsystem.

Next, the process in S201 in which the control device acquires alarm information of at least two node devices included in the data processing system will be described. The process by which the control device acquires the alarm information is similar for each node device, so one node device is taken as an example. This process may include, but is not limited to, the following S2011 to S2013.

S2011. The control device processes the alarm log reported by the node device to obtain a keyword field representing an alarm prompt.

The control device cuts the information in the alarm log according to a cutting rule to obtain multiple cut fields, identifies the keyword field among the multiple cut fields, and extracts the content of the keyword field (the alarm prompt).

It should be noted that the cutting rules need to correspond to specific log specifications. For example: a log specification distinguishes various information (for example, including alarm device, alarm time, and alarm level) with spaces as intervals, and the cutting rule is to cut with spaces as boundaries to obtain alarm devices, alarm time, and alarm levels.
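The cutting step can be sketched as follows, assuming a log specification in which fields are separated by spaces. The log line layout, the trailing `prompt` field, and the function name `cut_alarm_line` are hypothetical; the present application only requires that the cutting rule match the log specification in use.

```python
def cut_alarm_line(line, field_names=("device", "time", "level", "prompt")):
    """Cut one alarm log line on spaces into named fields.

    The last field (assumed here to be the alarm prompt) keeps any
    remaining spaces, because the split is limited to the number of
    leading fields.
    """
    parts = line.split(" ", len(field_names) - 1)
    return dict(zip(field_names, parts))


# Hypothetical log line: alarm device, alarm time, alarm level, alarm prompt.
record = cut_alarm_line("8.8.8.8 2021-12-08T10:00:00 major JDBC exception")
keyword_field = record["prompt"]
```

A log specification that uses a different delimiter or field order would need a different cutting rule, for example a different `field_names` tuple or a regular expression.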

S2012. Based on the alarm prompt represented by the keyword field, the control device determines M first alarm levels for the node device and M first alarm quantities corresponding one-to-one to the M first alarm levels.

Wherein, M is greater than or equal to 1.

The control device sorts the alarm prompts in descending order of their number of occurrences in the log, takes the top K alarm prompts, determines the alarm level corresponding to each of these alarm prompts to obtain the M first alarm levels, and counts the first alarm quantity corresponding to each first alarm level.

Wherein, K is greater than or equal to M. That is, one alarm level can correspond to one or more types of alarm prompts. For example, the alarm prompt "JDBC exception" corresponds to a major alarm; the alarm prompt "memory exception" also corresponds to a major alarm.
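A minimal sketch of S2012 is given below. It assumes that the first alarm quantity of a level is the total number of occurrences of the top K alarm prompts mapped to that level; the table `PROMPT_TO_LEVEL`, the prompt "disk full", and the function `count_first_alarm_levels` are hypothetical.

```python
from collections import Counter

# Hypothetical table mapping alarm prompts to alarm levels. The first two
# entries follow the example in the text; "disk full" is made up.
PROMPT_TO_LEVEL = {
    "JDBC exception": "major",
    "memory exception": "major",
    "disk full": "critical",
}


def count_first_alarm_levels(prompts, k, prompt_to_level):
    """Keep the K most frequent alarm prompts and total them per level."""
    level_counts = Counter()
    for prompt, occurrences in Counter(prompts).most_common(k):
        level_counts[prompt_to_level[prompt]] += occurrences
    return dict(level_counts)


prompts = (
    ["JDBC exception"] * 3 + ["memory exception"] * 2 + ["disk full"]
)
# K = 2 keeps "JDBC exception" and "memory exception", both major, so M = 1.
level_counts = count_first_alarm_levels(prompts, 2, PROMPT_TO_LEVEL)
```

Here K = 2 and M = 1, which is consistent with K being greater than or equal to M when several alarm prompts share one alarm level.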

S2013. The control device updates a first alarm dictionary for the node device based on the M first alarm levels and the M first alarm quantities, to obtain alarm information of the node device.

The first alarm dictionary includes N first alarm levels and N first preset alarm quantities corresponding one-to-one to the N first alarm levels.

Wherein, N is greater than or equal to M.

The embodiment of the present application does not specifically limit the expression form of the first alarm dictionary, which may be configured according to actual requirements. Exemplarily, the first alarm dictionary can be configured as: {'minor': 0, 'major': 0, 'critical': 0}.

Exemplarily, the obtained alarm information of the node device (the container whose IP is 8.8.8.8) includes: {'minor': 0, 'major': 1, 'critical': 0}. Correspondingly, M is 1, the first alarm level is 'major', and the corresponding first alarm quantity is 1.

Based on the M first alarm levels and the first alarm quantities corresponding to the M first alarm levels, the control device updates the first alarm dictionary for the node device to obtain the alarm information of the node device.

In this way, since the form of the first alarm dictionary for the node device is preset, when obtaining the alarm information of the device, it is only necessary to update the alarm dictionary according to the specific alarm content, which is simple and clear.
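The update of the first alarm dictionary can be sketched as follows, using the three levels and the example values given in the text. The function name `build_node_alarm_info` and the error handling for unknown levels are assumptions.

```python
def build_node_alarm_info(level_counts, levels=("minor", "major", "critical")):
    """Start from the preset dictionary (all zeros) and fill in the
    quantities of the alarm levels actually observed for the node device."""
    alarm_info = {level: 0 for level in levels}
    for level, quantity in level_counts.items():
        if level not in alarm_info:
            # N >= M is assumed, so every observed level must be preset.
            raise KeyError(f"unknown alarm level: {level}")
        alarm_info[level] = quantity
    return alarm_info


# One major alarm observed on the node device (M = 1, N = 3).
node_alarm_info = build_node_alarm_info({"major": 1})
```

Because the dictionary always carries all N levels, node devices with different observed levels still produce alarm information of the same shape.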

Next, the process in S203 in which the control device converges the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem will be described. This process may include, but is not limited to, the following S2031 to S2033.

S2031. The control device obtains alarm information of at least one node device included in the abnormal subsystem.

Wherein, the alarm information of the node device includes M first alarm levels and M first alarm quantities corresponding one-to-one to the M first alarm levels.

The control device obtains the alarm information of the node devices included in the abnormal subsystem according to the identifiers of the node devices included in the abnormal subsystem. The abnormal subsystem may include one or more node devices.

S2032. Based on the M first alarm levels and the M first alarm quantities of each node device in the at least one node device, the control device determines P second alarm levels for the abnormal subsystem and P second alarm quantities corresponding one-to-one to the P second alarm levels.

Wherein, P is greater than or equal to M.

The control device determines the union of the M first alarm levels of each node device in the at least one node device as the P second alarm levels. For each second alarm level in the P second alarm levels, the control device sums, over the node devices in the at least one node device, the first alarm quantities of the first alarm level that is the same as the second alarm level, and uses the sum as the second alarm quantity corresponding to the second alarm level.

S2033. Based on the P second alarm levels and the P second alarm quantities, the control device updates a second alarm dictionary for the abnormal subsystem to obtain alarm information of the abnormal subsystem.

The second alarm dictionary includes Q second alarm levels and Q second preset alarm quantities corresponding one-to-one to the Q second alarm levels.

Wherein, Q is greater than or equal to P.

The embodiment of the present application does not specifically limit the expression form of the second alarm dictionary, which may be configured according to actual requirements. Exemplarily, the second alarm dictionary can be configured as: {'minor': 0, 'major': 0, 'critical': 0}.

For example, abnormal subsystem 1 includes two node devices, namely node device 1 and node device 2. The alarm information of node device 1 includes {'minor': 0, 'major': 1, 'critical': 0}; the alarm information of node device 2 includes {'minor': 0, 'major': 1, 'critical': 1}; and the alarm information of abnormal subsystem 1 after convergence includes {'minor': 0, 'major': 2, 'critical': 1}.

In this way, since the form of the second alarm dictionary for the subsystem is preset, when obtaining the alarm information of the subsystem, it is only necessary to update the second alarm dictionary according to the content of the alarm dictionary of each node device, that is, to update the alarm levels of the subsystem and the number of alarms under each alarm level. The implementation is simple and clear.
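The convergence in S2031 to S2033 can be sketched as a per-level sum over the node devices of one abnormal subsystem. The function name `converge_alarm_info` is hypothetical; the input values reproduce the example of node device 1 and node device 2 given in the text.

```python
def converge_alarm_info(node_alarm_infos, levels=("minor", "major", "critical")):
    """Converge per-node alarm dictionaries into one subsystem dictionary.

    The subsystem dictionary starts from the preset form (all zeros) and
    accumulates, level by level, the quantities of every node device.
    """
    subsystem_info = {level: 0 for level in levels}
    for node_info in node_alarm_infos:
        for level, quantity in node_info.items():
            subsystem_info[level] += quantity
    return subsystem_info


node_device_1 = {"minor": 0, "major": 1, "critical": 0}
node_device_2 = {"minor": 0, "major": 1, "critical": 1}
subsystem_1 = converge_alarm_info([node_device_1, node_device_2])
```

The result for abnormal subsystem 1 is {'minor': 0, 'major': 2, 'critical': 1}, matching the converged alarm information in the example.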

Next, the process in S204 in which the control device correspondingly displays the alarm information of the abnormal subsystem for each abnormal subsystem in the at least one abnormal subsystem will be described. This process may include, but is not limited to, the following Embodiment A1 or Embodiment A2.

Embodiment A1, displaying in the form of a diagram;

Embodiment A2, displaying in the form of text.

Embodiment A1 may include, but is not limited to, the following S2041 to S2043.

S2041. The control device obtains a transaction flow chart; the transaction flow chart shows the at least one abnormal subsystem.

In a possible implementation manner, S2041 may be implemented as: the control device draws the system flow involved in the transaction process with Graphviz to obtain the transaction flow chart.

In another possible implementation manner, the transaction flow chart is pre-drawn and stored at a fixed location, and S2041 may be implemented as: the control device acquires the transaction flow chart from the fixed location where it is stored.

S2042. For each abnormal subsystem in the at least one abnormal subsystem, the control device determines the subsystem node to which the abnormal subsystem belongs in the transaction flow chart.

For each abnormal subsystem in the at least one abnormal subsystem, the control device determines the subsystem node to which the abnormal subsystem belongs according to the node identifier of each subsystem in the transaction flow chart.

S2043. The control device displays the alarm information of the abnormal subsystem corresponding to the subsystem node.

The embodiment of the present application does not specifically limit the manner in which the control device displays the alarm information of the abnormal subsystem at the corresponding subsystem node, which may be configured according to actual requirements. For example, the alarm information of an abnormal subsystem may be displayed at a preset position (left, right, above, below, etc.) relative to its subsystem node.
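As one possible illustration of Embodiment A1, the sketch below generates Graphviz DOT source text for a transaction flow chart and annotates each abnormal subsystem node with its alarm information. It only builds a string and does not call any Graphviz binding; the subsystem names, the red node color, and the choice to place the alarm information inside the node label are assumptions, not requirements of the present application.

```python
def render_transaction_flow(edges, subsystem_alarms):
    """Return DOT source for the flow, with alarm info on abnormal nodes."""
    lines = ["digraph transaction_flow {", "    rankdir=LR;"]
    node_names = sorted({name for edge in edges for name in edge})
    for name in node_names:
        alarms = subsystem_alarms.get(name)
        if alarms:
            summary = ", ".join(f"{lvl}: {qty}" for lvl, qty in alarms.items())
            # "\n" inside a DOT label starts a new line within the node.
            lines.append(
                f'    "{name}" [label="{name}\\n{summary}", color="red"];'
            )
        else:
            lines.append(f'    "{name}";')
    for source, target in edges:
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical transaction flow: gateway -> payment -> ledger.
dot_source = render_transaction_flow(
    [("gateway", "payment"), ("payment", "ledger")],
    {"payment": {"minor": 0, "major": 2, "critical": 1}},
)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command-line tool, for example `dot -Tpng flow.dot -o flow.png`.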

Embodiment A2 may include: correspondingly displaying at least one abnormal subsystem in the data processing system and alarm information of the abnormal subsystem in a textual manner.

Adopting the embodiment A1, displaying the alarm information of the abnormal subsystem in the form of a graph can not only clarify the alarm information of each abnormal subsystem, but also clarify the logical relationship between the abnormal subsystem and other subsystems in the system.

Displaying in the form of text is easy to implement because of the small amount of data and good compatibility of the text information.
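For Embodiment A2, a text display can be as simple as one line per abnormal subsystem. The following sketch assumes a hypothetical line layout and function name `format_alarm_text`.

```python
def format_alarm_text(subsystem_alarms):
    """Render one text line per abnormal subsystem with its alarm info."""
    lines = []
    for name, alarms in subsystem_alarms.items():
        summary = ", ".join(f"{level}={qty}" for level, qty in alarms.items())
        lines.append(f"{name}: {summary}")
    return "\n".join(lines)


alarm_text = format_alarm_text(
    {
        "subsystem-1": {"minor": 0, "major": 2, "critical": 1},
        "subsystem-2": {"minor": 3, "major": 0, "critical": 0},
    }
)
```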

In the data processing method provided by the embodiment of the present application, after the abnormal subsystem is determined, further processing may be performed on the abnormal subsystem. The processing is similar for each abnormal subsystem, so one abnormal subsystem is taken as an example.

As shown in FIG. 3, the process may include, but is not limited to, the following S205 to S207.

S205. The control device determines that a node device meeting the first condition in the abnormal subsystem is a node device to be processed.

In a possible implementation manner, if only one node device in the abnormal subsystem is abnormal, that is, only one node device has alarm information, this node device is determined as the node device to be processed.

In another possible implementation manner, if multiple node devices in the abnormal subsystem are abnormal, that is, multiple node devices have alarm information, all of these node devices are determined as node devices to be processed.

S206. The control device judges whether the node device to be processed is faulty.

The embodiment of the present application does not specifically limit the manner of judging whether the node device to be processed is faulty, and may be configured according to actual requirements.

S207. When the node device to be processed fails, the control device sends a first instruction to the node device to be processed, so that the node device to be processed performs a corresponding operation under the instruction of the first instruction.

The corresponding operations may include: dumping memory, threads, and processes; restarting; isolating; auto-scaling; and so on.

In this way, in the case of a node device failure, the corresponding operation can be automatically executed, which improves the processing efficiency in the case of a failure.

Next, the process in S205, by which the control device determines that the node device satisfying the first condition in the abnormal subsystem is the node device to be processed, will be described. This process may include, but is not limited to, the following S2051 and S2052.

S2051. The control device determines the alarm score of each node device in the abnormal subsystem.

Exemplarily, the control device may determine the alarm score of the node device through the following formula (1).

F = Q(minor) × f1 + Q(major) × f2 + Q(critical) × f3    formula (1);

Among them, F indicates the alarm score of the node device; Q(minor) indicates the first alarm number corresponding to the minor alarm level; Q(major) indicates the first alarm number corresponding to the major alarm level; Q(critical) indicates the first alarm number corresponding to the critical alarm level; f1 indicates the alarm score corresponding to the minor alarm level; f2 indicates the alarm score corresponding to the major alarm level; f3 indicates the alarm score corresponding to the critical alarm level.

In one example, f1 may be 1 point, f2 may be 2 points, and f3 may be 4 points.
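Exemplarily, formula (1) with the example weights above may be sketched in Python as follows. This is only a minimal illustration; the function name alarm_score and the dictionary layout of the alarm counts are assumptions made for the sketch, not part of the embodiment.

```python
# Example weights from the text: f1 = 1 (minor), f2 = 2 (major), f3 = 4 (critical).
LEVEL_WEIGHTS = {"minor": 1, "major": 2, "critical": 4}


def alarm_score(alarm_counts, weights=LEVEL_WEIGHTS):
    """Return F = Q(minor)*f1 + Q(major)*f2 + Q(critical)*f3 for one node device.

    alarm_counts maps an alarm level to the number of alarms at that level
    (hypothetical layout); missing levels count as zero.
    """
    return sum(alarm_counts.get(level, 0) * weight for level, weight in weights.items())


# 3 minor, 1 major and 2 critical alarms: 3*1 + 1*2 + 2*4 = 13
print(alarm_score({"minor": 3, "major": 1, "critical": 2}))
```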

S2052. If the alarm score of the first node device is greater than or equal to the first score threshold, and the alarm score of the second node device is less than or equal to the second score threshold, the control device determines that the first node device is the node device to be processed.

Wherein, the first score threshold is greater than the second score threshold.

The first node device is any node device in the abnormal subsystem, and the second node device is a node device in the abnormal subsystem other than the first node device.

That is, if the alarm score of the first node device is greater than or equal to the first score threshold, and the alarm score of each second node device is less than or equal to the second score threshold, the control device determines that the first node device is the node device to be processed; if no node device in the abnormal subsystem satisfies this condition, it is determined that there is no node device to be processed.

Determining whether a node device is a node device to be processed by calculating its alarm score is simple and easy to implement.

Next, the process of determining whether the node device to be processed is faulty by the control device in S206 will be described. This process may include but not limited to the following S2061 to S2063.

S2061. The control device obtains at least one index of the node device to be processed.

The at least one indicator may include but not limited to at least one of the following: time delay, success rate, and transaction volume.

S2062. For each of the at least one indicator, the control device calculates a first distance between the indicator and a reference value of the indicator to obtain a judgment result of the indicator.

Wherein, if the first distance between the index and the reference value of the index is greater than or equal to the first distance threshold, the judgment result of the index is determined to be abnormal; if the first distance between the index and the reference value of the index is less than the first distance threshold, the judgment result of the index is determined to be normal.

The embodiment of the present application does not limit the specific value of the first distance threshold, which may be configured according to actual requirements.

In a possible implementation manner, an isolated forest judger is established for all indicators in at least one indicator, and a judgment result of each of all indicators is obtained by running the isolated forest judger.

In another possible implementation manner, for each index in at least one index, an isolated forest judger is established, and the judgment result of the index is obtained by running the isolated forest judger corresponding to the index.

S2062 may be implemented as: for each index in at least one index, the control device calculates a first distance between the index and a reference value of the index, and obtains a judgment result of the index.

The embodiment of the present application does not limit the specific value of the reference value of the index, which can be configured according to actual needs. For example, the reference value of the index can be determined according to the historical data of the index, or according to an empirical value.

The embodiment of the present application does not limit the specific manner of obtaining the judgment result, which can be configured according to actual needs. In a possible implementation manner, the judgment result of the index can be obtained according to the isolation forest judger.

Exemplarily, the control device inputs each index of the at least one index into the isolation forest judger and runs the isolation forest judger, which outputs a first value or a second value. If the first value is output, the judgment result for the index is abnormal; if the second value is output, the judgment result for the index is normal.

S2063. If the determination result for each of the at least one indicator is normal, the control device determines that the node device to be processed is not faulty; otherwise, determines that the node device to be processed is faulty.
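Exemplarily, S2062 and S2063 may be sketched as follows. The sketch assumes that the first distance is the absolute difference between the index and its reference value, which the embodiment does not specify; the function names, index names, reference values and thresholds are all hypothetical.

```python
def judge_indicator(value, reference, distance_threshold):
    """S2062: compare the distance to the reference value with the threshold.

    ASSUMPTION: the "first distance" is taken as the absolute difference.
    """
    distance = abs(value - reference)
    return "abnormal" if distance >= distance_threshold else "normal"


def is_faulty(indicators, references, thresholds):
    """S2063: the node device is faulty unless every index is judged normal."""
    results = {
        name: judge_indicator(value, references[name], thresholds[name])
        for name, value in indicators.items()
    }
    return any(result == "abnormal" for result in results.values()), results


# Hypothetical indexes of one node device to be processed.
faulty, results = is_faulty(
    indicators={"delay_ms": 2.0, "success_rate": 0.99},
    references={"delay_ms": 0.3, "success_rate": 0.98},
    thresholds={"delay_ms": 1.0, "success_rate": 0.05},
)
print(faulty, results)
```

In this example the delay is 1.7 ms away from its reference value, which exceeds the assumed threshold of 1.0 ms, so the node device is judged faulty even though the success rate is normal.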

Judging whether the node equipment to be processed is faulty by means of the isolated forest judger is simple to implement and high in processing efficiency.

Optionally, after S2063, if the judgment result is abnormal, the data processing method provided by the embodiment of the present application can also correct the judgment result to improve the accuracy of the judgment. The correction process may include, but is not limited to, the following embodiment B1 and embodiment B2.

Embodiment B1, correcting by the reference value of each index;

Embodiment B2, the correction is performed through the normal range of each index.

Embodiment B1 may include: when the indicator includes time delay, if the time delay of the node device to be processed is less than or equal to the time delay reference value, the control device modifies the judgment result of the time delay to be normal.

In the case that the indicator includes a success rate, if the success rate of the node device to be processed is greater than or equal to the success rate reference value, the judgment result for the success rate is modified to be normal.

The embodiment of the present application does not limit the specific values of the delay reference value and the success rate reference value, which may be configured according to actual requirements.

In an example, the delay reference value and the success rate reference value may be determined according to historical data or experience values.

With embodiment B1, the judgment result is corrected through the reference value of each index, so the correction can follow the specific characteristics of each index, which provides high accuracy and high flexibility.

Embodiment B2 may include: the control device determines the normal range of the index based on historical normal index data; the control device judges whether the index of the node device to be processed belongs to the normal range; if the index of the node device to be processed belongs to the normal range, the control device modifies the judgment result as normal.

The embodiment of the present application does not specifically limit the manner of determining the normal range, which may be configured according to actual requirements.

In an example, a normal range for an indicator can be determined based on historical data.

For each index, the control device determines a normal range for each index.

With embodiment B2, the judgment result is corrected through the normal range of each index. Since the normal range of the index is determined according to historical normal index data, the adaptability is strong.

In the following, the data processing method provided by the embodiment of the present application will be described by taking the node device as an instance device (also referred to as an instance or monitored device) and the control device as a monitoring device as an example.

In order to facilitate the understanding of the following embodiments, some technical terms are briefly explained.

Dump can be used to save relevant environment information and generate dump files. For example, it can be used to dump information such as memory, threads, and processes.

Isolation forest refers to an algorithm for outlier detection.

The golden index refers to an index that affects the reliability of the system. For example, golden indexes can include: success rate, latency, transaction volume, etc.

Sklearn (scikit-learn) refers to a Python library for machine learning.

Graphviz refers to a drawing tool.

In related technologies, Zabbix and open-falcon are generally used to monitor the system. Specifically, an agent process is deployed on each monitored device (instance); the monitored device collects alarm information through the agent and reports the alarm information to a proxy through Zabbix or open-falcon; the proxy reports the alarm information to the monitoring device (server) for summary, and the monitoring device then displays the alarm information in the instance dimension.

The monitoring device uses a single algorithm (such as standard deviation, decision tree, autoregressive integrated moving average model (ARIMA), or long short-term memory artificial neural network (Long Short-Term Memory, LSTM)) or a combination of multiple algorithms (such as using ensemble learning to form a voter that follows the majority rule) to check the indexes and determine the faulty device.

In practice, however, there are the following problems. First, a system generally includes multiple subsystems. The monitoring device does not know the logical relationship between the subsystems and cannot summarize and count the alarm information of each subsystem; it only lists all alarm information one by one. In this way, an alarm storm is easily generated, which affects the judgment of the operation and maintenance engineer.

Second, even if a single algorithm or a combination of multiple algorithms is used to check outliers, without further correction it is easy to cause misjudgment, which consumes a lot of effort.

Third, for a device determined to be faulty, the monitoring device only plays the role of notification, and some simple faults, such as single instance faults, are not automatically processed, which affects the availability of the system.

The detection method provided in the embodiment of the present application can overcome the above-mentioned problems from the first point to the third point. As shown in FIG. 4 , it may specifically include but not limited to the following stages 1 to 3.

Stage 1. Collect alarm information and draw an alarm link diagram.

Stage 2. For the abnormal subsystem, determine the first instance in the abnormal subsystem, perform abnormal detection, and determine whether the first instance is faulty.

The first instance can also be called a single-instance failure instance, which is equivalent to the above-mentioned node device to be processed. Since it is impossible to finally confirm from the alarm information alone whether the instance is really faulty, it needs to be judged again. This application performs secondary checks based on the isolation forest judger, human experience, and historical data to increase the accuracy of fault identification.

Stage 3. Send the detection result of the first instance to the corresponding executor, and the executor performs the corresponding operation on the first instance.

The corresponding operations may include: saving the state, isolating traffic, restarting, etc.

Generally speaking, on the one hand, the collected alarm information of each instance is converged to obtain the alarm information of the subsystem, and the alarm information of the subsystem is displayed; on the other hand, the collected alarm information and index information of each instance are used to perform anomaly detection on the first instance, and if the first instance is determined to be a faulty instance, corresponding operations are performed on the faulty instance by calling the executor.

In the following, the process of collecting alarm information and drawing an alarm link diagram in stage 1 will be described in detail.

When there is a problem with a subsystem, an alarm storm occurs very easily and floods the monitoring system, so that it is difficult to find the faulty device.

Therefore, this application first converges and summarizes the alarm information, and uses python3+graphviz to draw the alarm link diagram for the converged alarm information. In this way, it can be clearly seen which subsystem has an alarm.

Stage 1 may specifically include but not limited to the following steps A1 to A5.

Step A1, collecting alarm information of each instance.

Exemplarily, the alarm information of each instance that has existed in the last two hours is collected.

Alarm levels may include: notification (Info) alarm, warning (Warning) alarm, minor (Minor) alarm, major (Major) alarm, and critical (Critical) alarm. Among them, the alarm levels of Info alarms and Warning alarms are relatively low, so this application does not process Info alarms and Warning alarms, and only processes Minor alarms, Major alarms, and Critical alarms.

Specifically, the monitoring device cuts the information in the alarm log, extracts the keyword (alarm type) in each piece of information, sorts the alarm types by the number of occurrences in the log from large to small, determines the top three alarm types, and obtains the alarm levels of these three alarm types.

It should be noted that the cutting rule needs to correspond to the specific log specification. For example, if a log specification separates the pieces of information (for example, alarm device, alarm time, alarm level, and alarm type) with spaces, the cutting rule is to cut at the spaces to obtain the information corresponding to the alarm device, alarm time, alarm level, and alarm type.

Take the instance as the dimension, create a dictionary A of the instance, and initialize the number of times.

For example, dictionary A: {'minor': 0, 'major': 0, 'critical': 0}.

Example 1: Alarm statistics of instance 1 (container with IP 8.8.8.8): {'minor': 0, 'major': 1, 'critical': 0}. Correspondingly, keyword: ('JDBC exception': 1); wherein, JDBC exception indicates database exception, and 1 indicates that the number of exceptions is 1.

Alarm statistics of instance 2 (virtual machine with ip 7.7.7.7): {'minor': 0, 'major': 1, 'critical': 0}. Correspondingly, keyword: ('JDBC exception': 1).
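Exemplarily, step A1 may be sketched as follows, assuming a log specification in which each line contains the alarm device, alarm time, alarm level and alarm type separated by spaces. The sample log lines, the time format and the function name build_instance_dictionaries are hypothetical and only serve the illustration.

```python
from collections import Counter

# Only Minor, Major and Critical alarms are processed; Info and Warning are skipped.
HANDLED_LEVELS = ("minor", "major", "critical")


def build_instance_dictionaries(log_lines):
    """Cut each log line at spaces and count alarms per instance.

    Returns (level_counts, keywords): level_counts maps an instance to its
    dictionary A, keywords maps an instance to a Counter of alarm types.
    """
    level_counts = {}
    keywords = {}
    for line in log_lines:
        # The first three fields are cut at spaces; the rest is the alarm type.
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue  # line does not follow the assumed log specification
        device, _alarm_time, level, alarm_type = parts
        level = level.lower()
        if level not in HANDLED_LEVELS:
            continue
        counts = level_counts.setdefault(device, dict.fromkeys(HANDLED_LEVELS, 0))
        counts[level] += 1
        keywords.setdefault(device, Counter())[alarm_type.strip()] += 1
    return level_counts, keywords


log_lines = [
    "8.8.8.8 2021-12-08T10:00:00 major JDBC exception",
    "7.7.7.7 2021-12-08T10:00:05 major JDBC exception",
    "7.7.7.7 2021-12-08T10:00:09 info heartbeat",
]
level_counts, keywords = build_instance_dictionaries(log_lines)
print(level_counts["8.8.8.8"])             # {'minor': 0, 'major': 1, 'critical': 0}
print(keywords["8.8.8.8"].most_common(3))  # [('JDBC exception', 1)]
```

Counter.most_common(3) gives the top three alarm types of an instance, sorted by the number of occurrences from large to small.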

Step A2: Count the alarm information of each instance using a prescribed counting method to obtain the alarm information of each subsystem.

Take the subsystem as the dimension, form a global dictionary B, pull all the alarms of all instances in the subsystem, and traverse them in a loop. For example, if there is a major alarm in instance 1 of subsystem A, the alarm is matched to subsystem A by its field, the 'major' entry in the secondary dictionary of instance 1 is found and increased by 1, and at the same time the 'total' entry of subsystem A is increased by 1. The alarms of all instances in subsystem A are traversed to obtain the alarms of subsystem A.

Based on example 1, example 2: subsystem_A: { total count: 2; instance 1: 'minor': 0, 'major': 1, 'critical': 0; instance 2: 'minor': 0, 'major': 1, 'critical': 0}. Correspondingly, the keyword 'JDBC exception'.

Based on Example 1, Example 3: Subsystem_A: {Total Quantity: 2; Instance 1: 1, Instance 2: 1}. Correspondingly, the keyword 'JDBC exception'.
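Exemplarily, the convergence in step A2 may be sketched as follows. The mapping from instance to subsystem, the key names 'total' and 'instances', and the function name converge_subsystem_alarms are assumptions made for the sketch.

```python
def converge_subsystem_alarms(instance_to_subsystem, instance_alarms):
    """Converge per-instance alarm dictionaries into global dictionary B.

    instance_to_subsystem: hypothetical mapping of instance name -> subsystem name.
    instance_alarms: instance name -> {'minor': n, 'major': n, 'critical': n}.
    """
    global_dict = {}
    for instance, counts in instance_alarms.items():
        subsystem = instance_to_subsystem[instance]
        entry = global_dict.setdefault(subsystem, {"total": 0, "instances": {}})
        entry["instances"][instance] = dict(counts)  # secondary dictionary
        entry["total"] += sum(counts.values())       # subsystem-wide total
    return global_dict


dictionary_b = converge_subsystem_alarms(
    {"instance 1": "subsystem_A", "instance 2": "subsystem_A"},
    {
        "instance 1": {"minor": 0, "major": 1, "critical": 0},
        "instance 2": {"minor": 0, "major": 1, "critical": 0},
    },
)
print(dictionary_b["subsystem_A"]["total"])  # 2
```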

Step A3, obtaining the flow chart of the transaction.

The flow chart of each transaction needs to be written according to the logical relationship between the various subsystems in the call transaction process of the transaction.

As shown in Figure 5, the process of the transaction includes: a request passes from the layer-4 Linux Virtual Server (LVS) subsystem, which serves as the load balancing subsystem, to the layer-7 proxy service (Nginx) subsystem, and then to the internal application programming interface (Application Programming Interface, API) gateway layer, namely the Backend For Frontend (BFF) subsystem; the BFF subsystem forwards the request to the UM interface subsystem, and the UM interface subsystem sends the request to the ACL subsystem or the whitelist subsystem.

It is understandable that when a subsystem fails, it may cause a large number of subsystems to generate alarms, resulting in an alarm storm. Therefore, it is necessary to quickly locate the corresponding fault cause from the flow chart.

Exemplarily, the flow chart can be drawn by means of graphviz drawing.

Step A4, display the alarm information of each subsystem in the flow chart, and obtain the alarm link diagram.

According to the name of the subsystem, the alarm information of each subsystem is shown in the flow chart, and the alarm link diagram is obtained.
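Exemplarily, steps A3 and A4 may be sketched by generating Graphviz DOT text directly in Python. The edges follow the transaction process described for Figure 5; the alarm totals, the label layout, the red highlighting and the function name alarm_link_dot are assumptions of the sketch, not the drawing style of the embodiment.

```python
def alarm_link_dot(edges, subsystem_alarms):
    """Return DOT text for a flow chart annotated with subsystem alarm totals.

    edges: list of (upstream, downstream) subsystem names.
    subsystem_alarms: subsystem name -> {'total': n} (hypothetical layout).
    """
    nodes = []
    for src, dst in edges:
        for name in (src, dst):
            if name not in nodes:
                nodes.append(name)

    lines = ["digraph alarm_link {", "  rankdir=LR;"]
    for name in nodes:
        total = subsystem_alarms.get(name, {}).get("total", 0)
        if total:
            # "\n" is written literally so that Graphviz breaks the label line.
            lines.append(f'  "{name}" [label="{name}\\nalarms: {total}", color=red];')
        else:
            lines.append(f'  "{name}";')
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


edges = [("LVS", "Nginx"), ("Nginx", "BFF"), ("BFF", "UM"), ("UM", "ACL"), ("UM", "whitelist")]
dot_text = alarm_link_dot(edges, {"UM": {"total": 2}})
print(dot_text)
```

The resulting text can be saved to a file and rendered with the Graphviz command line tool, for example dot -Tpng link.dot -o link.png.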

For the situation where there are multiple key links and flow charts corresponding to multiple key links, the flow charts of multiple links can also be processed together as follows:

A core subsystem may have flow charts corresponding to multiple links.

Each key link has one or more core subsystems. When there is a problem with the upstream and downstream systems, the core subsystem will also have a problem and generate an alarm. For example, in Figure 5, UM is a core subsystem.

Among them, the key link may be a link with a large function as the dimension. Such as login link, authorization link, SMS sending link, etc.

Specifically, this includes: sub-step 1, use all core subsystems of a link as a key of a dictionary; when all core subsystems of the link have alarms, select the link and enter sub-step 2.

Sub-step 2. After sub-step 1 is executed, one or more links may be obtained, and the non-core subsystems appearing in these links are matched against the alarms. When a certain non-core subsystem has an alarm, the link containing it is selected; if the non-core subsystem appears in multiple links, multiple links are matched. The alarm summary result of each subsystem obtained by summarizing the alarms (equivalent to the alarm information of the abnormal subsystem) is written into a dictionary and appended after the corresponding subsystem, and then the alarm link diagram shown in Figure 6 is generated.

Step A5. Send the alarm link diagram to the alarm group.

The picture is generated through Graphviz and read with Python, the picture is converted into Base64 format (a representation of binary data based on 64 printable characters), and the message interface of the robot is called to send the alarm to the alarm group.
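Exemplarily, the Base64 conversion in step A5 may be sketched as follows. The message interface of the robot is not described in the embodiment, so the payload field names below are purely hypothetical, and the sketch stops at building the message body without sending it.

```python
import base64
import json


def build_image_message(image_bytes):
    """Encode picture bytes as Base64 text and wrap them in a JSON message body.

    ASSUMPTION: the field names 'msgtype', 'image' and 'base64' are hypothetical;
    the real message interface of the robot may expect a different layout.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"msgtype": "image", "image": {"base64": encoded}})


# In practice the bytes would be read from the picture generated by Graphviz,
# for example with open("link.png", "rb"); here only the PNG signature is used.
payload = build_image_message(b"\x89PNG\r\n\x1a\n")
print(payload)
```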

Next, in stage 2, anomaly detection is performed on the first instance to determine whether the first instance is faulty. Stage 2 may specifically include, but is not limited to, the following steps B1 to B4.

Step B1. Determine the first instance.

The failure score of each instance is calculated, and the instance satisfying the preset condition is determined as the first instance.

For example, when a minor-level alarm is worth 1 point, a major-level alarm is worth 2 points, and a critical-level alarm is worth 4 points, the score of an instance = number of minor alarms × 1 + number of major alarms × 2 + number of critical alarms × 4.

The preset condition is: if the score of one instance in a subsystem is greater than or equal to 4 points, and the scores of all other instances are less than or equal to 1 point, the instance whose score is greater than or equal to 4 points is determined to be the first instance.
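Exemplarily, the preset condition of step B1 may be sketched as follows, with the thresholds of 4 points and 1 point taken from the example above. The function name find_first_instance and the instance names are hypothetical.

```python
def find_first_instance(scores, high_threshold=4, low_threshold=1):
    """Return the single instance that meets the preset condition, or None.

    scores maps an instance name to its alarm score. An instance qualifies when
    its score is >= high_threshold and every other instance is <= low_threshold.
    Because high_threshold > low_threshold, at most one instance can qualify.
    """
    for name, score in scores.items():
        others_quiet = all(
            other_score <= low_threshold
            for other_name, other_score in scores.items()
            if other_name != name
        )
        if score >= high_threshold and others_quiet:
            return name
    return None


print(find_first_instance({"instance 1": 4, "instance 2": 1, "instance 3": 0}))  # instance 1
print(find_first_instance({"instance 1": 4, "instance 2": 2}))                   # None
```

In the second call no first instance is found, because instance 2 scores more than 1 point, so the fault is not treated as a single-instance fault.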

Step B2. Determine whether the first instance is abnormal based on the isolation forest judger.

First, obtain the golden index of the first instance, classify the indexes, and obtain the transaction volume, time delay, and success rate of all instances.

For each indicator, a corresponding isolation forest judger model is trained.

The premise of the isolation forest judger is that all abnormalities are minority.

The principle of the isolation forest judger is to find a small number of outliers in a mathematical way.

Next, taking an index (time delay) as an example, the process of obtaining the isolation forest judger will be described.

The input data type of the isolation forest judger: continuous business indicator data

Training method: unsupervised learning, the result can be obtained by substituting data. But in order to adjust the model parameters, some labeled data is needed for verification.

Abnormal definition: Points that are easy to be isolated are points that are relatively sparsely distributed and far away from groups with higher density.

Practice: Python3+sklearn library.

The specific training process can include:

Obtain training data (delay index data of 1 month), select 80% of the data as the training set and 20% of the data as the test set, and label each piece of data as normal or abnormal. For example, a delay of 0.3 ms is normal, and a delay of 2 ms is abnormal.

Through model = IsolationForest(n_estimators, max_samples), use the IsolationForest class of Sklearn to build an isolation forest judger model (model), take 80% of the data X as the training set, and substitute it into the training function model.fit(X) for training, to obtain the preliminarily trained isolation forest judger.

After the model is trained with the training set, it is evaluated with the labeled test set (20% of the data), and the model score of the preliminarily trained isolation forest judger is obtained.

For example, the model score of the isolation forest judger can be calculated by the following formula (2).

Among them, the first number is the number of data correctly judged as abnormal; the second number is the number of data correctly judged as normal; both the normal score and the abnormal score are 50.

Shuffle the order of the training data while keeping the numbers of normal and abnormal data consistent (for example, 10,000 normal and 10 abnormal), and modify the parameters n_estimators (number of subtrees) and max_depth (maximum growth depth of the tree), with the other parameters left at their defaults, so as to obtain multiple isolation forest judger models and finally a score array.

Among them, the reason for choosing these two parameters is that these two parameters have the greatest influence on the model. n_estimators, the selection range is (3-23), and max_depth selection (5-25). First fix max_depth to 5, then traverse n_estimators, then fix n_estimators to 3, traverse max_depth.

Finally, the model with the highest score is selected as the model used.

Among them, the training data is a two-dimensional array. For example, the success rate data is [[0.88], [0.90], [0.88], [0.99]]. The result labels of the data form a one-dimensional array that corresponds to the data in order, for example [-1, 1, 1, -1], where -1 means abnormal and 1 means normal. If 0.88 corresponds to -1, it is abnormal data.

To put it simply, through model = IsolationForest(n_estimators, max_samples), use the IsolationForest class of Sklearn to build an isolation forest judger model (model), take 80% of the data X as the training set and substitute it into the training function model.fit(X) for training, then substitute 20% of the data Y into the prediction function model.predict(Y) for prediction, compare the prediction results with the labeled results, select the values of the two parameters that give the highest model score, and save them.
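Exemplarily, the training and parameter selection may be sketched with scikit-learn as follows. Several points are assumptions of the sketch: formula (2) is not reproduced in this text, so model_score below is only one possible scoring that gives 50 points to the share of correctly judged abnormal data and 50 points to the share of correctly judged normal data; the second swept parameter is max_samples, because the scikit-learn IsolationForest constructor accepts max_samples and does not accept max_depth; and the delay data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def model_score(y_true, y_pred, abnormal_points=50, normal_points=50):
    """ASSUMED scoring (formula (2) is not available): 50 points for the share
    of abnormal data judged abnormal plus 50 for the share of normal data
    judged normal. Labels follow the text: -1 abnormal, 1 normal."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    abnormal, normal = y_true == -1, y_true == 1
    hit_abnormal = np.sum(y_pred[abnormal] == -1) / max(abnormal.sum(), 1)
    hit_normal = np.sum(y_pred[normal] == 1) / max(normal.sum(), 1)
    return float(abnormal_points * hit_abnormal + normal_points * hit_normal)


def search_parameters(X_train, X_test, y_test,
                      n_estimators_range=range(3, 24),
                      max_samples_range=range(5, 26)):
    """Fix one parameter at its smallest value, traverse the other, then swap."""
    candidates = [(n, max_samples_range[0]) for n in n_estimators_range]
    candidates += [(n_estimators_range[0], m) for m in max_samples_range]
    best = None
    for n_estimators, max_samples in candidates:
        model = IsolationForest(n_estimators=n_estimators,
                                max_samples=max_samples, random_state=0)
        model.fit(X_train)  # unsupervised: labels are used only for scoring
        score = model_score(y_test, model.predict(X_test))
        if best is None or score > best[0]:
            best = (score, n_estimators, max_samples)
    return best


# Synthetic delay data (ms): mostly around 0.3, a few abnormal values around 2.
rng = np.random.default_rng(0)
delays = np.concatenate([rng.normal(0.3, 0.02, 500), rng.normal(2.0, 0.1, 5)])
labels = np.concatenate([np.ones(500, dtype=int), -np.ones(5, dtype=int)])
order = rng.permutation(len(delays))          # shuffle data and labels together
X, y = delays[order].reshape(-1, 1), labels[order]
split = int(0.8 * len(X))                     # 80% training set, 20% test set
best_score, best_n_estimators, best_max_samples = search_parameters(
    X[:split], X[split:], y[split:])
print(best_score, best_n_estimators, best_max_samples)
```

The selected n_estimators and max_samples values are what would be saved for the later outlier check.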

Outlier check of the first instance at the current point in time:

1. Take the n_estimators and max_samples parameters saved in the training phase, and substitute these parameters into the IsolationForest function to establish the isolation forest judger model.

2. According to the above structure, generate a two-dimensional array from the data of the last two days, and substitute it into the model training function model.fit to train the corresponding isolation forest judger model.

3. The data of the first instance at the current time point is also built into a two-dimensional array and substituted into the prediction function (model.predict) of the established isolation forest judger model; running the isolation forest judger model yields a judgment result.

4. If the judgment result is -1, it means abnormal; if the judgment result is 1, it means normal.
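Exemplarily, the four steps above may be sketched with scikit-learn as follows. The parameter values, the synthetic two-day history and the function name judge_current_value are assumptions of the sketch; in practice the parameters would be the ones saved in the training phase.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def judge_current_value(history, current_value, n_estimators, max_samples):
    """Fit an isolation forest on recent data and judge the current value.

    Returns -1 (abnormal) or 1 (normal), as produced by model.predict.
    """
    # Steps 1 and 2: build the model with the saved parameters and train it
    # on the recent data, reshaped into a two-dimensional array.
    X = np.asarray(history, dtype=float).reshape(-1, 1)
    model = IsolationForest(n_estimators=n_estimators,
                            max_samples=max_samples, random_state=0)
    model.fit(X)
    # Step 3: the current value is also passed as a two-dimensional array.
    current = np.array([[current_value]], dtype=float)
    # Step 4: -1 means abnormal, 1 means normal.
    return int(model.predict(current)[0])


# Hypothetical delay history (ms) of the last two days and saved parameters.
rng = np.random.default_rng(1)
history = rng.normal(0.3, 0.02, 200)
result = judge_current_value(history, current_value=2.0,
                             n_estimators=20, max_samples=25)
print("abnormal" if result == -1 else "normal")
```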

Step B3. Determine whether the first instance is faulty based on empirical information.

The B3 process can also be called an anomaly correction process, that is, to correct the judgment result of the isolated forest judger.

Reason for correcting abnormalities: the isolation forest judger assumes that abnormalities are a minority. When most instances are abnormal and only a few are normal, the normal instances will be judged as abnormal. In this case, a secondary correction of the abnormality is required.

Problem solved: when most indicators are abnormal and a few are normal, misjudgment occurs, that is, normal indicators are judged as abnormal, which leads to misoperation.

Specific correction procedures may include:

For the delay index, the lower the delay, the lower the possibility of abnormality. Therefore, when the isolation forest judger judges that the delay index of instance A is abnormal, but its delay is lower than the average delay of the other instances, the result is modified and the delay index of instance A is judged to be normal.

For the success rate, the higher the success rate, the lower the possibility of abnormality. Therefore, when the isolation forest judger judges that the success rate of instance B is abnormal, but its success rate is higher than the average success rate of the other instances, the result is modified and instance B is judged to be a normal instance.
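Exemplarily, the correction of step B3 may be sketched as follows, on the understanding that a lower delay and a higher success rate both make an abnormality less likely. The indicator names, the sample values and the function name correct_judgment are hypothetical.

```python
from statistics import mean


def correct_judgment(indicator, value, peer_values, result):
    """Correct an 'abnormal' result (-1) from the isolation forest judger.

    indicator: 'delay' or 'success_rate' (hypothetical names).
    peer_values: values of the same indicator on the other instances.
    Returns 1 (normal) when the value is better than the peer average,
    otherwise the original result.
    """
    if result != -1 or not peer_values:
        return result
    peer_average = mean(peer_values)
    if indicator == "delay" and value < peer_average:
        return 1  # lower delay than the other instances: treat as normal
    if indicator == "success_rate" and value > peer_average:
        return 1  # higher success rate than the other instances: treat as normal
    return result


print(correct_judgment("delay", 0.2, [0.3, 0.4], -1))            # 1
print(correct_judgment("success_rate", 0.80, [0.99, 0.98], -1))  # -1
```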

Step B4. Determine whether the first instance is faulty based on historical data.

The B4 process can also be called the historical data verification process, that is, the correction result of B3 is verified.

The purpose of verification: to reduce useless operations and alarms.

Assuming that the indicator C of instance A is abnormal, enter the verification and select the historical data of the last week to see if there are any abnormal values.

First judgment: if abnormal data already exists in the history, judge whether indicator C is more severe than the historical abnormal value. For example, if indicator C is a success rate of 80%, and 95% was an abnormal point in the history, then 80% must be an abnormal point.

Second judgment: remove the 5 highest points and the 5 lowest points from the historical data, and then calculate the maximum and minimum values of the remaining points; if the abnormal point lies between the maximum and minimum values, change the abnormal result to normal.
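Exemplarily, the two judgments of step B4 may be sketched as follows. The sketch assumes that "more severe" means at or beyond the least severe historical abnormal value in the unfavorable direction, which the embodiment does not define precisely; the function name, the parameter names and the sample history are hypothetical.

```python
def verify_with_history(value, history, historical_abnormal=(),
                        higher_is_better=True, trim=5):
    """Return -1 (abnormal confirmed) or 1 (changed to normal).

    value: current value of the indicator that was judged abnormal.
    history: values of the last week.
    historical_abnormal: values known to have been abnormal in the history.
    higher_is_better: True for success rate, False for delay.
    """
    # First judgment: at least as severe as a known historical abnormal value.
    # ASSUMPTION: compared against the least severe historical abnormal value.
    if historical_abnormal:
        if higher_is_better and value <= max(historical_abnormal):
            return -1
        if not higher_is_better and value >= min(historical_abnormal):
            return -1

    # Second judgment: drop the `trim` highest and lowest points, then check
    # whether the value lies within the remaining range.
    ordered = sorted(history)
    trimmed = ordered[trim:len(ordered) - trim]
    if trimmed and min(trimmed) <= value <= max(trimmed):
        return 1
    return -1


# Hypothetical success rates of the last week.
history = [0.97 + 0.001 * i for i in range(20)]
print(verify_with_history(0.80, history, historical_abnormal=[0.95]))  # -1
print(verify_with_history(0.98, history))                              # 1
```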

Generally speaking, as shown in Figure 7, anomaly detection is divided into three steps. The first step is to use the isolation forest judger to make a preliminary judgment. The second step is to revise the judgment result of the first step according to the existing empirical rules. The third step is to reconfirm the judgment result of the second step based on the historical data.

In the following, the process of sending the detection result of the first instance to the corresponding executor in stage 3, and the executor performs the corresponding operation on the first instance will be described.

If the detection result of the first instance is abnormal, the detection result of the first instance is sent to the monitored device to which the first instance belongs, and the corresponding operation is executed through the executor deployed on the monitored device.

Operations include: dumping memory, threads, and processes; restarting; isolating; automatic expansion; etc.

Set the operation rule table: For faults caused by abnormal success rate, perform dump and isolation operations. For faults caused by abnormal delay, perform dump and isolation operations. For faults caused by abnormal transaction volume, an alarm is issued.

It should be noted that during isolation, it is necessary to ensure that 2/3 of the instances in the subsystem are available.
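Exemplarily, the operation rule table and the availability check may be sketched as follows. The indicator names, the operation names and the fallback of raising an alarm when isolation is not allowed are assumptions of the sketch; the embodiment only requires that 2/3 of the instances in the subsystem remain available.

```python
# Operation rule table from the text; the keys and operation names are hypothetical.
OPERATION_RULES = {
    "success_rate": ["dump", "isolate"],
    "delay": ["dump", "isolate"],
    "transaction_volume": ["alarm"],
}


def can_isolate(total_instances, already_isolated=0):
    """True if isolating one more instance keeps at least 2/3 available."""
    remaining = total_instances - already_isolated - 1
    # Integer form of remaining / total_instances >= 2 / 3.
    return 3 * remaining >= 2 * total_instances


def plan_operations(abnormal_indicator, total_instances, already_isolated=0):
    """Look up the operations for a fault and enforce the availability check."""
    operations = list(OPERATION_RULES[abnormal_indicator])
    if "isolate" in operations and not can_isolate(total_instances, already_isolated):
        # ASSUMPTION: when isolation is not allowed, raise an alarm instead.
        operations = [op for op in operations if op != "isolate"] + ["alarm"]
    return operations


print(plan_operations("delay", total_instances=3))               # ['dump', 'isolate']
print(plan_operations("delay", total_instances=2))               # ['dump', 'alarm']
print(plan_operations("transaction_volume", total_instances=3))  # ['alarm']
```

With 3 instances, isolating one leaves exactly 2/3 available, so isolation is allowed; with 2 instances it would leave only 1/2, so isolation is skipped.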

Next, the processing procedure of the executor will be described.

Among them, an agent is written in the Go language and deployed on the monitored device (such as a virtual machine or container) as a client process, which reduces the logic in the business code. If the monitored device is a container, as shown in Figure 8, the sidecar method is used: in a task (pod), an agent container is started, which can be called a sidecar container.

Among them, the role of the executor is as follows:

Collect error information from instance logs.

Collect business metrics from the instance's logs.

Collect corresponding basic information and various indicator information (memory, CPU, etc.), and then summarize and report the basic information.

Receive various instructions to perform specific operations on the subsystem. For example, isolate, restart processes, and dump memory, threads, and processes.

For example, if the instruction received is to isolate, dump the memory, threads, and processes of the instance, and then isolate.

For example, if the received instruction indicates that the memory usage is too high, dump the memory, threads, and processes of the instance, and then restart.

When the indicator information is collected, specific processing is performed on the indicator. For example, when a large number of failures occur in a certain instance of a system, it must be isolated. When the business indicators of an instance are abnormal, it can also be isolated.

If the success rate is abnormal and the time delay is abnormal, perform the dump operation directly, and then isolate it. After isolation, it is necessary to analyze whether the resources are insufficient and perform automatic expansion and contraction.

The data processing method provided by the embodiment of the present application has the following beneficial effects:

1. Through the flow chart and the alarm dictionary, an alarm link diagram is generated, which vividly shows the number of alarm levels, alarm content, and upstream and downstream conditions of the subsystem, so as to assist in quickly locating the fault point.

2. Preliminarily find out the abnormal instances from the alarms, and substitute them into our abnormal point detection. On the one hand, screen out the abnormal instances according to the alarms, and then use the algorithm to judge again. By providing targeted exception instances, it is not necessary to substitute all the instance data into the algorithm, which reduces the amount of computation of the algorithm and saves costs; on the other hand, the accuracy of the algorithm is improved by using the correction of exceptions and the inspection of historical data.

3. By modifying the monitoring agent to add a command execution function, operations such as saving the instance environment and isolating traffic become more accurate and faster.

In order to implement the above data processing method, a data processing device according to an embodiment of the present application will be described below with reference to the schematic structural diagram of the data processing device shown in FIG. 9 .

As shown in FIG. 9, the data processing device 90 includes an acquisition unit 901, a determination unit 902, a processing unit 903, and a presentation unit 904, wherein:

The acquisition unit 901 is configured to obtain alarm information of at least two node devices included in the data processing system;

The determination unit 902 is configured to determine at least one abnormal subsystem based on the at least two node devices;

The processing unit 903 is configured to execute first processing for each abnormal subsystem in the at least one abnormal subsystem, so as to obtain the alarm information of the at least one abnormal subsystem; the first processing includes: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem;

The presentation unit 904 is configured to correspondingly display the alarm information of the abnormal subsystem for each abnormal subsystem in the at least one abnormal subsystem.

In some embodiments, the acquisition unit 901 is further configured to:

Perform the following processing for each of the at least two node devices:

Process the alarm log reported by the node device to obtain a keyword field representing an alarm prompt;

Based on the alarm prompt represented by the keyword field, determine M first alarm levels for the node device and M first alarm quantities in one-to-one correspondence with the M first alarm levels, where M is greater than or equal to 1;

Based on the M first alarm levels and the M first alarm quantities, update a first alarm dictionary for the node device to obtain the alarm information of the node device; the first alarm dictionary includes N first alarm levels and N first preset alarm quantities in one-to-one correspondence with the N first alarm levels, where N is greater than or equal to M.
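A minimal sketch of the per-node steps above is given below. The keyword-to-level mapping, the choice of 0 as every preset alarm quantity, and all function names are assumptions made for illustration; the application does not specify them.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical mapping from alarm keywords found in a log line to alarm levels.
KEYWORD_LEVELS: Dict[str, str] = {
    "CRITICAL": "critical",
    "MAJOR": "major",
    "MINOR": "minor",
    "WARNING": "warning",
}
_KEYWORD_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORD_LEVELS))


def new_alarm_dictionary() -> Dict[str, int]:
    """First alarm dictionary: N levels, each with an assumed preset quantity of 0."""
    return {level: 0 for level in KEYWORD_LEVELS.values()}


def count_alarm_levels(log_lines: Iterable[str]) -> Counter:
    """Extract the keyword field of each log line and count alarms per level."""
    counts: Counter = Counter()
    for line in log_lines:
        match = _KEYWORD_PATTERN.search(line)
        if match:
            counts[KEYWORD_LEVELS[match.group(0)]] += 1
    return counts


def update_alarm_dictionary(dictionary: Dict[str, int], counts: Counter) -> Dict[str, int]:
    """Add the M observed (level, quantity) pairs to a copy of the dictionary."""
    updated = dict(dictionary)
    for level, quantity in counts.items():
        updated[level] = updated.get(level, 0) + quantity
    return updated
```

Because the dictionary is created with all N levels up front, the M levels observed for a node are always a subset of its keys, which matches the constraint that N is greater than or equal to M.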

In some embodiments, the processing unit 903 is further configured to:

Obtaining alarm information of at least one node device included in the abnormal subsystem, where the alarm information of each node device includes M first alarm levels and M first alarm quantities in one-to-one correspondence with the M first alarm levels;

Based on the M first alarm levels and the M first alarm quantities of each node device in the at least one node device, determining P second alarm levels for the abnormal subsystem and P second alarm quantities in one-to-one correspondence with the P second alarm levels, where P is greater than or equal to M;

Based on the P second alarm levels and the P second alarm quantities, updating a second alarm dictionary for the abnormal subsystem to obtain the alarm information of the abnormal subsystem; the second alarm dictionary includes Q second alarm levels and Q second preset alarm quantities in one-to-one correspondence with the Q second alarm levels, where Q is greater than or equal to P.
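The convergence step can be illustrated with the sketch below, which merges per-node alarm dictionaries into one subsystem-level dictionary. Summing the quantities level by level and using 0 as the preset quantity are assumptions for this example; the function name `converge_alarms` is hypothetical.

```python
from typing import Dict, Iterable, List


def converge_alarms(
    node_alarms: Iterable[Dict[str, int]],
    subsystem_levels: List[str],
) -> Dict[str, int]:
    """Merge node-level alarm dictionaries into a subsystem-level dictionary.

    subsystem_levels holds the Q second alarm levels; each starts at an
    assumed preset quantity of 0 and is increased by the node quantities.
    """
    subsystem: Dict[str, int] = {level: 0 for level in subsystem_levels}
    for alarms in node_alarms:
        for level, quantity in alarms.items():
            # A level seen on a node but missing from the preset list is added.
            subsystem[level] = subsystem.get(level, 0) + quantity
    return subsystem
```

The levels that end up with a non-zero quantity are the union of the levels reported by the individual nodes, which is why the number of subsystem levels P can be no smaller than the number of levels M of any single node.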

In some embodiments, the presentation unit 904 is further configured to:

Obtaining a transaction flow diagram; the at least one abnormal subsystem is shown in the transaction flow diagram;

In the transaction flow diagram, for each abnormal subsystem in the at least one abnormal subsystem, determine the subsystem node to which the abnormal subsystem belongs;

Displaying alarm information of the abnormal subsystem corresponding to the subsystem node.
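As a rough illustration of displaying alarm information on the transaction flow diagram, the sketch below models the diagram as a mapping from each subsystem node to its downstream nodes and renders one text line per node. The data model, the text output format, and the name `annotate_flow` are assumptions for this example only; the application does not prescribe a rendering.

```python
from typing import Dict, List


def annotate_flow(
    flow: Dict[str, List[str]],
    subsystem_alarms: Dict[str, Dict[str, int]],
) -> List[str]:
    """Render each subsystem node, attaching alarm information where present."""
    lines: List[str] = []
    for node, downstream in flow.items():
        alarms = subsystem_alarms.get(node, {})
        # Keep only levels that actually have alarms, in a stable order.
        summary = ", ".join(
            f"{level}={quantity}"
            for level, quantity in sorted(alarms.items())
            if quantity > 0
        )
        label = f"{node} [ALARM: {summary}]" if summary else node
        arrow = (" -> " + ", ".join(downstream)) if downstream else ""
        lines.append(label + arrow)
    return lines
```

Showing the alarm summary next to the node together with its downstream nodes keeps the upstream and downstream context visible when locating the fault point.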

In some embodiments, the data processing device 90 further includes an execution unit configured to: perform the following processing for each of the abnormal subsystems in the at least one abnormal subsystem:

Determining that the node device meeting the first condition in the abnormal subsystem is the node device to be processed;

Judging whether the node device to be processed is faulty;

When the node device to be processed is faulty, sending a first instruction to the node device to be processed, so that the node device to be processed performs a corresponding operation as instructed by the first instruction.

In some embodiments, the execution unit is further configured to:

Determining an alarm score for each of the node devices in the abnormal subsystem;

If the alarm score of a first node device is greater than or equal to a first score threshold, and the alarm score of a second node device is less than or equal to a second score threshold, determining that the first node device is the node device to be processed; the first score threshold is greater than the second score threshold; the first node device is any node device in the abnormal subsystem, and the second node device is any node device in the abnormal subsystem other than the first node device.
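The selection rule above can be sketched as follows. The sketch assumes the alarm scores have already been computed, and it reads "any node device other than the first node device" as applying to every other node in the subsystem; both the interpretation and the function name are assumptions.

```python
from typing import Dict, List


def select_nodes_to_process(
    scores: Dict[str, float],
    first_threshold: float,
    second_threshold: float,
) -> List[str]:
    """Return nodes whose score is high while all other nodes score low."""
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    selected: List[str] = []
    for node, score in scores.items():
        other_scores = [s for other, s in scores.items() if other != node]
        if score >= first_threshold and all(s <= second_threshold for s in other_scores):
            selected.append(node)
    return selected
```

Under this reading, a node is selected only when it stands out from its peers. If several nodes score high at the same time, none is selected, since the problem is then less likely to be confined to one instance.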

In some embodiments, the execution unit is further configured to:

Obtain at least one indicator of the node device to be processed;

For each indicator in the at least one indicator, calculate a first distance between the indicator and the reference value of the indicator to obtain the judgment result of the indicator; wherein, if the first distance is greater than or equal to a first distance threshold, the judgment result is determined to be abnormal; if the first distance is less than the first distance threshold, the judgment result is determined to be normal;

If the judgment result for each of the indicators in the at least one indicator is normal, it is determined that the node device to be processed is not faulty; otherwise, it is determined that the node device to be processed is faulty.
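A minimal sketch of the distance-based judgment is shown below. It takes the first distance to be the absolute difference between the indicator and its reference value, which is an assumption; the application does not fix the distance measure in this passage. The function names are hypothetical.

```python
from typing import Dict


def judge_indicator(value: float, reference: float, distance_threshold: float) -> str:
    """Judge one indicator by its distance from the reference value."""
    distance = abs(value - reference)  # assumed distance measure
    return "abnormal" if distance >= distance_threshold else "normal"


def is_faulty(
    indicators: Dict[str, float],
    references: Dict[str, float],
    thresholds: Dict[str, float],
) -> bool:
    """A node is faulty unless every one of its indicators is judged normal."""
    return any(
        judge_indicator(value, references[name], thresholds[name]) == "abnormal"
        for name, value in indicators.items()
    )
```

For example, with a delay of 120 against a reference of 100 and a threshold of 50, the delay is judged normal; a success rate of 0.80 against a reference of 0.99 and a threshold of 0.05 is judged abnormal, so the node as a whole is treated as faulty.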

In some embodiments, if the judgment result is abnormal, the execution unit is further configured to:

In the case where the indicator includes a time delay, if the time delay of the node device to be processed is less than or equal to a time delay reference value, modify the judgment result for the time delay to normal;

In the case where the indicator includes a success rate, if the success rate of the node device to be processed is greater than or equal to a success rate reference value, modify the judgment result for the success rate to normal.
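The correction above handles deviations in the favorable direction: a time delay below its reference value or a success rate above its reference value can be far from the reference and still be harmless. A short sketch, with the indicator names `delay` and `success_rate` and the function name chosen only for illustration:

```python
def correct_judgment(name: str, value: float, reference: float, result: str) -> str:
    """Reset an 'abnormal' result to 'normal' when the deviation is favorable."""
    if result != "abnormal":
        return result
    if name == "delay" and value <= reference:
        return "normal"  # faster than the reference is not a fault
    if name == "success_rate" and value >= reference:
        return "normal"  # more successful than the reference is not a fault
    return result
```

Indicators other than these two are left unchanged by this sketch, since the passage only describes the correction for time delay and success rate.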

In some embodiments, if the judgment result of the indicator is abnormal, the execution unit is further configured to:

Determine the normal range of indicators based on historical normal indicator data;

Judging whether the indicator of the node device to be processed falls within the normal range;

If the indicator of the node device to be processed falls within the normal range, modify the judgment result of the indicator to normal.
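The historical check can be sketched as below. How the normal range is derived from historical normal data is not specified in this passage; the sketch assumes a range of the mean plus or minus three population standard deviations, and the parameter `k` and both function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def normal_range(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive a normal range from historical normal data (assumed: mean +/- k * std)."""
    mean = statistics.mean(history)
    spread = k * statistics.pstdev(history)
    return mean - spread, mean + spread


def recheck_with_history(value: float, history: Sequence[float], result: str) -> str:
    """Reset an 'abnormal' result to 'normal' if the value lies in the normal range."""
    if result != "abnormal":
        return result
    low, high = normal_range(history)
    return "normal" if low <= value <= high else result
```

Other range definitions, such as the historical minimum and maximum or fixed percentiles, would fit the same two-step structure.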

It should be noted that each unit included in the data processing device provided in the embodiments of the present application can be implemented by a processor in an electronic device or, of course, by a specific logic circuit. In implementation, the processor may be a central processing unit (CPU, Central Processing Unit), a microprocessor (MPU, Micro Processor Unit), a digital signal processor (DSP, Digital Signal Processor), a field-programmable gate array (FPGA, Field-Programmable Gate Array), or the like.

The description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.

It should be noted that, in the embodiments of the present application, if the above data processing method is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present application, or the part that contributes to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.

In order to implement the above data processing method, an embodiment of the present application provides an electronic device including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor, when executing the program, implements the steps of the data processing method provided in the above embodiments.

The structural diagram of the electronic device will be described below with reference to the electronic device 100 shown in FIG. 10 .

In an example, the electronic device 100 may be the above-mentioned electronic device. As shown in FIG. 10, the electronic device 100 includes a processor 1001, at least one communication bus 1002, a user interface 1003, at least one external communication interface 1004, and a memory 1005. The communication bus 1002 is configured to realize connection and communication between these components. The user interface 1003 may include a display screen, and the external communication interface 1004 may include a standard wired interface and a wireless interface.

The memory 1005 is configured to store instructions and applications executable by the processor 1001, and may also cache data to be processed or already processed by the processor 1001 and by the modules of the electronic device (for example, image data, audio data, voice communication data, and video communication data). The memory may be implemented by flash memory (FLASH) or random access memory (Random Access Memory, RAM).

In a fourth aspect, the embodiments of the present application provide a storage medium, that is, a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of the data processing method provided in the above embodiments are implemented.

It should be pointed out here that: the descriptions of the above storage medium and device embodiments are similar to the descriptions of the above method embodiments, and have similar beneficial effects to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of this application, please refer to the description of the method embodiment of this application for understanding.

It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Thus, appearances of "in one embodiment" or "in some embodiments" throughout this specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The serial numbers of the above embodiments of the present application are for description only and do not represent the advantages or disadvantages of the embodiments.

It should be noted that, in this document, the terms "comprising", "including", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent in the process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.

In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a division by logical function; in actual implementation there may be other ways of division, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.

The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve as a single unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.

Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware under the control of program instructions. The aforementioned program can be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The aforementioned storage medium includes removable storage devices, read-only memory (Read Only Memory, ROM), magnetic disks, optical discs, and other media that can store program code.

Alternatively, if the above integrated units of the present application are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present application, or the part that contributes to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as removable storage devices, ROMs, magnetic disks, or optical discs.

The above are only embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any changes or substitutions that a person familiar with the technical field can readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of protection of this application. Therefore, the protection scope of the present application shall be determined by the protection scope of the claims.

Claims

A data processing method, applied to a control device in a data processing system, the data processing system further including a node device, the method comprising:

Acquiring alarm information of at least two node devices included in the data processing system;

determining at least one abnormal subsystem based on the at least two node devices;

Executing first processing for each abnormal subsystem in the at least one abnormal subsystem, so as to obtain the alarm information of the at least one abnormal subsystem; the first processing includes: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem;

For each abnormal subsystem in the at least one abnormal subsystem, correspondingly display the alarm information of the abnormal subsystem.
The method according to claim 1, wherein the acquiring of the alarm information of at least two node devices included in the data processing system comprises:

Performing the following processing for each of the at least two node devices:

Processing the alarm log reported by the node device to obtain a keyword field representing an alarm prompt;

Based on the alarm prompt represented by the keyword field, determining M first alarm levels for the node device and M first alarm quantities in one-to-one correspondence with the M first alarm levels, where M is greater than or equal to 1;

Based on the M first alarm levels and the M first alarm quantities, updating a first alarm dictionary for the node device to obtain the alarm information of the node device; the first alarm dictionary includes N first alarm levels and N first preset alarm quantities in one-to-one correspondence with the N first alarm levels, where N is greater than or equal to M.
The method according to claim 1, wherein the converging of the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem comprises:

Obtaining alarm information of at least one node device included in the abnormal subsystem, where the alarm information of each node device includes M first alarm levels and M first alarm quantities in one-to-one correspondence with the M first alarm levels;

Based on the M first alarm levels and the M first alarm quantities of each node device in the at least one node device, determining P second alarm levels for the abnormal subsystem and P second alarm quantities in one-to-one correspondence with the P second alarm levels, where P is greater than or equal to M;

Based on the P second alarm levels and the P second alarm quantities, updating a second alarm dictionary for the abnormal subsystem to obtain the alarm information of the abnormal subsystem; the second alarm dictionary includes Q second alarm levels and Q second preset alarm quantities in one-to-one correspondence with the Q second alarm levels, where Q is greater than or equal to P.
The method according to claim 1, wherein the correspondingly displaying of the alarm information of the abnormal subsystem for each abnormal subsystem in the at least one abnormal subsystem comprises:

Obtaining a transaction flow diagram; the at least one abnormal subsystem is shown in the transaction flow diagram;

In the transaction flow diagram, for each abnormal subsystem in the at least one abnormal subsystem, determining the subsystem node to which the abnormal subsystem belongs;

Displaying alarm information of the abnormal subsystem corresponding to the subsystem node.
The method according to any one of claims 1-4, further comprising:

Performing the following processing for each abnormal subsystem in the at least one abnormal subsystem:

Determining that the node device meeting the first condition in the abnormal subsystem is the node device to be processed;

judging whether the node device to be processed is faulty;

When the node device to be processed fails, a first instruction is sent to the node device to be processed, so that the node device to be processed performs a corresponding operation under the instruction of the first instruction.
The method according to claim 5, wherein the determining that the node device satisfying the first condition in the abnormal subsystem is the node device to be processed comprises:

Determining an alarm score for each of the node devices in the abnormal subsystem;

If the alarm score of a first node device is greater than or equal to a first score threshold, and the alarm score of a second node device is less than or equal to a second score threshold, determining that the first node device is the node device to be processed; the first score threshold is greater than the second score threshold; the first node device is any node device in the abnormal subsystem, and the second node device is any node device in the abnormal subsystem other than the first node device.
The method according to claim 5, wherein the judging whether the node device to be processed is faulty comprises:

Obtaining at least one indicator of the node device to be processed;

For each indicator in the at least one indicator, calculating a first distance between the indicator and the reference value of the indicator to obtain the judgment result of the indicator; wherein, if the first distance is greater than or equal to a first distance threshold, the judgment result is determined to be abnormal; if the first distance is less than the first distance threshold, the judgment result is determined to be normal;

If the judgment result for each of the indicators in the at least one indicator is normal, it is determined that the node device to be processed is not faulty; otherwise, it is determined that the node device to be processed is faulty.
The method according to claim 7, wherein, if the judgment result of the indicator is abnormal, the method further comprises:

In the case where the indicator includes a time delay, if the time delay of the node device to be processed is less than or equal to a time delay reference value, modifying the judgment result for the time delay to normal;

In the case where the indicator includes a success rate, if the success rate of the node device to be processed is greater than or equal to a success rate reference value, modifying the judgment result for the success rate to normal.
The method according to claim 7, wherein, if the judgment result of the indicator is abnormal, the method further comprises:

Determining the normal range of the indicator based on historical normal indicator data;

Judging whether the indicator of the node device to be processed falls within the normal range;

If the indicator of the node device to be processed falls within the normal range, modifying the judgment result of the indicator to normal.
An electronic device, comprising a memory and a processor, wherein the memory stores a computer program that can run on the processor, and the processor implements the data processing method according to any one of claims 1 to 9 when executing the program.
A storage medium on which a computer program is stored, and when the computer program is executed by a processor, the data processing method according to any one of claims 1 to 9 is realized.