CN111913824A - Method for determining data link fault reason and related equipment - Google Patents

Info

Publication number
CN111913824A
Authority
CN
China
Prior art keywords
data
file
data file
abnormal
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010578137.0A
Other languages
Chinese (zh)
Other versions
CN111913824B (en
Inventor
谢凌杰
陈洁
李颖
李颢
张新
周政明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202010578137.0A
Publication of CN111913824A
Application granted
Publication of CN111913824B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80 Database-specific techniques

Abstract

The invention provides a method for determining the cause of a data link fault and related equipment. The method comprises the following steps: acquiring file information of a target data file; searching a data link full view for the data link corresponding to the target data file according to the file information; acquiring data exception information corresponding to the target data file in each node system on the data link; generating a data exception vector of the target data file on the corresponding data link according to the data exception information; inputting the data exception vector into a fault cause decision model corresponding to the data link; and acquiring the cause of the data link fault of the target data file output by the fault cause decision model. The method can reduce the dependence on the experience of operation and maintenance personnel, accurately determine the root cause of the data link fault, and improve troubleshooting efficiency.

Description

Method for determining data link fault reason and related equipment
Technical Field
The present invention relates to the field of operation and maintenance technologies, and in particular, to a method for determining a cause of a data link failure and a related device.
Background
In recent years, with the continuous expansion of commercial banking business and the growing adoption of big data applications, the amount of data that banking IT (information technology) systems need to process has increased exponentially. As a result, the system pressure on data links keeps growing, and data often cannot be generated and transmitted in time for various reasons, which may significantly affect a bank's important fund business, regulatory reporting, management analysis, and the like.
At present, after a data link fails, the cause of the failure is generally investigated manually. However, this approach is passive, and timeliness is difficult to ensure. Moreover, troubleshooting can only proceed by following the chain step by step: when a downstream system does not receive data, only its upstream system can be checked. If the upstream system confirms that the data was supplied, the upstream and downstream systems need to jointly check whether the transmission tool has a problem. If the upstream system did not supply the data, the system one level further upstream needs to be located and checked for problems, and so on until the cause of the failure is found.
Because different data depend on different upstream data and the complexity of the architecture logic also differs, this approach depends heavily on the experience of operation and maintenance personnel and is limited to faults that have already been exposed. Operation and maintenance personnel sometimes expend a great deal of manpower and effort and still cannot find the root cause of the data link failure.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present invention provide a method and a related system for determining a cause of a data link failure, where a data exception vector of a target data file on a corresponding data link is input into a failure cause decision model corresponding to the data link, so as to accurately determine a cause of a failure of the data link of the target data file, and reduce dependence of failure troubleshooting on experience of operation and maintenance staff.
In a first aspect, an embodiment of the present invention provides a method for determining a cause of a data link failure, where the method includes:
acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
searching a data link corresponding to the target data file from a data link full view according to the file information;
acquiring data exception information corresponding to the target data file in each node system on the data link, wherein the data exception information comprises: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system are abnormal within a set time period, and whether database indexes of the node system are abnormal within a set time period;
generating a data exception vector of the target data file on a corresponding data link according to the data exception information;
inputting the data abnormal vector into a fault reason decision model corresponding to the data link, wherein the fault reason decision model is obtained by training a random forest model by using a plurality of sample data, and the sample data comprises: data abnormal vectors of the sample data file on the data link and a label for identifying a fault reason;
and acquiring the cause of the data link failure of the target data file output by the fault reason decision model.
In one embodiment of the invention, the method further comprises:
monitoring and collecting transmission information of each data file in each node system, wherein the transmission information comprises: the data file information of the current data file, the upstream data file information which the current data file depends on, and the downstream data file information corresponding to the current data file;
and generating the data link full view according to the transmission information of each data file.
In one embodiment of the invention, the method further comprises:
and monitoring and recording data abnormal information of each data file in each node system.
In an embodiment of the present invention, the acquiring data exception information corresponding to the target data file in each node system on the data link includes:
and searching data abnormal information corresponding to the target data file in each node system on the data link from the recorded data abnormal information according to the file information.
In one embodiment of the present invention, the failure cause includes the name of the system in which the fault occurred and the reason that system failed, where the reason includes: system resource shortage, system database abnormality, system job error, system data not generated, and system transmission failure.
In a second aspect, an embodiment of the present invention provides an apparatus for determining a cause of a data link failure, where the apparatus includes:
the file information acquisition module is used for acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
the data link acquisition module is used for searching a data link corresponding to the target data file from a data link full view according to the file information;
an abnormal information obtaining module, configured to obtain data abnormal information corresponding to the target data file in each node system on the data link, where the data abnormal information includes: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system are abnormal within a set time period, and whether database indexes of the node system are abnormal within a set time period;
an abnormal vector generation module, configured to generate a data abnormal vector of the target data file on a corresponding data link according to the data abnormal information;
an abnormal vector input module, configured to input the data abnormal vector into a fault cause decision model corresponding to the data link, where the fault cause decision model is obtained by training a random forest model using multiple sample data, where the sample data includes: data abnormal vectors of the sample data file on the data link and a label for identifying a fault reason;
and the fault cause obtaining module is used for obtaining the cause of the data link fault of the target data file output by the fault cause decision model.
In one embodiment of the invention, the apparatus further comprises:
the transmission information acquisition module is used for monitoring and acquiring transmission information of each data file in each node system, wherein the transmission information comprises: the data file information of the current data file, the upstream data file information which the current data file depends on, and the downstream data file information corresponding to the current data file;
and the data link full view generating module is used for generating the data link full view according to the transmission information of each data file.
In one embodiment of the invention, the apparatus further comprises:
and the data exception information recording module is used for monitoring and recording the data exception information of each data file in each node system.
In an embodiment of the present invention, the acquiring data exception information corresponding to the target data file in each node system on the data link includes:
and searching data abnormal information corresponding to the target data file in each node system on the data link from the recorded data abnormal information according to the file information.
In one embodiment of the present invention, the failure cause includes the name of the system in which the fault occurred and the reason that system failed, where the reason includes: system resource shortage, system database abnormality, system job error, system data not generated, and system transmission failure.
In a third aspect, the present invention provides a computer storage medium having stored thereon computer instructions that can be executed by a processor to implement the method for determining a cause of a data link failure according to any one of the foregoing embodiments.
In a fourth aspect, an embodiment of the present invention provides a computer device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program to implement the method for determining a cause of a data link failure according to any of the foregoing embodiments.
Compared with the prior art, the method provided by the embodiment of the invention has the following beneficial technical effects:
according to the method and the related equipment for determining the data link fault reason provided by the embodiment of the invention, the data link corresponding to the target data file is searched from the data link full view, the data abnormal vector is generated according to the data abnormal information corresponding to the target data file in each node system on the data link, and then the data abnormal vector is input into the fault reason decision model to determine the fault reason of the data link of the target data file, so that the dependence of fault troubleshooting on the experience of operation and maintenance personnel can be reduced, the root cause of the fault of the data link of the target data file can be accurately determined, and the labor and the time consumed by troubleshooting are saved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a full view of a data link according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of determining a cause of a data link failure according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a random forest model according to an embodiment of the invention;
FIG. 4 illustrates a training diagram for training an initial decision tree, according to one embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for determining a cause of a data link failure according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, terms related to the embodiments of the present invention are briefly described.
Job: the data processing component of an IT system, comprising three parts: data receiving, data processing, and data transmission. The processing of data files is mainly implemented through user-defined program logic.
Data link: in IT operation and maintenance practice, a virtual line drawn according to the upstream and downstream dependency relationships of each data file.
Full view of data link: the upstream and downstream dependency graph of multiple data files across multiple systems, usually a mesh-like directed graph. Each data file may have a "one-to-one", "many-to-one", or "one-to-many" relationship with its upstream and downstream data files.
Arrival time and job processing time: each data file has its specific business definition and usage. To satisfy the specific functions of each IT system, each data file usually has a required latest generation time and arrival time, and the corresponding processing job must also be completed within a fixed time range; otherwise, the business is affected.
Fig. 1 is a schematic structural diagram of a data link full view according to an embodiment of the present invention. As shown in fig. 1, A, B1, B2, and C are four node systems. The A system is the most upstream supply system and generates a plurality of data files, some of which are supplied to the B1 system and some to the B2 system. The B1 system and the B2 system process the received data files through different jobs and then transmit the processed data files to the downstream C system. The C system processes the received data files through a job, thereby generating a final data file.
If the target data file does not arrive on time or an error occurs, that is, the data link of the target data file fails, an alarm may be generated. The data link of the target data file may fail for many reasons: a problem may occur in the file transmission of an upstream node system of the data file, a problem may occur in the job processing of an upstream node system of the data file, a data file in an upstream node system may not be generated in time, and so on.
In order to determine the root cause of the data link failure of the target data file, the present embodiment provides a method for determining the cause of the data link failure. Fig. 2 shows a flow chart of a method of determining a cause of a data link failure according to an embodiment of the invention. As shown in fig. 2, the method for determining the cause of the data link failure according to this embodiment includes:
S101: file information of a target data file is obtained, wherein the file information uniquely identifies the target data file.
The target data file may be the final data file that needs to be obtained. The terminal node system of the data link can monitor whether the final data file arrives on time and whether an error occurs, that is, whether the target data file is abnormal. If the target data file is abnormal, the data link corresponding to the target data file is judged to have failed, and an alarm is generated. When an alarm occurs, the file information of the target data file that caused the alarm can be obtained from the alarm information. The file information is a unique identifier of the target data file and can be formed by combining the system where the target data file is located with the path and the name of the target data file.
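As a minimal sketch of such an identifier, the following Python snippet combines the system name, path, and file name into one key. The class name FileInfo, the key format, and the sample values are hypothetical and are not specified by this embodiment.

```python
from typing import NamedTuple


class FileInfo(NamedTuple):
    """Hypothetical container for the three parts of the file information."""
    system: str  # name of the node system where the file is located
    path: str    # directory of the file inside that system
    name: str    # file name

    def key(self) -> str:
        # Assumed key format "system:path/name"; any unambiguous format would do.
        return f"{self.system}:{self.path.rstrip('/')}/{self.name}"


info = FileInfo(system="C", path="/data/out/", name="final_report.dat")
print(info.key())
```

Keeping the three parts separate makes it possible to look up records by system as well as by full key.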
S102: and searching a data link corresponding to the target data file from a data link full view according to the file information.
Specifically, the pre-generated data link full view is obtained, and the data link corresponding to the target data file is looked up in it according to the file information of the target data file.
The full view of the data link represents the upstream and downstream dependency relationships of each data file in each system, and in an implementation manner of this embodiment, the full view of the data link may be generated in advance by:
Each data file in each system is monitored, and the transmission information of each data file is collected. After the transmission information has been collected, a data link full view representing the upstream and downstream dependency relationships of the data files is generated from it.
The systems may be cross-platform systems, such as Linux systems, HP-UX systems, AIX systems, Windows, and so on. A system may include multiple hosts, and a proxy script (e.g., a shell script or a python script) may be deployed on each host of each system to collect transmission information of each data file. For a system, the transmission information of the collected data file may include data file information of the data file itself in the system, for example, the name of the data file, the name of the system where the data file is located, and the path; upstream data file information on which the data file depends, such as the name of the upstream data file, the system and the path in which the upstream data file resides; and downstream data file information corresponding to the data file, such as the name of the downstream data file, the system and the path where the downstream data file is located.
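The generation of the full view from collected transmission information can be sketched as follows. This is only an illustration under assumptions: the record layout (dictionaries with "file", "upstream", and "downstream" keys), the use of (system, name) tuples as file identifiers, and the function name build_full_view are all hypothetical.

```python
def build_full_view(records):
    """Build a directed dependency graph from transmission records.

    Returns two adjacency maps: file -> set of upstream files,
    and file -> set of downstream files.
    """
    upstream, downstream = {}, {}

    def link(up, down):
        # Register one directed edge up -> down in both maps.
        upstream.setdefault(down, set()).add(up)
        downstream.setdefault(up, set()).add(down)
        upstream.setdefault(up, set())
        downstream.setdefault(down, set())

    for rec in records:
        cur = rec["file"]
        upstream.setdefault(cur, set())
        downstream.setdefault(cur, set())
        for up in rec.get("upstream", ()):
            link(up, cur)
        for down in rec.get("downstream", ()):
            link(cur, down)
    return upstream, downstream


# Hypothetical records as the agent scripts on two hosts might report them.
records = [
    {"file": ("A", "a1.dat"), "upstream": [], "downstream": [("B1", "b1.dat")]},
    {"file": ("B1", "b1.dat"), "upstream": [("A", "a1.dat")],
     "downstream": [("C", "final.dat")]},
]
upstream, downstream = build_full_view(records)
```

Storing both directions means the graph can be walked upstream for troubleshooting and downstream for impact analysis.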
S103: acquiring data exception information corresponding to the target data file in each node system on the data link, wherein the data exception information comprises: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal or not, whether work processing of the node system in the process of generating the intermediate data file is abnormal or not, whether file transmission of the node system to the intermediate data file is abnormal or not, whether system resources of the node system in a set time period are abnormal or not, and whether database indexes of the node system in the set time period are abnormal or not.
The data link corresponding to the target data file is a directional data link, and the node systems on the data link may sequentially include a first-level node system, a second-level node system, …, an nth-level intermediate node system, and other node systems. The data files of the adjacent node systems have upstream and downstream dependency relationships or corresponding relationships. For a target data file, a data file corresponding to the target data file in each node system on a corresponding data link is an intermediate data file of the target data file. The data link corresponding to the target data file can be determined according to the file information of the target data file, and then the node system name on the data link and the file information of the intermediate data file corresponding to the target data file in the node systems are determined.
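One possible way to extract the data link of a target data file from the full view is an upstream traversal of the dependency graph, sketched below. The graph layout (a map from each file to its set of upstream files, with files written as (system, name) tuples) and the function names are assumptions made for illustration.

```python
def find_data_link(upstream, target):
    """Return the target file and all files it depends on, most upstream first."""
    ordered, seen = [], set()

    def visit(f):
        if f in seen:  # also protects against accidental cycles
            return
        seen.add(f)
        for up in sorted(upstream.get(f, ())):
            visit(up)
        ordered.append(f)  # appended only after everything it depends on

    visit(target)
    return ordered


def systems_on_link(ordered_files):
    """Node system names on the link, in first-seen order, without duplicates."""
    return list(dict.fromkeys(system for system, _ in ordered_files))


# Hypothetical full view matching the A, B1, B2, C layout of fig. 1.
upstream = {
    ("C", "final.dat"): {("B1", "b1.dat"), ("B2", "b2.dat")},
    ("B1", "b1.dat"): {("A", "a1.dat")},
    ("B2", "b2.dat"): {("A", "a2.dat")},
}
link = find_data_link(upstream, ("C", "final.dat"))
print(systems_on_link(link))
```

Every file in the result other than the target itself plays the role of an intermediate data file of the target.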
In the process of implementing the embodiment of the present invention, the inventor found that a target data file may fail for many reasons, mainly the following:
1. the intermediate data file of the upstream node system is not generated;
2. job execution in the upstream node system times out because of insufficient system resources, job scheduling congestion, job processing logic errors, abnormal data formats, and the like;
3. the file transmission between the upstream node system and the downstream node system is abnormal, or the transmission is interrupted by a network failure, so that the intermediate data file generated by the upstream node system is not successfully transmitted to the downstream node system;
4. the downstream node system cannot successfully receive the intermediate data file sent from upstream because of insufficient disk space, insufficient system resources, abnormal transmission components, and the like;
5. a database abnormality in the node system causes file reading and writing to be abnormal.
Therefore, after determining the data link corresponding to the target data file, the node system name on the data link, and the file information of the intermediate data file corresponding to the target data file in the node system, the following data exception information may be collected in the node system on the data link in this embodiment:
1. whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal (i.e., whether file generation is abnormal). For example, whether the intermediate data file under the specified directory of the node system is generated or not may be monitored, and the size of the intermediate data file may be obtained to determine whether the generation of the intermediate data file is abnormal or not. If the intermediate data file in the designated directory is not generated or the size of the generated intermediate data file is abnormal, for example, 0KB, it is determined that the generation of the intermediate data file is abnormal. The designated directory may be a directory in the node system, where the intermediate data file corresponding to the target data file is to be stored.
In some embodiments, if there are a plurality of intermediate data files corresponding to the target data file in the node system, the information on whether file generation of each intermediate data file is abnormal may be aggregated. For example, if the intermediate data files corresponding to the target data file in the B system are file 1 and file 2, the generation of file 1 is normal (value 1), and the generation of file 2 is abnormal (value 0), a logical AND operation may be performed on the values of file 1 and file 2, so that file generation of the intermediate data files corresponding to the target data file in the node system is determined to be abnormal (the value after the AND operation is 0).
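The file generation check and the AND aggregation described above can be sketched as follows. The function names and the example paths are hypothetical; the rule that a missing or 0-byte file counts as abnormal follows the text, with 1 meaning normal and 0 meaning abnormal.

```python
import os


def file_generated_ok(path):
    """Return 1 if the file exists and is non-empty, otherwise 0 (abnormal)."""
    return 1 if os.path.isfile(path) and os.path.getsize(path) > 0 else 0


def aggregate_and(flags):
    """Logical AND over 0/1 flags: one abnormal file makes the node abnormal."""
    return int(all(flags))


# Hypothetical intermediate files of the target data file in one node system.
paths = ["/data/b/file1.dat", "/data/b/file2.dat"]
node_file_gen_flag = aggregate_and(file_generated_ok(p) for p in paths)
```

The same aggregate_and helper could serve for the job processing and file transmission flags when a node holds several intermediate data files.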
2. Whether the job processing of the node system is abnormal in the process of generating the intermediate data file corresponding to the target data file. For example, a script may analyze the job log in the node system to determine whether the job processing of the intermediate data file corresponding to the target data file is abnormal. When there are a plurality of intermediate data files corresponding to the target data file, the information on whether the job processing of each intermediate data file is abnormal may be aggregated to determine whether the job processing of the node system is abnormal in the process of generating the intermediate data files corresponding to the target data file.
3. Whether the file transmission of the intermediate data file corresponding to the target data file by the node system is abnormal. For example, a script may analyze the transmission log to determine whether the file transmission of the intermediate data file corresponding to the target data file in the node system is abnormal. When there are a plurality of intermediate data files corresponding to the target data file, the information on whether the file transmission of each intermediate data file is abnormal may be aggregated to determine whether the file transmission of the intermediate data files corresponding to the target data file in the node system is abnormal.
4. Whether the system resources of the node system are abnormal within a set time period. The system resources may include CPU usage, memory usage, disk IO response time, file usage, and the like. Whether a system resource is abnormal may be determined by judging whether its average value in the set time period is greater than a set threshold. For example, taking the time period from 10:30 to 11:00, whether the average CPU usage in that period is greater than a set threshold is judged to determine whether CPU resources are in shortage. On the same principle, whether memory resources are in shortage can be determined by judging whether the memory usage in the set time period is greater than the set threshold, whether disk IO is abnormal can be determined by judging whether the disk IO response time in the set time period is greater than the set threshold, and so on. Then, the exception information of each system resource may be aggregated to determine whether the system resources of the node system are abnormal. For example, when any one of the system resources is abnormal, it may be determined that the system resources of the node system are abnormal.
The set time period for collecting system resource items such as CPU usage, memory usage, disk IO response time, and file usage may be determined by information of an intermediate data file corresponding to a target data file in a node system, for example, the set time period may be set according to the arrival time of the set intermediate data file.
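The threshold check on system resources can be sketched as below. The metric names, sample values, and thresholds are made up for illustration; only the rule itself (average over the set time period compared with a set threshold, and any single abnormal resource making the node abnormal) comes from the description.

```python
from statistics import mean


def resource_flag(samples, thresholds):
    """Return 1 if every resource is normal, 0 if any resource is abnormal.

    samples:    metric name -> values collected within the set time period
    thresholds: metric name -> upper limit for the average of that metric
    """
    for metric, values in samples.items():
        if values and mean(values) > thresholds[metric]:
            return 0  # one abnormal resource is enough
    return 1


# Hypothetical samples collected between 10:30 and 11:00.
samples = {
    "cpu_usage": [0.95, 0.97, 0.92],
    "mem_usage": [0.50, 0.60, 0.55],
    "disk_io_ms": [12.0, 15.0, 11.0],
}
thresholds = {"cpu_usage": 0.90, "mem_usage": 0.80, "disk_io_ms": 50.0}
print(resource_flag(samples, thresholds))
```

The 1 for normal and 0 for abnormal convention matches the one used for the data exception vector.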
5. Whether the database indexes of the node system are abnormal within a set time period (that is, whether the database is abnormal). The database indexes may include whether there is an overlong SQL statement, whether there is a large transaction, whether a session is blocked, whether there is a deadlock, whether there is an invalid index, and the like, and the database index information for the set time period may be obtained through a script. Whether the database of the node system is abnormal can then be determined according to the database index information. For example, when any one of the index items is abnormal, it may be determined that the database of the node system is abnormal. The set time period for collecting the database indexes may be determined by the information of the intermediate data file corresponding to the target data file in the node system; for example, it may be set according to the job processing time of the intermediate data file.
In one implementation manner of this embodiment, data abnormality information, such as file generation information, file transmission information, job processing information, resource information of each system, database index information, and the like, of each data file in each system may be monitored and recorded at intervals. Therefore, when the data link corresponding to the target data file is found according to the file information of the target data file, the intermediate data file corresponding to the target data file can be found from the recorded information, and the data exception information corresponding to the target data file is further obtained.
For example, a collection script may be set on each host of each system, data abnormality information of each data file in the system may be collected at intervals, and the collected data abnormality information of each data file may be recorded in the data abnormality information recording table. After the file information of the target data file and the corresponding data link are obtained, the data abnormal information of the target data file on the corresponding data link can be searched from the data abnormal information recording table according to the file information of the target data file.
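A data abnormal information recording table and the lookup by file information might look like the following sketch, which uses an in-memory SQLite database from the Python standard library. The table name, column names, and timestamp format are assumptions; the embodiment does not prescribe a storage technology.

```python
import sqlite3


def open_record_table():
    """Create an in-memory table holding one row per collection run."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE anomaly_record (
               system TEXT, file_key TEXT, collected_at TEXT,
               file_gen_ok INTEGER, job_ok INTEGER, transfer_ok INTEGER,
               resource_ok INTEGER, db_ok INTEGER)"""
    )
    return conn


def record(conn, system, file_key, collected_at, flags):
    """Store the five 0/1 flags collected for one data file at one time."""
    conn.execute(
        "INSERT INTO anomaly_record VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (system, file_key, collected_at, *flags),
    )


def latest_flags(conn, system, file_key):
    """Return the most recently recorded flags for a file, or None."""
    row = conn.execute(
        "SELECT file_gen_ok, job_ok, transfer_ok, resource_ok, db_ok "
        "FROM anomaly_record WHERE system = ? AND file_key = ? "
        "ORDER BY collected_at DESC LIMIT 1",
        (system, file_key),
    ).fetchone()
    return list(row) if row else None


conn = open_record_table()
record(conn, "B", "/data/b/file1.dat", "2020-06-01T10:30:00", [1, 1, 1, 1, 1])
record(conn, "B", "/data/b/file1.dat", "2020-06-01T11:00:00", [0, 1, 1, 0, 0])
print(latest_flags(conn, "B", "/data/b/file1.dat"))
```

ISO 8601 timestamps are used here because they sort correctly as plain text.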
S104: generating a data exception vector of the target data file on the corresponding data link according to the data exception information.
After the data anomaly information is obtained, the data anomaly information may be processed, e.g., digitized and normalized, to form a data anomaly vector for the target data file on its corresponding data link. The data exception vector may be a data string composed of 0 and 1, each bit representing a kind of data exception information in a node system, where 0 may represent exception and 1 may represent normal.
For example, file 1 corresponds to the data link A->B->C, and the data exception vector of file 1 on its corresponding data link is: [1,1,1,1,1,0,1,1,0,0,0,0,0,0]. In the general case, the first 5 bits respectively indicate whether file generation, job processing, file transfer, system resources, and the database are abnormal in system A, the middle 5 bits indicate the same five items in system B, and the last 5 bits indicate the same five items in system C. Of course, for the target data file, the file generation information of the end node system C may be the default, that is, the file generation exception information of the C node system may be omitted, so as to obtain a 14-bit data exception vector such as the one shown above.
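The construction of the data exception vector can be sketched as follows. The function name, the check names, and the nested-dictionary input are assumptions for illustration; the encoding follows the convention stated above, in which 0 represents abnormal and 1 represents normal.

```python
from typing import Dict, List, Sequence

# Order of the five items recorded for each node system (names are hypothetical).
CHECKS = ("generation", "job", "transfer", "resources", "database")


def build_exception_vector(
    link: Sequence[str],
    abnormal: Dict[str, Dict[str, bool]],
    omit_end_generation: bool = False,
) -> List[int]:
    """Flatten per-system check results into a 0/1 vector in data link order.

    abnormal[system][check] is True when that check is abnormal.
    Encoding: 0 = abnormal, 1 = normal.
    """
    vector: List[int] = []
    for position, system in enumerate(link):
        is_end_node = position == len(link) - 1
        for check in CHECKS:
            if omit_end_generation and is_end_node and check == "generation":
                continue  # file generation of the end node system is left out
            vector.append(0 if abnormal[system][check] else 1)
    return vector
```

With a three-system link and `omit_end_generation=True`, this yields a 14-bit vector; without the option it yields 15 bits.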
S105: inputting the data exception vector into a fault cause decision model corresponding to the data link, wherein the fault cause decision model is obtained by training a random forest model using multiple groups of sample data, and the sample data comprises: data exception vectors of sample data files on the data link and labels identifying the fault cause.
Specifically, for a target data file, data exception information items on a corresponding data link of the target data file may be correlated, and it is often impossible to determine which reason causes the target data file exception, that is, the root cause of the data link failure of the target data file, only according to the data exception vector of the target data file. For example, if the data exception vector corresponding to the target data file is [0,0,0,1,1,0,0,0,1,1,0,0,0,1], the root cause of the data link failure of the target data file cannot be determined according to the data exception vector.
In order to determine the root cause of the data link failure of the target data file, the present embodiment inputs the data anomaly vector of the target data file into the failure cause decision model corresponding to the data link to determine the root cause of the data link failure of the target data file.
The fault cause decision model can be obtained in advance by training the random forest model on the data exception vectors of the sample data files and their fault cause labels.
The failure cause label may be composed of the name of the failed system and the reason why that system failed. Reasons for the failure of a system may include: system resource shortage, system database exception, system operation error, system data non-generation, system transmission failure, etc. For example, the failure cause label may be sys_A_gen, which indicates that the failure cause is a file generation failure of system A.
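A label of this form can be composed and parsed as sketched below. Only the `sys_A_gen` example comes from the text; the other reason codes (`res`, `db`, `job`, `trans`) and the helper names are hypothetical and are introduced purely for illustration.

```python
from typing import Tuple

# Reason codes. Only "gen" appears in the text; the rest are assumed codes.
REASONS = {
    "gen": "system data non-generation",
    "res": "system resource shortage",
    "db": "system database exception",
    "job": "system operation error",
    "trans": "system transmission failure",
}


def make_label(system: str, reason_code: str) -> str:
    """Compose a failure cause label such as 'sys_A_gen'."""
    if reason_code not in REASONS:
        raise ValueError(f"unknown reason code: {reason_code}")
    return f"sys_{system}_{reason_code}"


def parse_label(label: str) -> Tuple[str, str]:
    """Split a label into (system name, human-readable reason)."""
    prefix, rest = label.split("_", 1)
    if prefix != "sys":
        raise ValueError(f"not a failure cause label: {label}")
    # The reason code is the last component, so system names may contain '_'.
    system, reason_code = rest.rsplit("_", 1)
    return system, REASONS[reason_code]
```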
The random forest model is an ensemble learning model: it completes a learning task by constructing and combining a plurality of decision trees, and it generalizes better and is more accurate than a single decision tree model. Fig. 3 is a schematic diagram of a random forest model according to an embodiment of the present invention. As shown in fig. 3, in this embodiment, a plurality of mutually independent decision trees are generated, and the data exception vector is input into each decision tree to obtain the determination result of each tree. The final determination result is then decided by majority voting (the "minority obeys the majority" principle): if the number of decision trees that give the same determination result is greater than a set threshold (for example, more than half), that result is taken as the final result; otherwise, the determination result of each decision tree is output for manual confirmation. The root cause of the data link failure is determined in this way. A training diagram for training an initial decision tree is shown in fig. 4.
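The voting step can be sketched as follows. This is a minimal illustration of the thresholded vote only, not of the training of the trees; the function name and the expression of the threshold as a fraction of the number of trees are assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple


def vote(
    tree_results: List[str], threshold_ratio: float = 0.5
) -> Tuple[Optional[str], List[str]]:
    """Combine per-tree determination results.

    Returns (final_label, tree_results). final_label is None when no label is
    returned by more than threshold_ratio of the trees, in which case the
    individual results are handed over for manual confirmation.
    """
    if not tree_results:
        raise ValueError("at least one decision tree result is required")
    label, count = Counter(tree_results).most_common(1)[0]
    if count > threshold_ratio * len(tree_results):
        return label, tree_results
    return None, tree_results
```

With the default threshold, two out of three trees agreeing is enough to fix the result, whereas two out of four is not and falls back to manual confirmation.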
In a feasible implementation manner of this embodiment, different failure cause decision models may be trained through different sample data files, and each failure cause decision model corresponds to one data link. After the data link corresponding to the target data file is determined, the data abnormal vector of the target data file can be input into the fault cause decision model corresponding to the data link to determine the root cause of the fault of the data link of the target data file.
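The correspondence between data links and fault cause decision models can be sketched as a simple registry. The class name and the representation of a model as a callable from vector to label are assumptions; any trained model exposing an equivalent prediction call could be stored in the same way.

```python
from typing import Callable, Dict, Sequence, Tuple

# A decision model is represented here as any callable mapping a vector to a label.
DecisionModel = Callable[[Sequence[int]], str]


class DecisionModelRegistry:
    """Keeps one fault cause decision model per data link (hypothetical helper)."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, ...], DecisionModel] = {}

    def register(self, link: Sequence[str], model: DecisionModel) -> None:
        self._models[tuple(link)] = model

    def diagnose(self, link: Sequence[str], vector: Sequence[int]) -> str:
        key = tuple(link)
        if key not in self._models:
            raise KeyError(f"no decision model trained for link {'->'.join(key)}")
        return self._models[key](vector)
```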
S106: acquiring the cause of the data link fault of the target data file output by the fault cause decision model.
After the data exception vector of the target data file on the corresponding data link is input into the fault cause decision model corresponding to the data link, the fault cause decision model may output a fault cause label of the target data file, for example, sys_A_gen, which indicates that the root cause of the data link fault of the target data file is a file generation fault of system A.
Fig. 5 is a schematic diagram illustrating an apparatus for determining a cause of a data link failure according to an embodiment of the present invention. As shown in fig. 5, the apparatus 10 for determining a cause of a data link failure according to the present embodiment may include: a file information acquisition module 11, a data link acquisition module 12, an abnormal information acquisition module 13, an abnormal vector generation module 14, an abnormal vector input module 15 and a fault cause acquisition module 16.
the file information acquisition module 11 is configured to acquire file information of a target data file, where the file information uniquely identifies the target data file;
the data link acquisition module 12 is configured to search a data link corresponding to the target data file from a data link full view according to the file information;
the abnormal information acquisition module 13 is configured to obtain data abnormal information corresponding to the target data file in each node system on the data link, where the data abnormal information includes: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system are abnormal within a set time period, and whether database indexes of the node system are abnormal within a set time period;
the abnormal vector generation module 14 is configured to generate a data abnormal vector of the target data file on a corresponding data link according to the data abnormal information;
the abnormal vector input module 15 is configured to input the data abnormal vector into a fault cause decision model corresponding to the data link, where the fault cause decision model is obtained by training a random forest model using multiple sets of sample data, where the sample data includes: data abnormal vectors of the sample data file on the data link and a label for identifying a fault cause;
the fault cause acquisition module 16 is configured to obtain the cause of the data link failure of the target data file output by the fault cause decision model.
In one implementation manner of the present embodiment, the apparatus 10 further includes:
a transmission information acquisition module, configured to monitor and collect transmission information of each data file in each node system, where the transmission information includes: the data file information of the current data file, the upstream data file information on which the current data file depends, and the downstream data file information corresponding to the current data file;
and a data link full view generating module, configured to generate the data link full view according to the transmission information of each data file.
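The generation of the data link full view from the collected transmission information, and the search for the data link of a target data file, can be sketched as follows. The record fields and function names are hypothetical, and the sketch makes two simplifying assumptions that the source does not state: each data file depends on at most one upstream file, and the target data file sits at the end of its data link.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class TransmissionInfo(NamedTuple):
    """Transmission information of one data file (field names are hypothetical)."""
    file_id: str
    system: str                  # node system that produces the file
    upstream: Optional[str]      # file this file depends on, if any
    downstream: Optional[str]    # file generated from this file, if any


def build_full_view(records: Iterable[TransmissionInfo]) -> Dict[str, TransmissionInfo]:
    """Index the transmission information by file so that links can be traced."""
    return {record.file_id: record for record in records}


def find_data_link(view: Dict[str, TransmissionInfo], target_file_id: str) -> List[str]:
    """Walk upstream from the target file and return the node systems in link order."""
    systems: List[str] = []
    seen = set()
    current: Optional[str] = target_file_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cyclic dependency at file {current}")
        seen.add(current)
        info = view[current]
        systems.append(info.system)
        current = info.upstream
    systems.reverse()  # source system first, end node system last
    return systems
```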
In one implementation manner of the present embodiment, the apparatus 10 further includes:
and the data exception information recording module is used for monitoring and recording the data exception information of each data file in each node system.
In an implementation manner of this embodiment, the acquiring data exception information corresponding to the target data file in each node system on the data link includes:
and searching data abnormal information corresponding to the target data file in each node system on the data link from the recorded data abnormal information according to the file information.
In an implementation manner of this embodiment, the failure cause includes: the name of the failed system and the reason why the failed system failed, where the reason includes: system resource shortage, system database exception, system operation error, system data non-generation, and system transmission failure.
The apparatus for determining the cause of the data link failure in this embodiment may be used to implement the technical solution of the above method embodiment of the present invention, and the implementation principle and the technical effect are similar, which are not described herein again.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by combining software and a hardware platform. With this understanding in mind, all or part of the contribution that the technical solutions of the present invention make over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Yet another embodiment of the present invention provides a computer storage medium, such as a hard disk, an optical disk, a flash memory, a floppy disk, a magnetic tape, etc., having computer readable instructions stored thereon, which can be executed by a processor to implement the method for determining the cause of a data link failure according to any of the above-mentioned embodiments.
Yet another embodiment of the present invention provides a computer apparatus including:
a memory having a computer program stored thereon,
a processor which can execute the computer program to implement the method of determining a cause of a data link failure as described in any of the above embodiments.
The terms and expressions used in the present specification are used as terms of illustration only and are not meant to be limiting. It will be appreciated by those skilled in the art that changes could be made to the details of the above-described embodiments without departing from the underlying principles thereof. The scope of the invention is, therefore, indicated by the appended claims, in which all terms are intended to be interpreted in their broadest reasonable sense unless otherwise indicated.

Claims (12)

1. A method of determining a cause of a data link failure, the method comprising:
acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
searching a data link corresponding to the target data file from a data link full view according to the file information;
acquiring data exception information corresponding to the target data file in each node system on the data link, wherein the data exception information comprises: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system are abnormal within a set time period, and whether database indexes of the node system are abnormal within a set time period;
generating a data exception vector of the target data file on a corresponding data link according to the data exception information;
inputting the data abnormal vector into a fault reason decision model corresponding to the data link, wherein the fault reason decision model is obtained by training a random forest model by using a plurality of sample data, and the sample data comprises: data abnormal vectors of the sample data file on the data link and a label for identifying a fault reason;
and acquiring the reason of the data link fault of the target data file output by the fault reason decision model.
2. The method of claim 1, further comprising:
monitoring and collecting transmission information of each data file in each node system, wherein the transmission information comprises: the data file information of the current data file, the upstream data file information which the current data file depends on, and the downstream data file information corresponding to the current data file;
and generating the data link full view according to the transmission information of each data file.
3. The method of claim 2, further comprising:
and monitoring and recording data abnormal information of each data file in each node system.
4. The method according to claim 3, wherein the obtaining of the data exception information corresponding to the target data file in each node system on the data link comprises:
and searching data abnormal information corresponding to the target data file in each node system on the data link from the recorded data abnormal information according to the file information.
5. The method of claim 1, wherein the cause of failure comprises: the name of the failed system and the reason why the failed system failed, the reason comprising: system resource shortage, system database exception, system operation error, system data non-generation, and system transmission failure.
6. An apparatus for determining a cause of a data link failure, the apparatus comprising:
the file information acquisition module is used for acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
the data link acquisition module is used for searching a data link corresponding to the target data file from a data link full view according to the file information;
an abnormal information obtaining module, configured to obtain data abnormal information corresponding to the target data file in each node system on the data link, where the data abnormal information includes: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system are abnormal within a set time period, and whether database indexes of the node system are abnormal within a set time period;
an abnormal vector generation module, configured to generate a data abnormal vector of the target data file on a corresponding data link according to the data abnormal information;
an abnormal vector input module, configured to input the data abnormal vector into a fault cause decision model corresponding to the data link, where the fault cause decision model is obtained by training a random forest model using multiple sample data, where the sample data includes: data abnormal vectors of the sample data file on the data link and a label for identifying a fault reason;
and the fault cause obtaining module is used for obtaining the cause of the data link fault of the target data file output by the fault cause decision model.
7. The apparatus of claim 6, further comprising:
the transmission information acquisition module is used for monitoring and acquiring transmission information of each data file in each node system, wherein the transmission information comprises: the data file information of the current data file, the upstream data file information which the current data file depends on, and the downstream data file information corresponding to the current data file;
and the data link full view generating module is used for generating the data link full view according to the transmission information of each data file.
8. The apparatus of claim 7, further comprising:
and the data exception information recording module is used for monitoring and recording the data exception information of each data file in each node system.
9. The apparatus according to claim 8, wherein the obtaining of the data exception information corresponding to the target data file in each node system on the data link comprises:
and searching data abnormal information corresponding to the target data file in each node system on the data link from the recorded data abnormal information according to the file information.
10. The apparatus of claim 6, wherein the cause of failure comprises: the name of the failed system and the reason why the failed system failed, the reason comprising: system resource shortage, system database exception, system operation error, system data non-generation, and system transmission failure.
11. A computer storage medium having stored thereon computer instructions executable by a processor to perform the method of any one of claims 1 to 5.
12. A computer device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program to implement the method of any one of claims 1 to 5.
CN202010578137.0A 2020-06-23 2020-06-23 Method for determining data link fault cause and related equipment Active CN111913824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010578137.0A CN111913824B (en) 2020-06-23 2020-06-23 Method for determining data link fault cause and related equipment

Publications (2)

Publication Number Publication Date
CN111913824A true CN111913824A (en) 2020-11-10
CN111913824B CN111913824B (en) 2024-03-05

Family

ID=73226479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010578137.0A Active CN111913824B (en) 2020-06-23 2020-06-23 Method for determining data link fault cause and related equipment

Country Status (1)

Country Link
CN (1) CN111913824B (en)

Also Published As

Publication number Publication date
CN111913824B (en) 2024-03-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant