CN117130812A - System fault detection method, apparatus, device, medium and program product - Google Patents

System fault detection method, apparatus, device, medium and program product

Info

Publication number
CN117130812A
Authority
CN
China
Prior art keywords
log data
log
alarm
type
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311029019.4A
Other languages
Chinese (zh)
Inventor
陈栋
周莉
贾渤
梁浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202311029019.4A
Publication of CN117130812A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a system fault detection method, apparatus, device, medium and program product, which can be applied to the technical field of computers. The method comprises the following steps: continuously collecting different types of log data in the running state of the system at a certain time interval; generating a link graph according to the log data, wherein the log data collected in different time intervals corresponds to different nodes in the link graph, and each node stores the different types of log data of its time interval; in response to an alarm signal of a system alarm, determining a corresponding node from the link graph according to time information of the alarm signal; determining, according to the alarm type of the alarm signal, at least one type of log data in the node that corresponds to the alarm type; determining abnormal log data corresponding to the alarm type from the at least one type of log data; and performing log analysis according to the abnormal log data.

Description

System fault detection method, apparatus, device, medium and program product
Technical Field
The present disclosure relates to the technical field of computers, and more particularly, to a system fault detection method, apparatus, device, medium, and program product.
Background
With the evolution of automation technologies, the deployment of cloud infrastructure, and the adoption of micro-service architectures, the number of deployed systems is growing exponentially. When a production system fails, analysis often shows that a large share of the causes lie in faults of upstream or downstream systems, or in faults of batch tasks within the system. In addition, because a large number of users access the system concurrently and their access is unpredictable, the system may generate a large number of logs at any given moment in the production environment.
Therefore, when the system fails and the first-line operation and maintenance personnel receive the system alarm, they must spend a great deal of time collecting logs from every part of the system and tracing the logs backward for correlation analysis. Log analysis is slow, the system remains unavailable for a long period, and problems are handled inefficiently.
Disclosure of Invention
In view of the foregoing, embodiments of the present disclosure provide a system fault detection method, apparatus, device, medium, and program product that improve system fault detection efficiency, for at least partially solving the foregoing technical problems.
According to a first aspect of an embodiment of the present disclosure, there is provided a system fault detection method, including: continuously collecting different types of log data in the running state of the system at a certain time interval; generating a link graph according to the log data, wherein the log data collected in different time intervals corresponds to different nodes in the link graph, and each node stores the different types of log data of its time interval; in response to an alarm signal of a system alarm, determining a corresponding node from the link graph according to time information of the alarm signal; determining, according to the alarm type of the alarm signal, at least one type of log data in the node that corresponds to the alarm type; determining abnormal log data corresponding to the alarm type from the at least one type of log data; and performing log analysis according to the abnormal log data.
According to an embodiment of the present disclosure, determining, in response to an alarm signal of a system alarm, a corresponding node from the link graph according to time information of the alarm signal includes: determining, according to the time information, the node corresponding to the time information from the link graph, wherein the time information falls within the time interval corresponding to that node; and determining, according to the time information, the nodes preceding the time information from the link graph.
According to an embodiment of the present disclosure, after generating the link graph according to the log data, wherein the log data collected in different time intervals corresponds to different nodes in the link graph and each node stores the different types of log data of its time interval, the method further includes: arranging the plurality of nodes in the link graph sequentially in time order; and generating identification information for each node according to the different types of log data.
According to an embodiment of the present disclosure, the identification information includes a tag and an attribute, wherein the tag represents the running state of the system and the attribute represents the different types of log data.
According to an embodiment of the present disclosure, the different types of log data include: a website server state log, a website server running log, a log of the website server's database requests and the database's responses, a system external request log, a log of the system's requests to upstream and downstream systems and their responses, and a system running error log.
According to an embodiment of the present disclosure, determining, according to the alarm type of the alarm signal, at least one type of log data in the node that corresponds to the alarm type further includes: in the case where the alarm type corresponds to the log of the system's requests to upstream and downstream systems and their responses, acquiring the upstream and downstream nodes of the node in which that log is located; and determining, according to the log data in the upstream and downstream nodes, at least one type of log data in those nodes that corresponds to the alarm type.
According to an embodiment of the present disclosure, continuously collecting different types of log data in the running state of the system at a certain time interval includes: recording the different types of log data in different files respectively.
According to an embodiment of the present disclosure, in the case where the log data is the website server state log, continuously collecting different types of log data in the running state of the system at a certain time interval includes: in response to a first query instruction, acquiring the process ID corresponding to the website server at a certain time interval; and in response to a second query instruction, collecting the memory occupancy information of that process ID at a certain time interval.
According to an embodiment of the present disclosure, in the case where the log data is the log of the system's requests to upstream and downstream systems and their responses, continuously collecting different types of log data in the running state of the system at a certain time interval includes: acquiring the print log of the system at a certain time interval; and determining, from the print log, the log data of the system's requests to upstream and downstream systems and their responses.
A second aspect of an embodiment of the present disclosure provides a system fault detection device, including: the acquisition module is used for continuously acquiring log data of different types in the running state of the system according to a certain time interval; the processing module is used for generating a link graph according to the log data, wherein the log data acquired in different time intervals correspond to different nodes in the link graph, and each node stores log data of different types in the time intervals; the first determining module is used for responding to an alarm signal of system alarm and determining a corresponding node from the link diagram according to time information of the alarm signal; the second determining module is used for determining at least one type of log data corresponding to the alarm type in the node according to the alarm type of the alarm signal; a third determining module, configured to determine abnormal log data corresponding to the alarm type from at least one type of log data; and the analysis module is used for carrying out log analysis according to the abnormal log data.
A third aspect of the disclosed embodiments provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the system fault detection method described above.
A fourth aspect of the disclosed embodiments also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described system fault detection method.
A fifth aspect of the disclosed embodiments also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described system fault detection method.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a system failure detection method, apparatus, device, medium, and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a system fault detection method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a diagram of collecting different types of logs to generate a link graph according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flowchart of collecting website server state log data, according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of collecting log data of the system's requests to upstream and downstream systems and their responses according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart for determining a corresponding node from a link graph based on an alarm signal, in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram for generating a link map from log data according to an embodiment of the disclosure;
FIG. 8 schematically illustrates a flow chart for determining a type of log data in a node from an alarm signal according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a system fault detection device according to an embodiment of the present disclosure; and
FIG. 10 schematically illustrates a block diagram of an electronic device adapted to implement a system fault detection method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Where an expression like "at least one of A, B and C, etc." is used, it should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical scheme of the invention, the user information involved (including but not limited to user personal information, user image information, and user equipment information such as position information) and the data involved (including but not limited to data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure, and application of such data comply with the relevant laws, regulations, and standards of the relevant countries and regions; necessary security measures are taken; public order and good customs are not violated; and a corresponding operation entry is provided for the user to grant or refuse authorization.
An embodiment of the present disclosure provides a system fault detection method, including: continuously collecting different types of log data in the running state of the system at a certain time interval; generating a link graph according to the log data, wherein the log data collected in different time intervals corresponds to different nodes in the link graph, and each node stores the different types of log data of its time interval; in response to an alarm signal of a system alarm, determining a corresponding node from the link graph according to time information of the alarm signal; determining, according to the alarm type of the alarm signal, at least one type of log data in the node that corresponds to the alarm type; determining abnormal log data corresponding to the alarm type from the at least one type of log data; and performing log analysis according to the abnormal log data.
Fig. 1 schematically illustrates an application scenario diagram of a system failure detection method, apparatus, device, medium and program product according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the system fault detection method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the system fault detection apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The system failure detection method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the system fault detection apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The system failure detection method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 8 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flow chart of a system failure detection method according to an embodiment of the present disclosure.
As shown in fig. 2, the system failure detection method of this embodiment includes operations S210 to S260.
In operation S210, different types of log data in the system operation state are continuously collected according to a certain time interval.
In some embodiments, the different types of log data include: a website server state log, a website server running log, a log of the website server's database requests and the database's responses, a system external request log, a log of the system's requests to upstream and downstream systems and their responses, and a system running error log.
Specifically, because the running state of a system in the production environment is uncertain, the system's running-state logs need to be collected comprehensively. The log data is collected by category as follows:
(1) Website server state log of the deployed system: every 5 seconds, the process number of the web server is looked up with the PS command, the memory occupied by the process run by the web server is collected with the top command, and the log is recorded in a first file, which may be recorded as run.
(2) Website server running log: the running log of the web server is collected every 5 seconds and recorded in a second file, which may be recorded as server.
(3) Log of the website server's database requests and the database's responses: this log is collected every 5 seconds and recorded in a third file, which may be recorded as database.
(4) System external request log: the system print log (system out log) is collected every 5 seconds, the external requests received by the system are parsed from it, and they are recorded in a fourth file, which may be recorded as request log.
(5) Log of the system's requests to upstream and downstream systems and their responses: the system print log (system out log) is collected every 5 seconds, the information on the system's requests to upstream and downstream systems and their responses is parsed from it, and it is recorded in a fifth file, which may be recorded as reqrep log.
(6) System running error log: the system print log is collected every 5 seconds, the error information reported by the system is parsed from it, and it is recorded in a sixth file, which may be recorded as error.
The PS command is a Linux system command for viewing processes; it lists the processes that are currently running. The top command is also a Linux system command; it displays information about the processes currently being executed by the system, including process ID, memory usage, and CPU usage. A process ID (PID) is a numerical value that the kernel of most operating systems uses to uniquely identify a process.
It should be noted that, the embodiment of the present disclosure does not specifically limit the duration of a certain time interval, and an operation and maintenance person may set different time intervals according to the detection requirement. For example, 5s, 10s, 15s, etc. may be employed in other embodiments. Different time intervals may also be set for different types of log data to be collected. For example, web server state information may be collected every 5s for a system deployment; for the operation log of the website server, the operation log can be collected every 10s, and operation staff can set different time intervals according to maintenance requirements.
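The per-type intervals described above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the type names, the interval values, the `<type>.log` file naming, and the helpers `due_types` and `collect_once` do not come from the disclosure.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical log types and intervals in seconds (assumed values).
INTERVALS: Dict[str, float] = {
    "web_server_state": 5.0,
    "web_server_run": 10.0,
}


def due_types(intervals: Dict[str, float],
              last_run: Dict[str, float],
              now: float) -> List[str]:
    """Return the log types whose collection interval has elapsed at `now`."""
    return [
        log_type
        for log_type, every in intervals.items()
        # A type that has never run is always due.
        if now - last_run.get(log_type, float("-inf")) >= every
    ]


def collect_once(collectors: Dict[str, Callable[[], str]],
                 intervals: Dict[str, float],
                 last_run: Dict[str, float],
                 now: float,
                 out_dir: Path) -> List[str]:
    """Run each due collector and append its output to that type's own file."""
    ran = due_types(intervals, last_run, now)
    for log_type in ran:
        # One file per log type; the file name pattern is an assumption.
        with open(out_dir / f"{log_type}.log", "a", encoding="utf-8") as fh:
            fh.write(f"{now:.0f}\t{collectors[log_type]()}\n")
        last_run[log_type] = now
    return ran
```

A caller would invoke `collect_once` from a loop or scheduler of its choice, for example once per second with `time.time()` as `now`.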
In operation S220, a link map is generated according to the log data, wherein the log data collected in different time intervals corresponds to different nodes in the link map, and each node stores different types of log data in the time intervals.
In some embodiments, the link graph is comprised of nodes and relationships. The relation is determined by time sequence, namely, log data acquired at a plurality of continuous time intervals are arranged according to the sequence of acquisition time. By analyzing the different types of log data collected in the above operation S210 and combining the different types of log data collected in the same time interval, a node of the link map is formed.
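One possible way to picture this structure is sketched below in Python. It is only an illustration under stated assumptions: the `Node` class, the `build_link_graph` helper, the `(timestamp, log_type, line)` record shape, and the use of a time-ordered list to express the relation between nodes are all hypothetical choices, not details given in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Node:
    start: float  # start of the time interval (seconds)
    end: float    # end of the time interval (seconds)
    # Log lines of this interval, grouped by log type.
    logs: Dict[str, List[str]] = field(default_factory=dict)


def build_link_graph(records: Iterable[Tuple[float, str, str]],
                     interval: float = 5.0) -> List[Node]:
    """Group (timestamp, log_type, line) records into one node per interval.

    The returned list is sorted by time, so consecutive entries stand for
    the time-order relation between nodes.
    """
    buckets: Dict[int, Node] = {}
    for ts, log_type, line in records:
        idx = int(ts // interval)
        node = buckets.setdefault(
            idx, Node(start=idx * interval, end=(idx + 1) * interval))
        node.logs.setdefault(log_type, []).append(line)
    return [buckets[i] for i in sorted(buckets)]
```

Intervals in which nothing was collected simply produce no node in this sketch; an implementation could equally emit empty nodes to keep the chain gap-free.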
In operation S230, in response to an alarm signal of a system alarm, a corresponding node is determined from the link graph according to time information of the alarm signal.
In some embodiments, the link graph is linked with the monitoring alarm. When the system raises an alarm, the link graph is traced back in time according to the time information of the alarm, so as to find the node in the link graph that corresponds to the moment of the alarm.
For example, suppose the time information of the system alarm is 15:30:28 on January 1, 2023, and the start and end times of the log data collected by a certain node in the link graph are 15:30:25 and 15:30:30 on January 1, 2023, respectively. Because the alarm time falls within that node's time period, that node is the node corresponding to the system alarm.
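The arithmetic of this example can be checked with a short Python sketch. It assumes, purely for illustration, that intervals are 5 seconds long and aligned to midnight; the helper name `node_interval` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def node_interval(alarm_time: datetime,
                  interval_s: int = 5) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the interval that contains alarm_time.

    Assumes intervals are aligned to midnight of the same day.
    """
    midnight = alarm_time.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((alarm_time - midnight).total_seconds())
    start = midnight + timedelta(seconds=elapsed - elapsed % interval_s)
    return start, start + timedelta(seconds=interval_s)


alarm = datetime(2023, 1, 1, 15, 30, 28)
start, end = node_interval(alarm)
print(start.time(), end.time())  # 15:30:25 15:30:30
```

Whether the end of an interval belongs to the current node or the next one is not stated in the description; the sketch does not need to decide this for the example above.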
In operation S240, at least one type of log data corresponding to the alarm type in the node is determined according to the alarm type of the alarm signal.
In some embodiments, since different types of log data are recorded in one node, a log data file corresponding to the alarm signal can be found according to the alarm type of the alarm signal.
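A lookup from alarm type to log types might be sketched as follows. The alarm type names, the log type names, and the mapping itself are hypothetical; they loosely follow the cases described for operation S250 and are not a mapping defined by the disclosure.

```python
from typing import Dict, List

# Hypothetical mapping from alarm type to the log types worth inspecting.
ALARM_TO_LOG_TYPES: Dict[str, List[str]] = {
    "web_server": ["web_server_state"],
    "slow_or_no_response": [
        "web_server_run", "database", "external_request", "upstream_downstream",
    ],
    "database": ["database", "external_request"],
}


def select_logs(node_logs: Dict[str, List[str]],
                alarm_type: str) -> Dict[str, List[str]]:
    """Pick from one node only the log types relevant to the alarm type.

    Unknown alarm types fall back to every log type stored in the node.
    """
    wanted = ALARM_TO_LOG_TYPES.get(alarm_type, list(node_logs))
    return {t: node_logs[t] for t in wanted if t in node_logs}
```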
In operation S250, abnormal log data corresponding to the alarm type is determined from among at least one type of log data.
For example, if the alarm is a website server alarm, the website server running-state log is queried in real time and the result is recorded. If a request to the system receives no response or the system responds slowly, the website server running log is queried in real time; the state information of the system database is queried through show processlist to judge whether the database is abnormal; the log of requests received and processed by the system is analyzed for reported errors; it is analyzed whether the request involves upstream or downstream systems and, if so, the corresponding request is examined; and the log is recorded. If the database corresponding to the system has alarm information, the state information of the system database is queried through show processlist, whether the database is abnormal is judged, the request information received by the system is analyzed, and the log is recorded. If log analysis shows that the fault is caused by an upstream or downstream system not responding, the link graph nodes related to the upstream and downstream systems are located, the corresponding log data in each of those nodes is queried again according to the alarm type of the alarm signal for analysis, and the log is recorded.
In other embodiments, the different types of log data in each node may be queried one by one, and the abnormal entries in each type of log data recorded. Likewise, for the multiple nodes found by tracing back the link graph, the different types of log data in each node are queried one by one in a loop.
Here, show processlist is a statement that displays the threads currently running in the database.
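The looped query over nodes and log types could look like the following Python sketch. The description does not say how an entry is recognized as abnormal, so the keyword pattern used here is purely an assumption, as are the function name and the representation of a node as a dictionary of log lines.

```python
import re
from typing import Dict, List, Tuple

# Assumed heuristic: a line is abnormal if it mentions one of these words.
ABNORMAL = re.compile(r"error|exception|timeout", re.IGNORECASE)


def find_abnormal(nodes: List[Dict[str, List[str]]],
                  log_types: List[str]) -> List[Tuple[int, str, str]]:
    """Scan each node and each requested log type, in order.

    Returns (node_index, log_type, line) for every line matching ABNORMAL.
    """
    hits: List[Tuple[int, str, str]] = []
    for index, node in enumerate(nodes):
        for log_type in log_types:
            for line in node.get(log_type, []):
                if ABNORMAL.search(line):
                    hits.append((index, log_type, line))
    return hits
```

The resulting list is what would be handed on for log analysis in operation S260.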
In operation S260, log analysis is performed according to the abnormal log data.
In some embodiments, the abnormal log data is sent to first-line operation and maintenance personnel for analysis to obtain the system fault result. This allows first-line operation and maintenance personnel to quickly locate the cause of the system fault associated with the monitoring alarm and to resolve the problem quickly.
It can be understood that logs can be located rapidly based on the link graph of the system's real-time running state, so that system self-inspection can be realized and operation and maintenance personnel can be helped to quickly analyze the cause of a system fault and to resolve or avoid it in time, thereby ensuring high availability of the system. Collecting six types of log data covers the system's running state comprehensively, which improves the accuracy of system error analysis.
Fig. 3 schematically illustrates a diagram of collecting different types of logs to generate a link graph according to an embodiment of the present disclosure.
As shown in fig. 3, continuously collecting different types of log data in the running state of the system at a certain time interval in this embodiment includes:
recording the different types of log data in different files respectively.
In some embodiments, the collected website server state log is stored in a first file; the website server running log is stored in a second file; the log of the website server's database requests and the database's responses is stored in a third file; the system external request log is stored in a fourth file; the log of the system's requests to upstream and downstream systems and their responses is stored in a fifth file; and the system running error log is stored in a sixth file.
It will be appreciated that after collecting the different types of log data, different files are used to record the different types of log data, so as to facilitate the segmentation of the different types of log data. And the subsequent independent analysis is convenient according to the log data in each file, and the analysis efficiency is improved.
Fig. 4 schematically illustrates a flowchart of collecting website server state log data according to an embodiment of the present disclosure.
As shown in fig. 4, in the case where the log data is a web server status log, continuously collecting different types of log data in the system running state according to a certain time interval in this embodiment includes operations S410 to S420.
In operation S410, in response to the first query instruction, a process ID corresponding to the web server is acquired according to a certain time interval.
In operation S420, in response to the second query instruction, process ID memory occupancy information is collected according to a certain time interval.
Specifically, the first query instruction is the PS command and the second query instruction is the top command, as described in operation S210 above. That is, for the website server state log of the deployed system: every 5 seconds, the process number of the web server is looked up with the PS command, the memory occupied by the process run by the web server is collected with the top command, and the log is recorded in the first file.
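Operations S410 and S420 could be sketched in Python as below. This is only an illustration for a Linux host with the procps tools installed: the exact command-line options, the process name passed in, and the helper names `pids_for` and `snapshot` are assumptions, since the description names the commands but not their arguments.

```python
import subprocess
from typing import List


def pids_for(ps_output: str, process_name: str) -> List[int]:
    """Extract PIDs of `process_name` from `ps -e -o pid=,comm=` output."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == process_name:
            pids.append(int(parts[0]))
    return pids


def snapshot(process_name: str) -> str:
    """Take one ps + top sample for the given process (Linux procps assumed)."""
    ps_out = subprocess.run(
        ["ps", "-e", "-o", "pid=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = pids_for(ps_out, process_name)
    if not pids:
        return ""
    # top in batch mode, one iteration, restricted to the PIDs found above.
    return subprocess.run(
        ["top", "-b", "-n", "1", "-p", ",".join(map(str, pids))],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling `snapshot` once per interval and appending the result to the first file would reproduce the sampling described above.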
Fig. 5 schematically illustrates a flow chart of collecting log data of the system's requests to upstream and downstream systems and their responses according to an embodiment of the present disclosure.
As shown in fig. 5, in the case where the log data is the log of the system's requests to upstream and downstream systems and their responses, continuously collecting different types of log data in the running state of the system at a certain time interval in this embodiment includes operations S510 to S520.
In operation S510, a print log of the system is acquired according to a certain time interval.
In operation S520, log data of the system request upstream and downstream systems and responses is determined from the print log.
Specifically, for a system external request log: the system print log systemut log analysis is collected every 5 seconds and the external request log is received and recorded in a fourth file.
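A minimal sketch of operations S510 and S520 is given below: on each round, only the lines appended to the print log since the previous round are read, and the lines of interest are selected by marker. The marker strings are purely hypothetical, since the disclosure does not give the print-log format, and the helper names are likewise assumptions.

```python
from typing import Iterable, List, Tuple

# Hypothetical markers; the real print-log format is not given in the disclosure.
UPSTREAM_DOWNSTREAM_MARKERS: Tuple[str, ...] = ("[REQ-OUT]", "[RESP-IN]")


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Read lines appended to `path` since byte `offset` (S510).

    Returns the complete lines and the offset to use on the next round.
    """
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Keep only complete lines so a half-written record is picked up next round.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def select_lines(lines: Iterable[str], markers: Tuple[str, ...]) -> List[str]:
    """Keep the lines that contain any of the given markers (S520)."""
    return [ln for ln in lines if any(m in ln for m in markers)]
```

A collector would call `read_new_lines` every 5 seconds, pass the result through `select_lines`, and append the selected lines to the file for that log type.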
Fig. 6 schematically illustrates a flow chart for determining a corresponding node from a link graph based on an alarm signal according to an embodiment of the disclosure.
As shown in fig. 6, determining, in response to an alarm signal of a system alarm, the corresponding node from the link map according to the time information of the alarm signal in this embodiment includes operations S610 to S620.
In operation S610, the node corresponding to the time information is determined from the link map according to the time information, the time information falling within the time interval corresponding to that node.
In operation S620, the node preceding the time information is determined from the link map according to the time information.
In some embodiments, the log data recorded by the corresponding node generally needs to be queried according to the time information of the system alarm; sometimes the link map must also be traced back to find the corresponding log data in a preceding node for correlation analysis. Therefore, when an alarm occurs, the node that is current at the time of the alarm is located in the link map according to the time information of the alarm signal, and the link map is traced back to determine the node or nodes that precede the alarm.
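The lookup in operations S610 and S620 can be sketched with a binary search over the node start times. This sketch assumes fixed-length collection intervals and a link map stored as an ascending list of start times; `locate_nodes`, `node_starts`, and `interval_s` are hypothetical names.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def locate_nodes(node_starts: List[float], interval_s: float,
                 alarm_time: float) -> Tuple[Optional[int], Optional[int]]:
    """Return (index of the node covering alarm_time, index of the preceding node).

    node_starts holds the start time of each node's interval in ascending order;
    node i covers [node_starts[i], node_starts[i] + interval_s).
    """
    i = bisect_right(node_starts, alarm_time) - 1
    if i < 0 or alarm_time >= node_starts[i] + interval_s:
        return None, None  # the alarm falls outside every recorded interval
    previous = i - 1 if i > 0 else None
    return i, previous
```

For example, with nodes starting at 0, 5, and 10 seconds and a 5-second interval, an alarm at time 7 resolves to node 1, with node 0 as the preceding node to trace back to.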
Fig. 7 schematically illustrates a flow chart of generating a link map from log data according to an embodiment of the disclosure.
As shown in fig. 7, generating the link map according to the log data in this embodiment, where log data collected in different time intervals corresponds to different nodes in the link map and each node stores the different types of log data of its time interval, further includes operations S710 to S720.
In operation S710, a plurality of nodes are sequentially arranged in a link map according to time sequence.
In operation S720, identification information of each node is generated according to different types of log data.
The identification information includes a tag and an attribute, where the tag represents the running state of the system and the attribute represents the different types of log data.
In some embodiments, the node information of the link map consists of attributes and a tag. The tag is runstat (running state), and the attributes are: pid (process number), mem (occupied memory state), databaseinfo (web server database request and database response information), requestinfo (external request information received by the system), reqrepinfo (system requests to upstream and downstream systems and the response information), and error info (system error information). Each node contains the different types of log data collected in the time period corresponding to the node, and the attribute information is generated from the different types of log data in each node.
It can be understood that the nodes in the link map are arranged sequentially in the time order in which the log data was collected, and that each node has corresponding identification information, so that the corresponding node can be located quickly according to the error reporting signal.
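One possible in-memory form of such a link map is sketched below, assuming fixed-length collection intervals. The `Node` class, the shape of the input records, and the `build_link_map` function are assumptions for illustration; the attribute names follow the embodiment, except that the error attribute is written here as the single word `errorinfo`, which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Attribute names follow the embodiment; "errorinfo" as one word is an assumption.
ATTRIBUTES = ("pid", "mem", "databaseinfo", "requestinfo", "reqrepinfo", "errorinfo")


@dataclass
class Node:
    start: float          # start of the node's collection interval
    end: float            # end of the interval (exclusive)
    tag: str = "runstat"  # tag: the system's running state
    attributes: Dict[str, List[str]] = field(default_factory=dict)


def build_link_map(records: List[dict], interval_s: float) -> List[Node]:
    """Group records into time-ordered nodes, one per collection interval (S710, S720).

    Each record is a dict with "time", "attribute", and "data" keys (hypothetical shape).
    """
    buckets: Dict[int, Node] = {}
    for rec in records:
        if rec["attribute"] not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {rec['attribute']}")
        slot = int(rec["time"] // interval_s)
        node = buckets.setdefault(
            slot, Node(start=slot * interval_s, end=(slot + 1) * interval_s))
        node.attributes.setdefault(rec["attribute"], []).append(rec["data"])
    # Sorting by slot yields the time-ordered arrangement of operation S710.
    return [buckets[slot] for slot in sorted(buckets)]
```

Keeping the attributes in a dictionary keyed by log type means a later step can pull out only the types relevant to a given alarm.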
Fig. 8 schematically illustrates a flow chart for determining a type of log data in a node from an alarm signal according to an embodiment of the present disclosure.
As shown in fig. 8, determining, according to the alarm type of the alarm signal, at least one type of log data corresponding to the alarm type in the node in this embodiment further includes operations S810 to S840.
In operation S810, it is determined whether the alarm type corresponds to the log type of system requests to upstream and downstream systems and their responses.
In operation S820, in case it is determined that the alarm type corresponds to that log type, the upstream and downstream nodes of the node holding the upstream and downstream request and response log are acquired.
In operation S830, at least one type of log data corresponding to the alarm type in the upstream and downstream nodes is determined according to the log data in the upstream and downstream nodes.
In some embodiments, where the alarm type corresponds to the log type of system requests to upstream and downstream systems and their responses, the upstream and downstream nodes need to be analyzed together: the log data corresponding to the alarm type is found in the upstream and downstream nodes and used for correlation analysis.
In operation S840, in case it is determined that the alarm type does not correspond to that log type, the upstream and downstream nodes need not be traced back.
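The branch in operations S810 to S840 can be sketched as follows. The mapping from alarm types to node attributes is hypothetical, as the disclosure does not enumerate alarm types, and treating the "upstream and downstream nodes" as the adjacent nodes in the time-ordered link map is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from alarm type to the node attributes worth inspecting.
ALARM_TO_ATTRIBUTES: Dict[str, Tuple[str, ...]] = {
    "memory": ("pid", "mem"),
    "database": ("databaseinfo",),
    "upstream_downstream": ("reqrepinfo",),
}
# Alarm types for which neighbouring nodes are analyzed together with the node.
TRACE_ALARM_TYPES = {"upstream_downstream"}


def select_log_data(link_map: List[dict], index: int,
                    alarm_type: str) -> Dict[int, Dict[str, list]]:
    """Return {node index: {attribute: log data}} for the alarm type.

    Each node is a dict of attribute name -> list of log records.
    """
    wanted = ALARM_TO_ATTRIBUTES.get(alarm_type, ())
    indices = [index]                        # S840: no tracing by default
    if alarm_type in TRACE_ALARM_TYPES:      # S810 / S820: include the neighbours
        indices = [i for i in (index - 1, index, index + 1)
                   if 0 <= i < len(link_map)]
    selected = {}
    for i in indices:                        # S830: pick the matching log data
        hits = {a: link_map[i][a] for a in wanted if a in link_map[i]}
        if hits:
            selected[i] = hits
    return selected
```

The returned dictionary is keyed by node index, so a correlation step can compare the same attribute across the alarm node and its neighbours.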
Based on the system fault detection method, the disclosure also provides a system fault detection device. The device will be described in detail below in connection with fig. 9.
Fig. 9 schematically shows a block diagram of a system failure detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 9, the system fault detection apparatus 800 of this embodiment includes an acquisition module 810, a processing module 820, a first determination module 830, a second determination module 840, a third determination module 850, and an analysis module 860.
The acquisition module 810 is configured to continuously collect log data of different types in the system operating state according to a certain time interval. In an embodiment, the acquisition module 810 may be configured to perform the operation S210 described above, which is not described herein.
The processing module 820 is configured to generate a link map according to the log data, where the log data collected in different time intervals corresponds to different nodes in the link map, and each node stores log data of different types in the time interval. In an embodiment, the processing module 820 may be configured to perform the operation S220 described above, which is not described herein.
The first determining module 830 is configured to determine, in response to an alarm signal of a system alarm, a corresponding node from the link map according to time information of the alarm signal. In an embodiment, the first determining module 830 may be configured to perform the operation S230 described above, which is not described herein.
The second determining module 840 is configured to determine, according to the alarm type of the alarm signal, at least one type of log data corresponding to the alarm type in the node. In an embodiment, the second determining module 840 may be configured to perform the operation S240 described above, which is not described herein.
The third determining module 850 is configured to determine abnormal log data corresponding to the alarm type from at least one type of log data. In an embodiment, the third determining module 850 may be configured to perform the operation S250 described above, which is not described herein.
The analysis module 860 is configured to perform log analysis according to the abnormal log data. In an embodiment, the analysis module 860 may be configured to perform the operation S260 described above, which is not described herein.
According to embodiments of the present disclosure, any number of the acquisition module 810, the processing module 820, the first determination module 830, the second determination module 840, the third determination module 850, and the analysis module 860 may be combined into one module, or any one of them may be split into multiple modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 810, the processing module 820, the first determination module 830, the second determination module 840, the third determination module 850, and the analysis module 860 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable way of integrating or packaging a circuit, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the acquisition module 810, the processing module 820, the first determination module 830, the second determination module 840, the third determination module 850, and the analysis module 860 may be at least partially implemented as computer program modules, which, when executed, may perform the respective functions.
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement a system failure detection method according to an embodiment of the disclosure.
As shown in fig. 10, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 901 may also include on-board memory for caching purposes. Processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, the input/output (I/O) interface 905 also being connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 910 so that a computer program read out therefrom is installed into the storage section 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the system fault detection method provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, via communication portion 909, and/or installed from removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (13)

1. A system fault detection method, comprising:
continuously collecting log data of different types in the running state of the system according to a certain time interval;
generating a link map according to the log data, wherein the log data collected in different time intervals corresponds to different nodes in the link map, and each node stores the log data of different types in its time interval;
in response to an alarm signal of a system alarm, determining a corresponding node from the link map according to time information of the alarm signal;
determining at least one type of log data corresponding to the alarm type in the node according to the alarm type of the alarm signal;
determining abnormal log data corresponding to the alarm type from the log data of the at least one type;
and performing log analysis according to the abnormal log data.
2. The method of claim 1, wherein the determining, in response to the alarm signal of the system alarm, the corresponding node from the link map according to the time information of the alarm signal comprises:
determining a node corresponding to the time information from the link map according to the time information, wherein the time information is located in the time interval corresponding to the node;
and determining a node before the time information from the link map according to the time information.
3. The method of claim 1, wherein the generating a link map according to the log data further comprises:
sequentially arranging a plurality of nodes in the link map in time order;
and generating identification information of each node according to the log data of different types.
4. The method of claim 3, wherein the identification information comprises a tag and an attribute, the tag being used for representing the running state of the system and the attribute being used for representing the log data of different types.
5. The method of claim 1, wherein the different types of log data comprise: a website server state log, a website server running log, a log of website server requests to a database and database responses, a system external request log, a log of system requests to upstream and downstream systems and responses, and a system running error report log.
6. The method of claim 5, wherein the determining, according to the alarm type of the alarm signal, at least one type of log data corresponding to the alarm type in the node further comprises:
in the case that the alarm type corresponds to the type of the log of system requests to upstream and downstream systems and responses, acquiring upstream and downstream nodes of the node where that log is located;
and determining at least one type of log data corresponding to the alarm type in the upstream and downstream nodes according to the log data in the upstream and downstream nodes.
7. The method of claim 1, wherein the continuously collecting log data of different types in the system operating state according to a certain time interval comprises:
and recording the log data of different types in different files respectively.
8. The method of claim 5, wherein, in the case that the log data is the website server state log, the continuously collecting log data of different types in the running state of the system according to a certain time interval comprises:
in response to a first query instruction, acquiring a process ID corresponding to the website server according to a certain time interval;
and in response to a second query instruction, collecting memory occupation information of the process ID according to a certain time interval.
9. The method of claim 1, wherein, in the case that the log data is a log of system requests to upstream and downstream systems and responses, the continuously collecting log data of different types in the running state of the system according to a certain time interval comprises:
acquiring a print log of the system according to a certain time interval;
and determining, according to the print log, the log data of system requests to upstream and downstream systems and responses.
10. A system fault detection device, comprising:
an acquisition module, configured to continuously collect log data of different types in the running state of the system according to a certain time interval;
a processing module, configured to generate a link map according to the log data, wherein the log data collected in different time intervals corresponds to different nodes in the link map, and each node stores the log data of different types in its time interval;
a first determining module, configured to determine, in response to an alarm signal of a system alarm, a corresponding node from the link map according to time information of the alarm signal;
a second determining module, configured to determine, according to the alarm type of the alarm signal, at least one type of log data corresponding to the alarm type in the node;
a third determining module, configured to determine abnormal log data corresponding to the alarm type from the at least one type of log data; and
an analysis module, configured to perform log analysis according to the abnormal log data.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-9.
12. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202311029019.4A 2023-08-15 2023-08-15 System fault detection method, apparatus, device, medium and program product Pending CN117130812A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029019.4A CN117130812A (en) 2023-08-15 2023-08-15 System fault detection method, apparatus, device, medium and program product

Publications (1)

Publication Number Publication Date
CN117130812A true CN117130812A (en) 2023-11-28

Family

ID=88855620

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination