WO2023050967A1 - System anomaly detection and processing method and device - Google Patents

System anomaly detection and processing method and device

Info

Publication number
WO2023050967A1
Authority
WO
WIPO (PCT)
Prior art keywords: log, subsystems, logs, real, detection
Application number
PCT/CN2022/104378
Other languages
English (en)
French (fr)
Inventor
姜磊
刘学生
徐代刚
李小进
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2023050967A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Definitions

  • Embodiments of the present disclosure relate to the communication field, and in particular, relate to a system abnormality detection and processing method and device.
  • In operation and maintenance assurance in the telecommunications industry, anomaly detection and fault localization are a very important link.
  • Beyond system stability, operators pay more attention to whether functions remain continuously available, for example whether the resource data and performance data reported to the operator's OSS network management system are missing, and whether reported network-element alarm data is excessively delayed.
  • Log analysis is a very important means of assurance: if a device, or the software running on it, fails, then regardless of whether an alarm is raised, log analysis is critical and necessary for locating the root cause of the anomaly and resolving the fault.
  • Figure 1 is a schematic diagram of the data flow of a telecommunications assurance network management system in the related art. The southbound side receives business data, such as alarm, performance, and resource data, from multiple lower-level Element Management Systems (EMS); after corresponding processing and conversion, the data is reported northbound to the upper-level operator's Operation Support Systems (OSS) network management for centralized processing. Owing to business complexity, this system is composed of multiple subsystems, including an alarm subsystem, a performance subsystem, a resource subsystem, the PG database, and the Kafka service. The alarm, performance, and resource subsystems are business subsystems, while the PG database and Kafka service, as well as FTP and NTP (not shown in the figure), are basic services.
  • Embodiments of the present disclosure provide a system anomaly detection and processing method and device to at least solve the problems in the related art that the same anomaly detection method cannot be adapted to different subsystems and cannot effectively eliminate the abnormality of the entire system.
  • According to an embodiment of the present disclosure, a system anomaly detection and processing method is provided, including: acquiring real-time data of multiple subsystems in the system within a preset time period; classifying the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems; performing anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems; and performing anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
  • a system anomaly detection and processing device including:
  • the first acquisition module is configured to acquire real-time data of multiple subsystems in the system within a preset time period
  • the first classification module is configured to classify the real-time logs in the real-time data of the multiple subsystems respectively, and obtain the classification results of the real-time logs of the multiple subsystems;
  • the first abnormality detection module is configured to perform abnormality detection on the log according to the abnormality detection mode corresponding to the classification result, and obtain detection results of multiple subsystems;
  • the second abnormality detection module is configured to perform abnormality detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
  • a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the above method embodiments when run.
  • an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • real-time data of multiple subsystems in the system is acquired within a preset time period; the real-time logs in the real-time data of the multiple subsystems are classified respectively to obtain classification results of the real-time logs of the multiple subsystems; anomaly detection is performed on the logs according to the anomaly detection mode corresponding to each classification result to obtain detection results of the multiple subsystems; and anomaly detection processing is performed on the system according to the detection results and the real-time data of the multiple subsystems. This can solve the problem in the related art that one and the same anomaly detection mode cannot be adapted to different subsystems and cannot effectively troubleshoot anomalies of the entire system.
  • the logs in each subsystem are classified, different logs are analyzed with different anomaly detection methods, and unified anomaly detection processing is performed on the system based on the detection results and real-time data of each subsystem, which helps locate anomalies and the root causes of faults.
  • Fig. 1 is a schematic diagram of the data flow of the telecommunications security network management system in the related art
  • FIG. 2 is a block diagram of a hardware structure of a mobile terminal of a system abnormality detection and processing method according to an embodiment of the present disclosure
  • FIG. 3 is a flow chart of a system abnormality detection and processing method according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of an anomaly detection system architecture according to an embodiment of the present disclosure.
  • Fig. 5 is a schematic diagram of a structured log printed by receiving an alarm in the southbound direction according to an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of printing a structured log of an alarm processed by kafka according to an embodiment of the present disclosure
  • Fig. 7 is a schematic diagram of a structured log sent by the northbound module to the OSS for printing an alarm according to an embodiment of the present disclosure
  • Fig. 8 is a schematic diagram of a semi-structured log of intermediate processing alarm printing according to an embodiment of the present disclosure
  • FIG. 9 is a schematic diagram of a log aggregation process according to an embodiment of the present disclosure.
  • Fig. 10 is a schematic diagram of a log exception flag according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of a two-stage anomaly detection process according to an embodiment of the present disclosure.
  • Fig. 12 is a block diagram of a system abnormality detection processing device according to another embodiment of the present disclosure.
  • FIG. 2 is a block diagram of the hardware structure of the mobile terminal according to an embodiment of the present disclosure.
  • the mobile terminal may include one or more processors 102 (only one is shown in FIG. 2; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108.
  • the structure shown in FIG. 2 is only for illustration, and it does not limit the structure of the above mobile terminal.
  • the mobile terminal may also include more or fewer components than those shown in FIG. 2, or have a configuration different from that shown in FIG. 2.
  • the memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the system anomaly detection and processing method in the embodiments of the present disclosure; the processor 102 runs the computer program stored in the memory 104, thereby executing various functional applications and data processing, i.e., implementing the above method.
  • the memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include memories remotely located relative to the processor 102, and these remote memories may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the transmission device 106 is used to receive or transmit data via a network.
  • the specific example of the above network may include a wireless network provided by the communication provider of the mobile terminal.
  • the transmission device 106 includes a network interface controller (NIC for short), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, referred to as RF) module, which is used to communicate with the Internet in a wireless manner.
  • RF Radio Frequency
  • a system anomaly detection and processing method running on the above mobile terminal or network architecture is provided, applied to a terminal that accesses the current master node (MN) cell and the current secondary node (SN) cell of a source area through dual connectivity (Dual Connection, DC).
  • FIG. 3 is a flowchart of a system anomaly detection and processing method according to an embodiment of the present disclosure; as shown in FIG. 3, the flow includes at least the following steps:
  • Step S302 acquiring real-time data of multiple subsystems in the system within a preset time period
  • Step S304 respectively classifying the real-time logs in the real-time data of multiple subsystems to obtain the classification results of the real-time logs of multiple subsystems;
  • the above step S304 may specifically include: classifying the real-time logs of multiple subsystems according to log sources into: operating system logs, basic service logs, and application logs.
  • the real-time data in this embodiment at least includes real-time logs, scaling conditions of microservices, scope of operating resources of microservices, and call consumption time between microservices.
  • Step S306 according to the abnormal detection method corresponding to the classification result, the log is detected abnormally, and the detection results of multiple subsystems are obtained;
  • the above step S306 may specifically include: performing the following operations on the real-time logs of each of the multiple subsystems to obtain the detection results of the multiple subsystems, wherein the real-time log currently being processed is called the current log:
  • when the current log is an operating system log or a basic service log, the detection result of the current log is determined through the key fields of the current log;
  • when the current log is an application log, the current log is input into the pre-trained classification detection model, and the detection result of the current log output by the classification detection model is obtained.
  • step S308 abnormality detection processing is performed on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
  • the above step S308 may specifically include: input the detection results of multiple subsystems and the real-time data of multiple subsystems into the pre-trained target anomaly detection model, and obtain the target anomaly detection result of the system output by the target anomaly detection model .
  • before the current log is input into the pre-trained classification detection model to obtain the detection result of the current log output by the model, the above method further includes: determining that the current log is a structured log, specifically, judging whether the log is an unstructured log or a semi-structured log and, when it is, converting it into a structured log; vectorizing the current log to obtain a log vector; and aggregating the log vectors according to the key fields of the log vectors to obtain multiple call chains of the current log.
  • the above method further includes: acquiring historical data of a predetermined number of the multiple subsystems and the corresponding system anomaly detection results, wherein the historical data includes at least historical logs, the scaling state of the microservices, the range in which the microservices' running resources fall, and the call consumption time between microservices; classifying the historical logs in the historical data of the predetermined number of subsystems respectively to obtain classification results of the historical logs of the multiple subsystems; performing anomaly detection on the historical logs according to the anomaly detection modes corresponding to the classification results to obtain the detection results of the predetermined number of subsystems; and training the initial anomaly detection model according to the detection results of the predetermined number of subsystems, the data of the predetermined number of subsystems, and the corresponding system anomaly detection results, to obtain the trained target anomaly detection model.
  • the detection results of the predetermined number of subsystems and their historical data are the input of the initial anomaly detection model, and the target anomaly detection result of the system output by the trained target anomaly detection model and the actual corresponding system anomaly detection result satisfy the preset objective function.
  • the method further includes: when the system anomaly detection result indicates that an anomaly exists, performing fault root-cause localization for the anomaly according to the detection results of the multiple subsystems.
  • multiple models are obtained through data mining and machine learning for historical logs in the system.
  • different logs are vectorized according to different models and analyzed with deep learning according to their respective models, and the analysis is then unified and centralized to assist in locating anomalies and root causes.
  • Fig. 4 is a schematic diagram of an anomaly detection system architecture according to an embodiment of the present disclosure. As shown in Fig. 4, this embodiment includes: a log classifier, a log converter, a tool allocator, an anomaly detector, and a machine learner. Initially, the network management logs and modules, the related exception knowledge base, and the service call chain are all initialized. The flow specifically includes:
  • Step S401 the log classifier performs log classification on the historical log data
  • Step S402 the log converter converts the historical log into a structured log
  • Step S403 log vectorization and aggregation
  • Step S404 the machine learner trains the anomaly detection model;
  • Step S405 evaluate whether model training is complete; if so, execute step S407, otherwise execute step S406;
  • Step S406 the machine learner tunes parameters (i.e., adjusts the parameters of the anomaly detection model);
  • Step S407 release the model (i.e., the trained anomaly detection model);
  • Step S408 the log classifier performs log classification on the real-time log data
  • Step S409 the log converter converts the real-time log into a structured log
  • Step S410 log vectorization and aggregation
  • Step S411 the tool allocator acquires corresponding anomaly detectors for different logs
  • Step S412 the anomaly detector detects system anomalies through the anomaly detection model
  • Step S413 assist in locating the root cause.
  • the reasoning side makes judgments based on the learned model, and then assists in root cause location.
  • the log classifier classifies the logs; the log converter converts unstructured and semi-structured logs into structured logs and implements vectorization and clustering; the tool allocator assigns different detection tools or methods to different logs; the machine learner performs machine learning training; and the anomaly detector implements two-stage anomaly detection and preliminary localization.
  • Step 1 Establish an initial knowledge base
  • the knowledge base is divided into: system call chain, exception knowledge base and structured log template library.
  • the system call chain includes the call relationship and propagation relationship between all microservices, as well as the application microservice name, process name, thread list, and log file name.
  • the exception knowledge base includes exception dictionaries and exception hyperparameters and failure stack patterns.
  • the exception dictionary includes common system errors such as "FATAL", "Error", and "Out Of Memory"; the exception hyperparameters cover Java Virtual Machine (JVM) garbage collection (GC), including FULL GC events that stall the application and young-generation GC exceeding a preset time, such as 2 seconds.
  • a fault stack is not necessarily an anomaly, but it is very helpful for locating anomalies, so stack logs need to be recognized; certain fields or patterns are therefore needed to identify fault stacks, for example training an NB (Naive Bayes) model to recognize log paragraphs containing the words "Caused by:" and "at", as sketched below.
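As a rough, non-authoritative illustration of this idea, the following Python sketch trains such an NB recognizer with scikit-learn; the tiny labeled corpus and the field names are invented for the example and are not part of the patent text.

```python
# A minimal sketch of the fault-stack recognizer described above, assuming
# scikit-learn is available; the training paragraphs here are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled paragraphs: 1 = fault stack, 0 = ordinary log text.
paragraphs = [
    "Exception in thread main java.lang.NullPointerException "
    "at com.example.Foo.bar(Foo.java:42) Caused by: java.io.IOException",
    "at com.example.Dao.query(Dao.java:88) Caused by: java.sql.SQLException",
    "INFO alarm 1234 forwarded northbound to OSS in 35 ms",
    "DEBUG heartbeat ok, queue depth 3",
]
labels = [1, 1, 0, 0]

# Word counts over tokens such as "Caused", "by:", "at" feed the NB model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(paragraphs, labels)

candidate = "at com.example.Svc.run(Svc.java:10) Caused by: java.lang.Error"
print(model.predict([candidate]))  # [1] -> looks like a fault stack
```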
  • Structured log template library: template definitions for subsequently recognizing structured, unstructured, and semi-structured logs, such as JSON structured logs and logs with clear field definitions (such as operating system logs, GC logs, and application logs output by calling Log4J or Logback).
  • Step 2 The log classifier classifies the logs
  • the log classifier classifies real-time logs or training data (historical log data), and classifies them according to the log source, such as operating system logs, basic service logs such as database or kafka logs, or application logs.
  • Step 3 Log vectorization and log aggregation
  • a structured log, narrowly defined, is a log in JSON (JavaScript Object Notation) format; broadly understood, it is any log whose content can be extracted according to a certain template. The broad sense is used here, referring generally to logs that can be extracted according to a template, such as the logs shown in Figures 5, 6, and 7.
  • logs of the underlying operating system or basic services are structured logs, while the upper-layer application logs can generally be divided into semi-structured logs and unstructured logs.
  • Unstructured logs arise when an application prints logs without calling a standard log library, for example a Java application that does not use log modules such as log4j or Logback but prints its own debugging output; with the standardization of programs and logging, such logs are now very rare and can be ignored.
  • FIG. 8 is a schematic diagram of a semi-structured log printed by an intermediate processing alarm in an embodiment of the present disclosure, as shown in Figure 8 , the log shows the unstructured information in the semi-structured log.
  • Logs are aggregated after vectorization.
  • Aggregation means grouping logs of the same nature. For example, an alarm received southbound from a lower-level network element is converted and sent northbound to the OSS; besides the southbound and northbound logs, Kafka and the database record logs, and the intermediate alarm processing module may also record logs. If these logs are not aggregated, they are mixed with other logs (such as those of another alarm, or of performance or resource data); after aggregation, the entire processing flow of this piece of data can be seen clearly.
  • Aggregation is performed on the key dimension of the log vector. Taking alarm logs as an example, aggregation uses the unique-identifier dimension of the alarm vector (such as alarm title + alarm occurrence time + alarm occurrence location); the dimension recording the log time then shows the timing of this alarm across the different processing stages.
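A minimal sketch of this key-dimension aggregation is shown below, assuming each log has already been vectorized into a dict; the field names and sample records are illustrative, not taken from the disclosure.

```python
# Group vectorized logs by their unique-identifier key dimension so that
# one alarm's path through every processing stage becomes visible.
from collections import defaultdict

logs = [
    {"time": "12:00:01.100", "module": "southbound", "title": "LINK_DOWN",
     "occurred": "11:59:58", "location": "NE-7"},
    {"time": "12:00:01.350", "module": "kafka",      "title": "LINK_DOWN",
     "occurred": "11:59:58", "location": "NE-7"},
    {"time": "12:00:02.010", "module": "northbound", "title": "LINK_DOWN",
     "occurred": "11:59:58", "location": "NE-7"},
]

def key_dimension(log):
    # Unique identifier: alarm title + occurrence time + occurrence location.
    return (log["title"], log["occurred"], log["location"])

chains = defaultdict(list)
for log in logs:
    chains[key_dimension(log)].append(log)

for key, chain in chains.items():
    # Sorting by record time shows the alarm's timing in each process.
    chain.sort(key=lambda l: l["time"])
    print(key, [(l["module"], l["time"]) for l in chain])
```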
  • Step 4 Exception definition and flagging
  • exceptions are divided into two categories, one is functional exceptions and the other is non-functional exceptions.
  • Taking the system of Figure 1 as an example: if the OSS receives a northbound alarm or finds a performance file missing, that is a functional anomaly; if it is received but delayed, that is a non-functional anomaly. If the system returns an error when the user operates it, that is a functional anomaly; if the user feels the system is sluggish during operation, there may be internal functional or non-functional anomalies.
  • a software system is composed of various subsystems, and the abnormality of the system must be caused by the abnormality of one or more application modules or subsystems, but the abnormality of the subsystem does not necessarily lead to the abnormality of the whole system.
  • For example, an application acting as a Network Time Protocol (NTP) client needs to synchronize its clock with an NTP server but may at some moment be unable to reach the server; the NTP client may print an Error exception without necessarily affecting the business operation of the entire system.
  • after the log aggregation of step 3, a key vector, such as a given alarm, is aligned and compared from the southbound log to the northbound log to check completeness and mark anomalies; the log timestamps are used to calculate whether the delay exceeds the OSS standard, and whether the number of delayed entries within a period exceeds the OSS requirement (such as no more than 1% delayed) is then integrated to mark an anomaly.
  • Step 5 Train the anomaly detection model
  • An anomaly detector needs to detect anomalies in the system through certain rules or models, and this model is trained by historical log data and corresponding labels. Anomaly detection and judgment are performed on different subsystems first, and then an overall learning and training is carried out after summarization to obtain a model of whether the final system is abnormal.
  • exception logs of the operating system or underlying support services are generally simple and clear, and can be judged directly through key fields such as "Fatal Error"; however, not all exception prints in log files are real anomalies.
  • for example, an application or a support service acting as an NTP client needs to synchronize its clock with the NTP server but may at some moment be unable to reach it;
  • the client may print Error exceptions without affecting the business operation of the entire system; moreover, not all anomalies can be obtained directly from exception prints in the log text, for example delayed reporting, which may stem from slow intermediate processing (such as a program bug),
  • while the application only prints processing timestamps and never prints an exception; and an abnormal application function does not necessarily mean the entire system is abnormal.
  • Taking Figure 1 as an example, if resource reception by the external OSS system fails, the system's northbound sending of resource data will also be abnormal, but the entire data-conversion network management system may still be normal.
  • the anomaly detector needs to first perform anomaly detection and judgment on the different application subsystems, and then carry out an overall learning and training after summarization to obtain a model of whether the final system is abnormal, including: given a time window, the call chain, and the propagation chain, the corresponding module logs are obtained; according to the log-source classification of step 1, different anomaly detection models are invoked for different logs; for the operating system and basic services, special fields such as "FATAL" are used for detection, and an anomaly is flagged if found; for application logs, after the structured and vectorized historical log data is aggregated, detection can proceed by functional and non-functional anomaly, comparing the vectors of the end log and the start log; for application logs, binary classification learning can be performed with Naive Bayes (NB) or a Support Vector Machine (SVM) to obtain the model; the virtual-machine logs of the running application, such as GC logs, are checked for FULL GC and young-generation GC exceeding the preset time (such as 2 seconds), and an anomaly is flagged if found.
  • the trained model is the unified anomaly detection model.
  • Step 6 The anomaly detector performs two-stage detection and module positioning on real-time services.
  • when the real-time system is running, the anomaly detector first performs two-stage detection and then carries out preliminary anomaly localization to assist the final root-cause analysis.
  • in the first stage, the tool allocator flexibly assigns different tools and models to different logs according to the log classification, and each judges whether its part is abnormal according to its own learned model; the results are then processed according to their respective characteristics, and unified anomaly detection is performed to reach the conclusion of whether the system is abnormal.
  • the two-stage detection matters most when judging whether the system has non-functional anomalies. Again taking alarms from southbound to northbound as an example: once it is clear that the system has no missed alarm reports but does have an alarm-delay anomaly, i.e., the southbound data is complete but delayed, the modules are sorted by the deviation variance of their call consumption time according to the call chain, and module localization is then performed according to each module's anomaly occurrences.
  • Step 7 Determine the root cause of the failure.
  • log classification definitions, log conversion templates, common dictionaries, log tool allocation, and machine learning training can all be defined in a design-mode interface; the running mode then specifically performs classification, conversion, tool allocation, and so on.
  • this embodiment can judge more accurately whether the system is abnormal by using different anomaly detection means for different module application types and different logs; the two-stage anomaly judgment further improves accuracy; it provides a good reference for O&M systems that have only logs and no alarm anomalies; and flexibly allocating different methods and tools makes anomaly localization faster.
  • FIG. 9 is a schematic diagram of a log aggregation process according to an embodiment of the present disclosure, as shown in FIG. 9 , including:
  • semi-structured logs are converted into structured logs according to templates and key fields, and then vectorized.
  • the dimension of the vector is based on the log recording time, level (DEBUG/INFO/WARN/ERROR/FATAL), calling class method, thread name and microservice name, and key dimensions of log information.
  • the key dimensions of the log message, taking alarm and performance data as examples:
  • for alarm data, the key dimensions are the alarm title / occurrence time / affected network element / related IDs, etc.; they can be combined in different dimensions or merged into one dimension, but together they form the key dimension.
  • for performance data, they are the statistical network element / statistical time / statistics file name, etc.
  • Marking application anomalies is relatively easy for the underlying system and basic services, which have obvious error prints such as FATAL fields; but for the application layer, even an error stack does not necessarily indicate an anomaly, so the call chain must be combined with the logs for analysis.
  • Whether the application call is abnormal is marked according to whether the function is completed normally.
  • the time differences of all calls can be collected and judged against a normal distribution, with extreme deviations localized as anomalies; however, although the data volume of a single event, such as a single alarm from south to north, is constant, volume changes of other events in the overall system may shift the timing, so this approach is relatively inaccurate.
  • Fig. 10 is a schematic diagram of a log exception flag according to an embodiment of the present disclosure, as shown in Fig. 10 , including:
  • step S1006 a granularity-missing functional anomaly is marked, and step S1007 is then executed;
  • step S1007 the missing module is located according to the aggregation vector, and step S1013 is then executed;
  • FIG. 11 is a schematic diagram of the two-stage anomaly detection process of the embodiment of the present disclosure, as shown in FIG. 11 , including:
  • the tool allocator flexibly allocates different tools and models according to log categories
  • Whether the whole system is abnormal is determined by the common influence of abnormal services of each subsystem. First judge whether each subsystem is abnormal, and then judge whether the whole system is abnormal. If the whole system is judged to be abnormal, then assist in finding the root cause according to the abnormalities of the subsystems. For basic support systems, such as microservice systems, judge directly through the key fields in the log. For basic services, such as FTP and database, it is also judged directly by the key fields in the log. For GC logs, judge whether there are FULLGC and new generation GC beyond the standard time to judge the abnormality. For each application subsystem divided by function, functional abnormalities are judged according to whether the log vectors from the beginning to the end are complete, and non-functional abnormalities are judged according to the time difference from the beginning to the end.
  • the feature engineering is based on the log information, excluding the log recording time and including the level, the calling class method (needed because some exceptions appear in exception-handling classes), the thread (needed because some exceptions print logs in exception-handling threads), and the fields of the log message; the classification model is learned with NB (Naive Bayes).
  • the call chain is judged by the distribution of call times. For the same business data, both an overly long and an overly short time are abnormal, since an exception in the middle may jump straight out. For different business data, such as a very large alarm storm versus sparse alarm reporting, the call times are inconsistent, so linear regression by scale is needed to judge normality; the regression features can be the memory, threads, business volume (such as reported alarms), and log size within a specified period (generally, the larger the business volume, the larger the logs), and a linear-regression trend model is learned from these features.
  • when the overall system is abnormal and some subsystems are abnormal, a given subsystem anomaly may be a decisive factor for the whole system or may be unimportant; therefore the model needs to be learned again through machine learning.
  • the resources (memory/CPU/IO) of each microservice: the resource data itself is linear and can be divided into 5 dimensions in steps of 20%; for CPU, for example, consumption of 0-20%, 20%-40%, 40%-60%, 60%-80%, or 80%-100%, with the applicable dimension set to 1 and the others to 0.
  • FIG. 12 is a block diagram of a system anomaly detection and processing device according to another embodiment of the present disclosure. As shown in FIG. 12 , it includes:
  • the first acquisition module 122 is configured to acquire real-time data of multiple subsystems in the system within a preset time period
  • the first classification module 124 is configured to classify the real-time logs in the real-time data of multiple subsystems respectively, and obtain the classification results of the real-time logs of multiple subsystems;
  • the first abnormality detection module 126 is configured to perform abnormality detection on the log according to the abnormality detection method corresponding to the classification result, and obtain detection results of multiple subsystems;
  • the second abnormality detection module 128 is configured to perform abnormality detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
  • the first classification module 124 is further configured to
  • the real-time logs of multiple subsystems are classified into: operating system logs, basic service logs, and application logs.
  • the first anomaly detection module 126 is also set to
  • the detection result of the current log is determined through the key field of the current log
  • the device also includes:
  • the vectorization processing module is configured to perform vectorization processing on the current log to obtain a log vector
  • the aggregation module is configured to aggregate the log vectors according to the key fields of the log vectors to obtain multiple call chains of the current log.
  • the above determination module is further configured to
  • judge whether the log is an unstructured or semi-structured log and, when it is, convert the log into a structured log.
  • the second anomaly detection module 128 is further configured to
  • the detection results of multiple subsystems and the real-time data of multiple subsystems are input into the pre-trained target anomaly detection model, and the target anomaly detection results of the system output by the target anomaly detection model are obtained.
  • the above-mentioned device also includes:
  • the second acquisition module is configured to acquire historical data of a predetermined number of the multiple subsystems and the corresponding system anomaly detection results;
  • the second classification module is configured to classify the historical logs in the historical data of a plurality of subsystems of a preset number respectively, and obtain classification results of the historical logs of the plurality of subsystems;
  • the third anomaly detection module is configured to perform anomaly detection on historical logs according to the anomaly detection methods corresponding to the classification results, and obtain the detection results of a predetermined number of subsystems;
  • the training module is configured to train the initial anomaly detection model according to the detection results of a predetermined number of multiple subsystems, the data of a predetermined number of multiple subsystems, and the anomaly detection results of the corresponding systems, so as to obtain the trained target anomaly detection model .
  • the above training module is further configured to
  • train the initial anomaly detection model using the detection results of the predetermined number of subsystems, the historical data of the predetermined number of subsystems, and the corresponding system anomaly detection results, to obtain the target anomaly detection model, wherein the detection results and the historical data of the predetermined number of subsystems are the input of the initial anomaly detection model, and the target anomaly detection result of the system output by the trained target anomaly detection model and the actual corresponding system anomaly detection result satisfy the preset objective function.
  • the device also includes:
  • the root cause location module is configured to perform fault root cause location processing on the abnormality according to the detection results of multiple subsystems when the abnormality detection result of the system indicates that there is an abnormality.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the above method embodiments when running.
  • the above computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing a computer program.
  • Embodiments of the present disclosure also provide an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • the electronic device may further include a transmission device and an input and output device, wherein the transmission device is connected to the processor, and the input and output device is connected to the processor.
  • each module or step of the present disclosure described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network composed of multiple computing devices; they may be implemented in program code executable by a computing device, and thus may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be executed in an order different from that here, or they may be made into individual integrated-circuit modules, or multiple modules or steps among them may be made into a single integrated-circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of the present disclosure provide a system anomaly detection and processing method and device. The method includes: acquiring real-time data of multiple subsystems in a system within a preset time period; classifying the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems; performing anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems; and performing anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems. This can solve the problem in the related art that one and the same anomaly detection mode cannot be adapted to different subsystems and cannot effectively troubleshoot anomalies of the entire system: different types of logs are analyzed with different anomaly detection modes, and unified anomaly detection processing is performed on the system based on the detection results and real-time data of each subsystem, which helps locate anomalies and the root causes of faults.

Description

System anomaly detection and processing method and device
Cross-reference to related applications
The present disclosure is based on the Chinese patent application No. 202111152914.6, entitled "System anomaly detection and processing method and device" and filed on September 29, 2021, claims priority to that application, and incorporates its entire disclosure herein by reference.
Technical field
Embodiments of the present disclosure relate to the field of communications, and in particular to a system anomaly detection and processing method and device.
Background
In operation and maintenance assurance in the telecommunications industry, anomaly detection and fault localization are a very important link. Beyond system stability, operators pay more attention to whether functions remain continuously available, for example whether the resource data and performance data reported to the operator's OSS network management system are missing, and whether reported network-element alarm data is excessively delayed. Log analysis is a very important means of assurance: if a device, or the software running on it, fails, then regardless of whether an alarm is raised, log analysis is critical and necessary for locating the root cause of the anomaly and resolving the fault.
Figure 1 is a schematic diagram of the data flow of a telecommunications assurance network management system in the related art. As shown in Figure 1, the southbound side needs to receive business data, such as alarm data, performance data, and resource data, from multiple lower-level Element Management Systems (EMS); after corresponding processing and conversion, the data is reported northbound to the upper-level operator's Operation Support Systems (OSS) network management for centralized processing. Owing to business complexity, this system consists of multiple subsystems: an alarm subsystem, a performance subsystem, a resource subsystem, the PG database, the Kafka service, and so on. The alarm, performance, and resource subsystems are business subsystems, while the PG database and Kafka service, together with FTP and NTP (not shown in the figure), are basic services.
Merely comparing the times of northbound sending logs and southbound receiving logs, or comparing individual alarms and performance records, can reveal an anomaly but cannot locate which module is abnormal. Manually searching all internal logs to find problems is clearly unrealistic; likewise, applying the same analysis tools and methods to logs of different formats and purposes from different subsystems is infeasible. Some subsystems, such as databases, operating systems, and Java memory garbage collection (Garbage Collect, GC) logs, have dedicated log analysis tools, and for unformatted data in more complex formats there are open-source tools such as Drain; but because log content is strongly purpose-specific, each tool minds only its own domain, and they cannot effectively troubleshoot the system as a whole.
For the problem in the related art that one and the same anomaly detection mode cannot be adapted to different subsystems and cannot effectively troubleshoot anomalies of the entire system, no solution has yet been proposed.
Summary
Embodiments of the present disclosure provide a system anomaly detection and processing method and device, to at least solve the problem in the related art that one and the same anomaly detection mode cannot be adapted to different subsystems and cannot effectively troubleshoot anomalies of the entire system.
According to one embodiment of the present disclosure, a system anomaly detection and processing method is provided, including:
acquiring real-time data of multiple subsystems in a system within a preset time period;
classifying the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems;
performing anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems;
performing anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
According to another embodiment of the present disclosure, a system anomaly detection and processing device is further provided, including:
a first acquisition module, configured to acquire real-time data of multiple subsystems in a system within a preset time period;
a first classification module, configured to classify the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems;
a first anomaly detection module, configured to perform anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems;
a second anomaly detection module, configured to perform anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
According to yet another embodiment of the present disclosure, a computer-readable storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the above method embodiments when run.
According to yet another embodiment of the present disclosure, an electronic device is further provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
In the embodiments of the present disclosure, real-time data of multiple subsystems in a system is acquired within a preset time period; the real-time logs in the real-time data of the multiple subsystems are classified respectively to obtain classification results of the real-time logs of the multiple subsystems; anomaly detection is performed on the logs according to the anomaly detection mode corresponding to each classification result to obtain detection results of the multiple subsystems; and anomaly detection processing is performed on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems. This can solve the problem in the related art that one and the same anomaly detection mode cannot be adapted to different subsystems and cannot effectively troubleshoot anomalies of the entire system: the logs in each subsystem are classified, different logs are analyzed with different anomaly detection modes, and unified anomaly detection processing is performed on the system based on the detection results and real-time data of each subsystem, which helps locate anomalies and the root causes of faults.
Brief description of the drawings
Figure 1 is a schematic diagram of the data flow of a telecommunications assurance network management system in the related art;
Figure 2 is a block diagram of the hardware structure of a mobile terminal for a system anomaly detection and processing method according to an embodiment of the present disclosure;
Figure 3 is a flowchart of a system anomaly detection and processing method according to an embodiment of the present disclosure;
Figure 4 is a schematic diagram of an anomaly detection system architecture according to an embodiment of the present disclosure;
Figure 5 is a schematic diagram of a structured log printed for an alarm received southbound according to an embodiment of the present disclosure;
Figure 6 is a schematic diagram of a structured log printed for an alarm processed by Kafka according to an embodiment of the present disclosure;
Figure 7 is a schematic diagram of a structured log printed for an alarm sent by the northbound module to the OSS according to an embodiment of the present disclosure;
Figure 8 is a schematic diagram of a semi-structured log printed by intermediate alarm processing according to an embodiment of the present disclosure;
Figure 9 is a schematic diagram of a log aggregation flow according to an embodiment of the present disclosure;
Figure 10 is a schematic diagram of log anomaly marking according to an embodiment of the present disclosure;
Figure 11 is a schematic diagram of a two-stage anomaly detection flow according to an embodiment of the present disclosure;
Figure 12 is a block diagram of a system anomaly detection and processing device according to another embodiment of the present disclosure.
Detailed description
Embodiments of the present disclosure are described in detail below with reference to the drawings and in combination with the embodiments.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
The method embodiments provided in the embodiments of the present disclosure may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, Figure 2 is a block diagram of the hardware structure of a mobile terminal for a system anomaly detection and processing method according to an embodiment of the present disclosure. As shown in Figure 2, the mobile terminal may include one or more processors 102 (only one is shown in Figure 2; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and may further include a transmission device 106 for communication functions and an input/output device 108. A person of ordinary skill in the art will understand that the structure shown in Figure 2 is merely illustrative and does not limit the structure of the mobile terminal; for example, the mobile terminal may include more or fewer components than shown in Figure 2, or have a configuration different from that shown in Figure 2.
The memory 104 may be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the system anomaly detection and processing method in the embodiments of the present disclosure; the processor 102 runs the computer program stored in the memory 104, thereby executing various functional applications and data processing, i.e., implementing the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memories remotely located relative to the processor 102, which may be connected to the mobile terminal through a network; examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the mobile terminal. In one instance, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one instance, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
This embodiment provides a system anomaly detection and processing method running on the above mobile terminal or network architecture, applied to a terminal that accesses the current master node (MN) cell and the current secondary node (SN) cell of a source area through dual connectivity (Dual Connection, DC). Figure 3 is a flowchart of a system anomaly detection and processing method according to an embodiment of the present disclosure; as shown in Figure 3, the flow includes at least the following steps:
Step S302: acquire real-time data of multiple subsystems in the system within a preset time period;
Step S304: classify the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems;
In this embodiment, step S304 may specifically include: classifying the real-time logs of the multiple subsystems by log source into operating system logs, basic service logs, and application logs.
The real-time data in this embodiment includes at least real-time logs, the scaling state of the microservices, the range in which the microservices' running resources fall, and the call consumption time between microservices.
Step S306: perform anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems;
In this embodiment, step S306 may specifically include: performing the following operations on the real-time logs of each of the multiple subsystems to obtain the detection results of the multiple subsystems, where the real-time log currently being processed is called the current log: when the current log is an operating system log or a basic service log, determining the detection result of the current log through key fields of the current log; when the current log is an application log, inputting the current log into a pre-trained classification detection model to obtain the detection result of the current log output by the classification detection model.
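As a concrete, non-limiting illustration of this dispatch, the following Python sketch pairs a key-field check for operating system and basic service logs with a pre-trained model for application logs; the keyword list and the app_model object are assumptions made for the example, not the disclosed implementation.

```python
# A sketch of step S306's dispatch, assuming a log class assigned in step
# S304 and a pre-trained application-log model; names here are illustrative.
FATAL_KEYWORDS = ("FATAL", "Fatal Error", "Out Of Memory")

def detect_os_or_basic(log_text: str) -> bool:
    """Key-field check for operating system / basic service logs."""
    return any(keyword in log_text for keyword in FATAL_KEYWORDS)

def detect(log_text: str, log_class: str, app_model) -> bool:
    """Return True when the log is judged abnormal."""
    if log_class in ("os", "basic_service"):
        return detect_os_or_basic(log_text)
    # Application logs go through the pre-trained classification model,
    # e.g. the NB/SVM binary classifier trained in step 5 below.
    return bool(app_model.predict([log_text])[0])

print(detect("2022-07-07 12:00:01 FATAL kafka broker down",
             "basic_service", None))  # True
```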
Step S308: perform anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
In this embodiment, step S308 may specifically include: inputting the detection results of the multiple subsystems and the real-time data of the multiple subsystems into a pre-trained target anomaly detection model, to obtain the target anomaly detection result of the system output by the target anomaly detection model.
Through steps S302 to S308, the problem in the related art that one and the same anomaly detection mode cannot be adapted to different subsystems and cannot effectively troubleshoot anomalies of the entire system can be solved: the logs in each subsystem are classified, different logs are analyzed with different anomaly detection modes, and unified anomaly detection processing is performed on the system based on the detection results and real-time data of each subsystem, which helps locate anomalies and the root causes of faults.
In one embodiment, before the current log is input into the pre-trained classification detection model to obtain the detection result of the current log output by the model, the method further includes: determining that the current log is a structured log, specifically, judging whether the log is an unstructured or semi-structured log and, if so, converting it into a structured log; vectorizing the current log to obtain a log vector; and aggregating the log vectors according to the key fields of the log vectors to obtain multiple call chains of the current log.
In another embodiment, before step S308, the method further includes: acquiring historical data of a predetermined number of the multiple subsystems and the corresponding system anomaly detection results, where the historical data includes at least historical logs, the scaling state of the microservices, the range in which the microservices' running resources fall, and the call consumption time between microservices; classifying the historical logs in the historical data of the predetermined number of subsystems respectively to obtain classification results of the historical logs of the multiple subsystems; performing anomaly detection on the historical logs according to the anomaly detection modes corresponding to the classification results to obtain the detection results of the predetermined number of subsystems; and training an initial anomaly detection model according to the detection results of the predetermined number of subsystems, the data of the predetermined number of subsystems, and the corresponding system anomaly detection results, to obtain the trained target anomaly detection model. Further, the initial anomaly detection model is trained using the detection results of the predetermined number of subsystems, the historical data of the predetermined number of subsystems, and the corresponding system anomaly detection results to obtain the target anomaly detection model, where the detection results and the historical data of the predetermined number of subsystems are the input of the initial anomaly detection model, and the target anomaly detection result of the system output by the trained target anomaly detection model and the actual corresponding system anomaly detection result satisfy a preset objective function.
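The train-evaluate-tune-release loop that this training embodiment implies (steps S404 to S407 below) can be pictured with the following sketch, assuming scikit-learn; the synthetic features, labels, and the 0.9 F1 release threshold are invented for illustration.

```python
# A sketch of training the target anomaly detection model with an
# evaluation gate; all data here is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Rows: one observation window. Columns: subsystem detection results plus
# real-time features (scaling flag, resource-range bins, call-time stats).
X = rng.random((200, 8))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # synthetic "system abnormal" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for c in (0.1, 1.0, 10.0):          # simple hyperparameter tuning loop
    model = SVC(C=c).fit(X_tr, y_tr)
    score = f1_score(y_te, model.predict(X_te))
    if score >= 0.9:                 # evaluation passed: release the model
        print(f"release model with C={c}, f1={score:.2f}")
        break
else:
    print("keep tuning")             # evaluation failed: adjust and retrain
```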
In another embodiment, after anomaly detection processing is performed on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems, the method further includes: when the system anomaly detection result indicates that an anomaly exists, performing fault root-cause localization for the anomaly according to the detection results of the multiple subsystems.
In this embodiment, multiple models are obtained from the historical logs in the system through data mining and machine learning; when real-time logs are processed, different logs are vectorized according to different models and analyzed with deep learning according to their respective models, and the analysis is then unified and centralized to help locate anomalies and root causes. Logs of different modules and systems are first collected and then labeled by log format and purpose, different tools are designed for the corresponding processing, and centralized analysis then helps locate the root cause when an anomaly appears.
Figure 4 is a schematic diagram of an anomaly detection system architecture according to an embodiment of the present disclosure. As shown in Figure 4, this embodiment includes a log classifier, a log converter, a tool allocator, an anomaly detector, and a machine learner. Initially, the network management logs and modules, the related exception knowledge base, and the service call chain are all initialized. The flow specifically includes:
Step S401: the log classifier classifies the historical log data;
Step S402: the log converter converts the historical logs into structured logs;
Step S403: log vectorization and aggregation;
Step S404: the machine learner trains the anomaly detection model;
Step S405: evaluate whether model training is complete; if so, execute step S407, otherwise execute step S406;
Step S406: the machine learner tunes parameters (i.e., adjusts the parameters of the anomaly detection model);
Step S407: release the model (i.e., the trained anomaly detection model);
Step S408: the log classifier classifies the real-time log data;
Step S409: the log converter converts the real-time logs into structured logs;
Step S410: log vectorization and aggregation;
Step S411: the tool allocator obtains the corresponding anomaly detector for each type of log;
Step S412: the anomaly detector detects system anomalies through the anomaly detection model;
Step S413: assist in locating the root cause.
Then, the logs are classified according to the design-domain definitions, converted into structured logs, vectorized, and aggregated; machine learning is performed on the anomalies of each module to obtain classification models (with re-tuning and re-learning if evaluation fails), and machine learning is then performed on the overall system.
The inference side makes judgments based on the learned models and then assists root-cause localization.
The log classifier classifies the logs; the log converter converts unstructured and semi-structured logs into structured logs and implements vectorization and clustering; the tool allocator assigns different detection tools or methods to different logs; the machine learner performs machine learning training; and the anomaly detector implements two-stage anomaly detection and preliminary localization.
The detailed steps of this solution are as follows:
Step 1: establish the initial knowledge base.
The knowledge base is divided into the system call chain, the exception knowledge base, and the structured-log template library.
The system call chain includes the call and propagation relationships between all microservices, as well as the application microservice names, process names, thread lists, and log file names.
The exception knowledge base includes the exception dictionary, the exception hyperparameters, and fault-stack patterns.
The exception dictionary includes common system errors such as "FATAL", "Error", and "Out Of Memory"; the exception hyperparameters cover Java Virtual Machine (JVM) GC, including FULL GC events that stall the application and young-generation GC exceeding a preset time, such as 2 seconds.
A fault stack is not necessarily an anomaly, but it is very helpful for locating anomalies, so stack logs need to be recognized; certain fields or patterns are therefore needed to identify fault stacks, for example training an NB (Naive Bayes) model to recognize log paragraphs containing the words "Caused by:" and "at".
The structured-log template library defines templates for subsequently recognizing structured, unstructured, and semi-structured logs, such as JSON structured logs and logs with clear field definitions (such as operating system logs, GC logs, and application logs output by calling Log4J or Logback).
Step 2: the log classifier classifies the logs.
The log classifier classifies real-time logs or training data (historical log data) by log source: operating system logs, basic service logs such as database or Kafka logs, or application logs.
Step 3: log vectorization and log aggregation.
Before log vectorization, semi-structured logs need to be converted into structured logs. A structured log, narrowly defined, is a log in JSON (JavaScript Object Notation) format; broadly understood, it is any log whose content can be extracted according to a certain template. The broad sense is used here, referring generally to logs that can be extracted according to a template, such as the logs shown in Figures 5, 6, and 7.
Generally speaking, the logs of the underlying operating system or basic services today are structured, while upper-layer application logs can be divided into semi-structured and unstructured logs. Unstructured logs arise when an application prints logs without calling a standard log library, for example a Java application that does not use log modules such as log4j or Logback but prints its own debugging output; with the standardization of programs and logging, such logs are now very rare and can be ignored. Semi-structured logs arise when an application, such as a Java program, does call log4j or Logback, so that there is a standard timestamp, log level, class, function, thread ID, and specific debugging content; the leading part is structured, but the debugging content, which carries key information, is not necessarily structured, so such logs can be called semi-structured. Figure 8 is a schematic diagram of a semi-structured log printed by intermediate alarm processing according to an embodiment of the present disclosure; as shown in Figure 8, the log shows the unstructured information within a semi-structured log.
In most log detection, however, it is precisely these semi-structured logs that are processed most. Converting unstructured logs into structured ones has fairly mature solutions, such as the open-source tool Logstash, which uses Grok to write regular expressions that derive structure from unstructured data; if the log business is complex, writing the regular expressions is not easy, but such logs are few and can be handled by word segmentation combined with hard coding. Finally, all converted structured logs are vectorized. The vector is composed of key features, such as the key log content; taking alarm data as an example, the alarm title, alarm occurrence time, and alarm occurrence location can be hash-encoded together into one dimension of the vector, or form three dimensions; in addition, the record time of the log entry is also a dimension of the vector.
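A minimal sketch of this hash encoding of key fields is shown below; the hash function and bucket count are arbitrary choices made for illustration and are not specified in the disclosure.

```python
# Hash-encode the key alarm fields into one bounded vector dimension,
# keeping the log record time as a separate dimension.
import hashlib

def hash_dim(value: str, buckets: int = 1_000_003) -> int:
    """Stable hash of a key field into a bounded integer dimension."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def vectorize_alarm(record_time: str, title: str,
                    occurred: str, location: str) -> list:
    # One combined key dimension (title + time + location); the three
    # fields could equally form three separate dimensions.
    key = hash_dim(f"{title}|{occurred}|{location}")
    return [record_time, key]

print(vectorize_alarm("12:00:01.100", "LINK_DOWN", "11:59:58", "NE-7"))
```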
Logs are aggregated after vectorization. Aggregation means grouping logs of the same nature. For example, an alarm received southbound from a lower-level network element is converted and sent northbound to the OSS; besides the southbound and northbound logs, Kafka and the database record logs, and the intermediate alarm processing module may also record logs. If these logs are not aggregated, they are mixed with other logs (such as those of another alarm, or of performance or resource data); after aggregation, the entire processing flow of this piece of data can be seen clearly.
Aggregation is performed on the key dimension of the log vector. Taking alarm logs as an example, aggregation uses the unique-identifier dimension of the alarm vector (such as alarm title + alarm occurrence time + alarm occurrence location); the dimension recording the log time then shows the timing of this alarm across the different processing stages.
Step 4: anomaly definition and marking.
For a software system, anomalies fall into two categories: functional anomalies and non-functional anomalies. Taking the system of Figure 1 as an example, if the OSS receives a northbound alarm or finds a performance file missing, that is a functional anomaly; if it is received but delayed, that is a non-functional anomaly. If the system returns an error when the user operates it, that is a functional anomaly; if the user feels the system is sluggish during operation, there may be internal functional or non-functional anomalies.
A software system is composed of subsystems, and a system anomaly is necessarily caused by the anomaly of one or more application modules or subsystems; however, a subsystem anomaly does not necessarily cause the whole system to be abnormal.
For example, an application acting as a Network Time Protocol (NTP) client needs to synchronize its clock with an NTP server but may at some moment be unable to reach the server; the NTP client may print an Error exception without necessarily affecting the business operation of the whole system.
Anomaly awareness is needed for the system and each subsystem. Relying entirely on user perception is clearly incomplete, since user perception is hard to quantify and differs between users; likewise, judging only by whether the log contains words from the exception dictionary (defined in step 1) or an exception stack is also incomplete: some logs merely print errors such as "Error" yet handle them and continue running normally without affecting the overall flow, while some programs print no exception when an error occurs even though the function has actually failed.
Therefore, whether the system is abnormal needs to be judged through machine learning, and supervised learning needs labels. In this solution, besides the obvious functional anomalies perceived by users or testers while verifying functional modules, the key vectors aggregated in step 3, such as a given alarm, are aligned and compared from the southbound log to the northbound log to check completeness and mark anomalies; the log timestamps are used to calculate whether the delay exceeds the OSS standard, and whether the number of delayed entries within a period exceeds the OSS requirement (such as no more than 1% delayed) is then integrated to mark an anomaly. Anomalies of course come in many kinds; for simplicity only binary classification is needed: whenever the system is not normal, it is abnormal.
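This completeness-and-delay labeling can be sketched as follows, assuming aggregated alarm chains carrying southbound and northbound timestamps; the 60-second per-alarm standard is an invented example value, while the 1% ratio follows the text above.

```python
# A sketch of the binary anomaly labeling for one time window of alarms.
DELAY_LIMIT_S = 60.0      # per-alarm delay standard (assumed example value)
DELAY_RATIO_LIMIT = 0.01  # OSS requirement: no more than 1% delayed

def label_window(chains):
    """Return 1 (abnormal) or 0 (normal) for a window of alarm chains."""
    delayed = 0
    for chain in chains:
        if "northbound" not in chain:   # missing end log: functional anomaly
            return 1
        if chain["northbound"] - chain["southbound"] > DELAY_LIMIT_S:
            delayed += 1
    return 1 if delayed / max(len(chains), 1) > DELAY_RATIO_LIMIT else 0

window = [{"southbound": 0.0, "northbound": 2.5},
          {"southbound": 1.0, "northbound": 90.0}]
print(label_window(window))  # 2 alarms, 1 delayed -> ratio 50% -> abnormal
```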
Step 5: train the anomaly detection model.
Although an anomaly has been perceived, it is necessary to go further and discover where the problem lies, i.e., which subsystem(s) or microservice(s) caused the whole system to become abnormal. The anomaly detector needs to detect system anomalies through certain rules or models, and these models are trained from historical log data and the corresponding labels. Anomaly detection and judgment are performed on the different subsystems first, and an overall learning and training is then carried out after summarization to obtain a model of whether the final system is abnormal.
Regarding anomalies of the whole system, not all anomaly information must be judged by machine learning. Exception logs of the operating system or underlying support services are generally simple and clear and can be judged directly through key fields such as "Fatal Error". Not every exception printed in a log file is a real anomaly: for example, an application or support service acting as an NTP client needs to synchronize its clock with the NTP server but may at some moment be unable to reach it, and may print an Error exception without affecting the business operation of the whole system. Not every anomaly can be obtained directly from exception prints in the log text: delayed reporting, for instance, may stem from slow intermediate processing (such as a program bug), while the application only prints processing timestamps and never prints an exception. And an abnormal application function does not necessarily mean the whole system is abnormal: taking Figure 1 as an example, if resource reception by the external OSS fails, the system's northbound sending of resource data will also be abnormal, yet the whole data-conversion network management system may still be normal.
Therefore, the anomaly detector first performs anomaly detection and judgment on the different application subsystems, and then carries out an overall learning and training after summarization to obtain a model of whether the final system is abnormal. Specifically: given a time window, the call chain, and the propagation chain, the corresponding module logs are obtained; according to the log-source classification of step 1, different anomaly detection models are invoked for different logs; for the operating system and basic services, special fields such as "FATAL" are used for detection, and an anomaly is flagged if found; for application logs, after the structured and vectorized historical log data is aggregated, detection can proceed by functional and non-functional anomaly, comparing the vectors of the end log and the start log; for application logs, binary classification learning can be performed with Naive Bayes (NB) or a Support Vector Machine (SVM) to obtain the model; the virtual-machine logs of the running application, such as GC logs, are checked to determine whether there is a FULL GC or a young-generation GC exceeding the preset time (such as 2 seconds), and if so an anomaly is flagged; between call chains, the time differences of normally-labeled logs are used to model each call's time as a normal distribution, with the variance as a feature; finally, unified modeling is performed, training again over the whole system. As stated above, a local anomaly does not necessarily make the whole system abnormal, so the overall system must be trained again, with the label still being whether the whole system is abnormal. The parameter features to be trained are: whether the underlying system is abnormal, whether the basic services are abnormal, whether the application subsystems are abnormal, whether microservices have scaled, the range in which each microservice's running resource data falls (CPU/memory/IO, etc.), and whether the call consumption time between call-chain services is within N standard deviations (N = 1, 2, 3). The trained model is the unified anomaly detection model.
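For the GC-log check mentioned above, a simplified sketch follows; the log-line pattern mimics a common JVM GC log shape but is illustrative only and not taken from the disclosure.

```python
# A sketch of the GC-log check: flag any FULL GC, and any young-generation
# GC whose pause exceeds the preset threshold.
import re

YOUNG_GC_LIMIT_S = 2.0  # preset threshold for young-generation GC pauses

GC_LINE = re.compile(r"\[(Full GC|GC).*?(\d+\.\d+)\s*secs\]")

def gc_abnormal(gc_log: str) -> bool:
    for kind, secs in GC_LINE.findall(gc_log):
        if kind == "Full GC":                 # any FULL GC is flagged
            return True
        if float(secs) > YOUNG_GC_LIMIT_S:    # slow young-gen GC is flagged
            return True
    return False

sample = ("2.134: [GC (Allocation Failure) 512M->128M(1024M), 0.0456 secs]\n"
          "7.890: [Full GC (Ergonomics) 900M->300M(1024M), 1.2345 secs]")
print(gc_abnormal(sample))  # True: a FULL GC occurred
```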
Step 6: the anomaly detector performs two-stage detection and module localization on real-time business.
When the real-time system is running, the anomaly detector first performs two-stage detection and then carries out preliminary anomaly localization to assist the final root-cause analysis.
In the first stage of the two-stage detection, based on the models trained in step 5, for a time window of logs of each service on the call chain, the tool allocator flexibly assigns different tools and models to different logs by log category, and each judges whether its part is abnormal according to its own learned model; the results are then processed according to their respective characteristics, and unified anomaly detection is performed to reach the conclusion of whether the system is abnormal.
The two-stage detection matters most when judging whether the system has non-functional anomalies. Again taking alarms from southbound to northbound as an example: once it is clear that the system has no missed alarm reports but does have an alarm-delay anomaly, i.e., the southbound data is complete but delayed, the modules on the call chain are sorted by the deviation variance of their call consumption time according to the call chain, and module localization is then performed according to each module's anomaly occurrences.
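The sorting by deviation of call consumption time can be sketched as follows; the per-module statistics and observations are invented for the example.

```python
# Rank call-chain modules by how far their call consumption time deviates
# from the learned normal distribution; inspect the top-ranked module first.
stats = {            # learned mean/std of call time per module, in ms
    "southbound": (30.0, 5.0),
    "processing": (80.0, 10.0),
    "northbound": (25.0, 4.0),
}
observed = {"southbound": 32.0, "processing": 260.0, "northbound": 27.0}

def deviation(module: str) -> float:
    mean, std = stats[module]
    return abs(observed[module] - mean) / std   # z-score style deviation

suspects = sorted(observed, key=deviation, reverse=True)
print(suspects)  # ['processing', ...] -> localize to the top module
```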
Although the detection has two stages, different tools or methods are flexibly assigned to different logs at each stage; compared with manual detection and localization, or with detection and localization using a single tool and a single method, the efficiency is clearly much higher.
Step 7: determine the root cause of the fault.
After preliminary localization, the real root cause of the fault is analyzed and located by combining the exception-stack data obtained in step 1 (if any) with the code.
Log classification definitions, log conversion templates, common dictionaries, log tool allocation, and machine learning training can all be defined in a design-mode interface; the running mode then specifically performs classification, conversion, tool allocation, and so on.
By using different anomaly detection means for different module application types and different logs, this embodiment can judge more accurately whether the system is abnormal; the two-stage anomaly judgment further improves accuracy; it provides a good reference for O&M systems that have only logs and no alarm anomalies; and flexibly allocating different methods and tools makes anomaly localization faster.
For log vectorization and log aggregation, Figure 9 is a schematic diagram of the log aggregation flow according to an embodiment of the present disclosure; as shown in Figure 9, it includes:
S901: prepare the logs;
S902: define templates and extract the relevant data;
S903: segment the unstructured record content into words and extract the key fields;
S904: split this log statement into dimensions;
S905: vectorize this log statement;
S906: aggregate by the key dimensions.
For all potentially related logs, semi-structured logs are converted into structured logs according to templates and key fields, and then vectorized. The vector dimensions follow the log record time, the level (DEBUG/INFO/WARN/ERROR/FATAL), the calling class method, the thread name and microservice name, and the key dimensions of the log message.
Taking alarm and performance data as examples of the key dimensions of the log message: for alarm data, the key dimensions are the alarm title / occurrence time / affected network element / related IDs, etc.; they can be combined in different dimensions or merged into one dimension, but together they form the key dimension. Likewise, for performance data, they are the statistical network element / statistical time / statistics file name, etc. After vectorization, aggregation is performed by the key dimensions, so that all logs of one event along the call chain, from beginning to end, can be gathered together.
For marking application anomalies: for the underlying system and basic services, anomaly marking is relatively easy, as there are obvious error prints such as FATAL fields; but for the application layer, even an error stack does not necessarily indicate an anomaly, so the call chain must be combined with the logs for analysis.
The relevant logs are obtained according to the call chain; after log vectorization and aggregation, the start-log and end-log vectors are retrieved. If retrieval fails, there is definitely an anomaly in between. Even if both can be retrieved, the end time recorded in the logs minus the start time may be anomalous: if it is greater than some designed or statistically derived figure, there is definitely an anomaly; being smaller is not necessarily normal either, since an exception in between may have skipped normal processing and ended early.
In this situation, the following approaches are available:
Mark whether this application call is abnormal according to whether the function completed normally.
The time differences of all calls can be collected and judged against a normal distribution, with extreme deviations localized as anomalies; however, although the data volume of a single event, such as a single alarm from south to north, is constant, volume changes of other events in the overall system may shift the timing, so this approach is relatively inaccurate.
Some functions, like the direct south-to-north reporting in Figure 1, have a standard time stipulated by the operator; exceeding it is also marked as an anomaly. This approach is common in network management systems in the telecommunications industry.
Figure 10 is a schematic diagram of log anomaly marking according to an embodiment of the present disclosure; as shown in Figure 10, it includes:
S1001: prepare the logs;
S1002: obtain the aggregated log vectors;
S1003: obtain the start-log vector;
S1004: obtain the end-log vector;
S1005: check whether retrieval succeeded; if not, execute step S1006; if so, execute step S1008;
S1006: mark a granularity-missing functional anomaly, then execute step S1007;
S1007: locate the missing module according to the aggregation vector, then execute step S1013;
S1008: compute end time minus start time;
S1009: judge whether the time difference exceeds the standard; if not, execute step S1010; if so, execute step S1011;
S1010: mark as normal;
S1011: mark a delay non-functional anomaly;
S1012: statistically locate, from the aggregation vectors, the module consuming the most time;
S1013: mark as abnormal.
For the subsystem anomaly detection models, Figure 11 is a schematic diagram of the two-stage anomaly detection flow of an embodiment of the present disclosure; as shown in Figure 11, it includes:
S1100: prepare the logs and resource data within the time period;
S1101: the tool allocator flexibly assigns different tools and models by log category;
S1102: detect through key fields whether the basic services are abnormal;
S1103: detect whether the application virtual machine has anomalies;
S1104: use log aggregation to judge whether the application is abnormal;
S1105: judge whether the application is abnormal;
S1106: use the linear regression model to judge whether the call-chain durations of the application submodules are abnormal;
S1107: perform the feature engineering for the overall judgment;
S1108: judge overall whether the system is abnormal;
S1109: obtain the abnormal subsystem application modules;
S1110: assist root-cause localization.
Whether the whole system is abnormal is determined by the combined influence of the service anomalies of each subsystem. First judge whether each subsystem is abnormal, then judge whether the whole system is abnormal; if the whole system is judged abnormal, the subsystem anomalies are used to assist in finding the root cause. Basic support systems, such as the microservice system, are judged directly through key fields in the logs. Basic services, such as FTP and the database, are likewise judged directly through key fields in the logs. GC logs are judged abnormal by whether there is a FULL GC or a young-generation GC exceeding the standard time. For each application subsystem divided by function, functional anomalies are judged by whether the start-to-end log vectors are complete, and non-functional anomalies by the start-to-end time difference.
Each submodule of each application needs to be learned with binary classification. The feature engineering follows the recorded log information, excluding the log record time and including the level, the calling class method (needed because some exceptions appear in exception-handling classes), the thread (needed because some exceptions print logs in exception-handling threads), and the fields of the log message; the classification model is learned with NB (Naive Bayes).
Call chains are judged by the distribution of call times. For the same business data, both an overly long and an overly short time are abnormal, since an exception in the middle may jump straight out. For different business data, such as a very large alarm storm versus sparse alarm reporting, the call times are inconsistent, so linear regression by scale is needed to judge normality; the regression features can be the memory, threads, business volume (such as reported alarms), and log size within a specified period (generally, the larger the business volume, the larger the logs), and a linear-regression trend model is learned from these features.
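A sketch of this scale-aware linear regression follows, assuming scikit-learn; the feature values and the tolerance factor are invented for illustration.

```python
# Learn a linear trend of call-chain time versus load, then flag calls
# that stray too far from the trend in either direction.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per window: memory (MB), threads, business volume (alarms),
# log size (MB). Target: total call-chain time (ms) under normal load.
X = np.array([[512, 40,  100,  2.0],
              [600, 44,  500,  6.5],
              [750, 52, 2000, 22.0],
              [900, 60, 5000, 60.0]], dtype=float)
y = np.array([120.0, 180.0, 420.0, 900.0])

trend = LinearRegression().fit(X, y)

def call_time_abnormal(features, measured_ms, tolerance=1.5):
    """Abnormal when the measured time strays too far from the trend."""
    expected = trend.predict([features])[0]
    return measured_ms > expected * tolerance or measured_ms < expected / tolerance

print(call_time_abnormal([700, 50, 1800, 20.0], 2500.0))  # True: far above trend
```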
When the overall system is abnormal and some subsystems are abnormal, a given subsystem anomaly may be a decisive factor for the whole system or may be unimportant; therefore the model must be learned again through machine learning.
The features of the feature engineering here are defined as follows:
whether each supporting subsystem (including support systems and basic services) is abnormal;
whether each application subsystem is abnormal;
whether the number of system microservices exceeds the standard;
whether the microservices have scaled;
the resources (memory/CPU/IO) of each microservice, where the resource data itself is linear and can be divided into 5 dimensions in steps of 20%; for CPU, for example, consumption of 0-20%, 20%-40%, 40%-60%, 60%-80%, or 80%-100%, with the applicable dimension set to 1 and the others to 0.
A binary classification model is learned with SVM (see the sketch following this section).
At real-time runtime, the binary classification model judges whether the whole system is abnormal.
If it is abnormal, whether each subsystem application is abnormal is used to assist in locating the root cause.
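The one-hot resource binning and the SVM whole-system classifier can be sketched as follows; the feature layout and sample data are illustrative assumptions, not the disclosed feature set in full.

```python
# One-hot encode resource usage into five 20% bins, assemble whole-system
# feature vectors, and learn an SVM binary classifier over them.
import numpy as np
from sklearn.svm import SVC

def resource_bins(usage_pct: float) -> list:
    """One-hot encode a resource usage percentage into five 20% bins."""
    idx = min(int(usage_pct // 20), 4)  # 0-20, 20-40, 40-60, 60-80, 80-100
    return [1 if i == idx else 0 for i in range(5)]

def system_features(support_abnormal, app_abnormal, scaled, cpu_pct):
    # Support/app anomaly flags, microservice scaling flag, CPU bin one-hot;
    # memory/IO bins and call-time flags would be appended the same way.
    return [int(support_abnormal), int(app_abnormal),
            int(scaled)] + resource_bins(cpu_pct)

X = np.array([system_features(0, 0, 0, 35.0),
              system_features(1, 0, 0, 55.0),
              system_features(1, 1, 1, 95.0),
              system_features(0, 1, 1, 85.0)])
y = np.array([0, 0, 1, 1])  # whole-system abnormal labels from step 4

clf = SVC().fit(X, y)
print(clf.predict([system_features(1, 1, 0, 90.0)]))  # e.g. [1] -> abnormal
```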
An embodiment of the present disclosure further provides a system anomaly detection and processing device. Figure 12 is a block diagram of a system anomaly detection and processing device according to another embodiment of the present disclosure; as shown in Figure 12, it includes:
a first acquisition module 122, configured to acquire real-time data of multiple subsystems in a system within a preset time period;
a first classification module 124, configured to classify the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems;
a first anomaly detection module 126, configured to perform anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems;
a second anomaly detection module 128, configured to perform anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
In one embodiment, the first classification module 124 is further configured to
classify the real-time logs of the multiple subsystems by log source into operating system logs, basic service logs, and application logs.
In one embodiment, the first anomaly detection module 126 is further configured to
perform the following operations on the real-time logs of each of the multiple subsystems to obtain the detection results of the multiple subsystems, where the real-time log currently being processed is called the current log:
when the current log is an operating system log or a basic service log, determine the detection result of the current log through key fields of the current log;
when the log is an application log, input the current log into a pre-trained classification detection model, to obtain the detection result of the current log output by the classification detection model.
In one embodiment, the device further includes:
a determination module, configured to determine that the current log is a structured log;
a vectorization processing module, configured to vectorize the current log to obtain a log vector;
an aggregation module, configured to aggregate the log vectors according to the key fields of the log vectors, to obtain multiple call chains of the current log.
In one embodiment, the above determination module is further configured to
judge whether the log is an unstructured log or a semi-structured log;
when the log is an unstructured or semi-structured log, convert the log into a structured log.
In one embodiment, the second anomaly detection module 128 is further configured to
input the detection results of the multiple subsystems and the real-time data of the multiple subsystems into a pre-trained target anomaly detection model, to obtain the target anomaly detection result of the system output by the target anomaly detection model.
In one embodiment, the above device further includes:
a second acquisition module, configured to acquire historical data of a predetermined number of the multiple subsystems and the corresponding system anomaly detection results;
a second classification module, configured to classify the historical logs in the historical data of the predetermined number of subsystems respectively, to obtain classification results of the historical logs of the multiple subsystems;
a third anomaly detection module, configured to perform anomaly detection on the historical logs according to the anomaly detection modes corresponding to the classification results, to obtain the detection results of the predetermined number of subsystems;
a training module, configured to train an initial anomaly detection model according to the detection results of the predetermined number of subsystems, the data of the predetermined number of subsystems, and the corresponding system anomaly detection results, to obtain the trained target anomaly detection model.
In one embodiment, the above training module is further configured to
train the initial anomaly detection model using the detection results of the predetermined number of subsystems, the historical data of the predetermined number of subsystems, and the corresponding system anomaly detection results, to obtain the target anomaly detection model, where the detection results of the predetermined number of subsystems and the historical data of the predetermined number of subsystems are the input of the initial anomaly detection model, and the target anomaly detection result of the system output by the trained target anomaly detection model and the actual corresponding system anomaly detection result satisfy a preset objective function.
In one embodiment, the device further includes:
a root-cause localization module, configured to perform fault root-cause localization for the anomaly according to the detection results of the multiple subsystems when the system anomaly detection result indicates that an anomaly exists.
An embodiment of the present disclosure further provides a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the above method embodiments when run.
In one exemplary embodiment, the above computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing a computer program.
An embodiment of the present disclosure further provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
In one exemplary embodiment, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations, which are not repeated here.
Obviously, those skilled in the art should understand that each module or step of the present disclosure described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices; they may be implemented in program code executable by a computing device, and thus may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be executed in an order different from that here, or they may be made into individual integrated-circuit modules, or multiple modules or steps among them may be made into a single integrated-circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present disclosure and are not intended to limit it; for those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the principles of the present disclosure shall be included within its scope of protection.

Claims (12)

  1. A system anomaly detection and processing method, comprising:
    acquiring real-time data of multiple subsystems in a system within a preset time period;
    classifying the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems;
    performing anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems;
    performing anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
  2. The method according to claim 1, wherein classifying the real-time logs in the real-time data of the multiple subsystems respectively to obtain the classification results of the real-time logs of the multiple subsystems comprises:
    classifying the real-time logs of the multiple subsystems by log source into: operating system logs, basic service logs, and application logs.
  3. The method according to claim 2, wherein performing anomaly detection on the real-time logs according to the anomaly detection mode corresponding to each classification result to obtain the detection results of the multiple subsystems comprises:
    performing the following operations on the real-time logs of each of the multiple subsystems to obtain the detection results of the multiple subsystems, wherein the real-time log currently being processed is called a current log:
    when the current log is the operating system log or the basic service log, determining a detection result of the current log through key fields of the current log;
    when the log is the application log, inputting the current log into a pre-trained classification detection model, to obtain the detection result of the current log output by the classification detection model.
  4. The method according to claim 3, wherein before the current log is input into the pre-trained classification detection model to obtain the detection result of the current log output by the classification detection model, the method further comprises:
    determining that the current log is a structured log;
    vectorizing the current log to obtain a log vector;
    aggregating the log vectors according to key fields of the log vectors, to obtain multiple call chains of the current log.
  5. The method according to claim 4, wherein determining that the log is a structured log comprises:
    judging whether the log is an unstructured log or a semi-structured log;
    when the log is the unstructured log or the semi-structured log, converting the log into a structured log.
  6. The method according to claim 1, wherein performing anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems comprises:
    inputting the detection results of the multiple subsystems and the real-time data of the multiple subsystems into a pre-trained target anomaly detection model, to obtain a target anomaly detection result of the system output by the target anomaly detection model.
  7. The method according to claim 6, wherein before anomaly detection processing is performed on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems, the method further comprises:
    acquiring historical data of a predetermined number of the multiple subsystems and corresponding system anomaly detection results;
    classifying historical logs in the historical data of the predetermined number of the multiple subsystems respectively, to obtain classification results of the historical logs of the multiple subsystems;
    performing anomaly detection on the historical logs according to the anomaly detection modes corresponding to the classification results, to obtain the detection results of the predetermined number of the multiple subsystems;
    training an initial anomaly detection model according to the detection results of the predetermined number of the multiple subsystems, the data of the predetermined number of the multiple subsystems, and the corresponding system anomaly detection results, to obtain the trained target anomaly detection model.
  8. The method according to claim 7, wherein training the initial anomaly detection model according to the detection results of the predetermined number of the multiple subsystems, the data of the predetermined number of the multiple subsystems, and the corresponding system anomaly detection results to obtain the trained target anomaly detection model comprises:
    training the initial anomaly detection model using the detection results of the predetermined number of the multiple subsystems, the historical data of the predetermined number of the multiple subsystems, and the corresponding system anomaly detection results, to obtain the target anomaly detection model, wherein the detection results of the predetermined number of the multiple subsystems and the historical data of the predetermined number of the multiple subsystems are the input of the initial anomaly detection model, and the target anomaly detection result of the system output by the trained target anomaly detection model and the actual corresponding system anomaly detection result satisfy a preset objective function.
  9. The method according to any one of claims 1 to 8, wherein after anomaly detection processing is performed on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems, the method further comprises:
    if the system anomaly detection result indicates that an anomaly exists, performing fault root-cause localization for the anomaly according to the detection results of the multiple subsystems.
  10. A system anomaly detection and processing device, comprising:
    a first acquisition module, configured to acquire real-time data of multiple subsystems in a system within a preset time period;
    a first classification module, configured to classify the real-time logs in the real-time data of the multiple subsystems respectively, to obtain classification results of the real-time logs of the multiple subsystems;
    a first anomaly detection module, configured to perform anomaly detection on the logs according to the anomaly detection mode corresponding to each classification result, to obtain detection results of the multiple subsystems;
    a second anomaly detection module, configured to perform anomaly detection processing on the system according to the detection results of the multiple subsystems and the real-time data of the multiple subsystems.
  11. A computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the method of any one of claims 1 to 9 when run.
  12. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method of any one of claims 1 to 9.
PCT/CN2022/104378 2021-09-29 2022-07-07 System anomaly detection and processing method and device WO2023050967A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111152914.6 2021-09-29
CN202111152914.6A CN115905417A (zh) 2021-09-29 System anomaly detection and processing method and device

Publications (1)

Publication Number Publication Date
WO2023050967A1 true WO2023050967A1 (zh) 2023-04-06

Family

ID=85729435

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104378 WO2023050967A1 (zh) 2021-09-29 2022-07-07 System anomaly detection and processing method and device

Country Status (2)

Country Link
CN (1) CN115905417A (zh)
WO (1) WO2023050967A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210512A (zh) * 2019-04-19 2019-09-06 北京亿阳信通科技有限公司 一种自动化日志异常检测方法及系统
CN110502412A (zh) * 2019-07-01 2019-11-26 无锡天脉聚源传媒科技有限公司 一种服务器日志处理方法、系统、装置及存储介质
CN112364285A (zh) * 2020-11-23 2021-02-12 北京八分量信息科技有限公司 基于ueba建立异常侦测模型的方法、装置及相关产品
US20210271582A1 (en) * 2018-06-28 2021-09-02 Zte Corporation Operation and maintenance system and method

Also Published As

Publication number Publication date
CN115905417A (zh) 2023-04-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE