WO2021213247A1 - Anomaly detection method and device - Google Patents

Anomaly detection method and device

Info

Publication number
WO2021213247A1
Authority
WO
WIPO (PCT)
Prior art keywords
abnormal
kpis
time point
kpi
matrix
Prior art date
Application number
PCT/CN2021/087603
Other languages
English (en)
French (fr)
Inventor
胡永昌
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021213247A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/08 Testing, supervising or monitoring using real traffic

Definitions

  • This application relates to the field of communication technology, and in particular to an abnormality detection method and device.
  • network change operations are very frequent, involving many network elements.
  • network change operations can include upgrades, cutovers, capacity expansions, etc.
  • the core network involves many network elements.
  • 5G 5th Generation
  • a suitable anomaly detection function is required to ensure that network anomalies are discovered as soon as possible within a period of time after the network change operation (the observation period), so that maintenance can be performed to ensure the success of the network change, or losses can be stopped in time (for example, by a rollback operation).
  • the existing anomaly detection algorithms all support daily network scenarios, and there is no anomaly detection method specifically suitable for network change scenarios.
  • the present application provides an anomaly detection method and device, which are used to propose an anomaly detection method suitable for network change scenarios, so as to realize faster detection of network anomalies in network change scenarios.
  • this application provides an anomaly detection method.
  • the method includes: determining a first matrix according to first values of multiple key performance indicators (KPIs) of a first service and a first neural network model, where the first matrix includes the differences between the predicted values of the multiple KPIs at N time points and the first values of the multiple KPIs; determining, according to the first matrix, the abnormal results respectively corresponding to the N time points, where the abnormal result corresponding to any time point indicates whether the first service is abnormal at that time point; then determining, according to the abnormal results and the first matrix, the abnormality degree of the multiple KPIs at each abnormal time point; and determining the abnormality type of the first service at each abnormal time point according to the abnormality degrees of the multiple KPIs at that time point.
  • the predicted values of the multiple KPIs are obtained based on the first neural network model; the first neural network model is determined based on historical values of the multiple KPIs of the first service; the first service is any one of multiple services; N is an integer greater than or equal to 1; and the abnormality degree of any KPI at each abnormal time point is the percentage of the difference corresponding to that KPI in the sum of the differences corresponding to the multiple KPIs.
  • the problem of KPI zero drop and distortion in the network change scenario can be solved, and the network abnormality can be found quickly in the network change scenario, so that the operation and maintenance personnel can find the network abnormality during the change period as soon as possible and stop the loss in time.
  • the multiple KPIs are classified by service to obtain the KPIs corresponding to each of multiple services, and the KPIs corresponding to any one of those services are selected as the multiple KPIs of the first service.
  • in this way, anomaly detection for different services does not interfere with one another, and the granularity of anomaly detection can be controlled at the service level, which facilitates anomaly localization and makes it more accurate.
  • the first matrix is determined according to the first values of the multiple KPIs of the first service and the first neural network model. The specific method may be: generating a second matrix based on the first values of the multiple KPIs, where the second matrix includes the second values of the multiple KPIs at N time points in each of the M acquisition windows before the current acquisition window; inputting the second matrix into the first neural network model to obtain the predicted values of the multiple KPIs at N time points; and finally determining the differences between the predicted values of the multiple KPIs at N time points and the first values of the multiple KPIs to generate the first matrix, where M is an integer greater than or equal to 1.
  • the first matrix can be accurately obtained, that is, the residuals between the actual data and the predicted data of the multiple KPIs can be obtained, so as to accurately locate the abnormality in the follow-up.
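  • The flow from second matrix to first matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: `placeholder_model` merely averages the M past windows and stands in for the trained first neural network model, the residual sign (predicted minus observed) is an assumption, and all function names and data values are hypothetical.

```python
import numpy as np

def build_second_matrix(history, m_windows, n_points):
    """Stack the last M acquisition windows into an array of shape (M, N, K)."""
    history = np.asarray(history, dtype=float)
    needed = m_windows * n_points
    if history.shape[0] < needed:
        raise ValueError("not enough history for M windows of N points")
    return history[-needed:].reshape(m_windows, n_points, history.shape[1])

def placeholder_model(second_matrix):
    """Stand-in predictor (assumption): average the M windows point by point."""
    return second_matrix.mean(axis=0)  # shape (N, K)

def residual_matrix(predicted, first_values):
    """First matrix: predicted values minus observed first values, shape (N, K)."""
    return np.asarray(predicted, dtype=float) - np.asarray(first_values, dtype=float)

# Hypothetical data: K = 2 KPIs, N = 3 points per window, M = 2 past windows.
history = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                    [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
# Current acquisition window; both KPIs drop to zero at the second time point.
current = np.array([[2.0, 20.0], [0.0, 0.0], [4.0, 40.0]])

second = build_second_matrix(history, m_windows=2, n_points=3)
first_matrix = residual_matrix(placeholder_model(second), current)
```

  • In this toy example only the row for the second time point of `first_matrix` is non-zero, which is the kind of residual the later thresholding step operates on.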
  • the abnormal results corresponding to the N time points are determined according to the first matrix. The specific method may be: determining a KPI comprehensive abnormal value for each time point based on the differences corresponding to the multiple KPIs at that time point in the first matrix; and determining whether the first service is abnormal at each time point from the KPI comprehensive abnormal value for that time point and a first threshold. When the KPI comprehensive abnormal value at a time point is greater than the first threshold, the first service is determined to be abnormal at that time point; when it is less than or equal to the first threshold, the first service is determined not to be abnormal at that time point. The first threshold is the maximum of a second threshold and a third threshold; the second threshold is obtained based on non-abnormal historical values of the multiple KPIs of the first service; and the third threshold is determined based on the first values of the multiple KPIs at the N time points and a preset abnormality percentage.
  • to determine the third threshold, the specific method may be: sorting the KPI comprehensive abnormal values corresponding to the N time points from large to small; determining, according to the preset abnormality percentage, the target KPI comprehensive abnormal value at the position corresponding to that percentage in the sorted values; and using the target KPI comprehensive abnormal value as the third threshold. The KPI comprehensive abnormal value corresponding to each time point is determined from the differences corresponding to the multiple KPIs at that time point in the first matrix, which is in turn obtained from the first values of the multiple KPIs at the N time points.
  • the third threshold value can be accurately obtained by the above method, so that the first threshold value can be determined by flexibly combining the third threshold value and the second threshold value when the abnormality judgment at each time point is performed, so as to suppress false alarms.
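  • The threshold logic above can be sketched as follows. This is a hedged illustration only: the figure list mentions Iforest, but this section does not say how the comprehensive abnormal value is computed, so `composite_scores` swaps in an L2 norm of each residual row as a placeholder. The rank-to-index rule in `third_threshold`, the function names, and the sample values are likewise assumptions.

```python
import math
import numpy as np

def composite_scores(first_matrix):
    """Placeholder composite score (assumption): L2 norm of each row of residuals."""
    return np.linalg.norm(np.asarray(first_matrix, dtype=float), axis=1)

def third_threshold(scores, abnormal_percent):
    """Sort scores from large to small and pick the one at the preset percentage."""
    ranked = np.sort(np.asarray(scores, dtype=float))[::-1]
    idx = math.ceil(len(ranked) * abnormal_percent) - 1
    idx = min(len(ranked) - 1, max(0, idx))  # clamp to a valid position
    return float(ranked[idx])

def detect(scores, second_thr, abnormal_percent):
    """Flag time points whose score exceeds max(second threshold, third threshold)."""
    first_thr = max(float(second_thr), third_threshold(scores, abnormal_percent))
    return np.asarray(scores, dtype=float) > first_thr, first_thr

# Hypothetical scores for N = 8 time points; second_thr would come from
# non-abnormal history and is simply given here.
scores = np.array([0.1, 0.2, 5.0, 0.3, 0.2, 0.1, 0.4, 0.2])
flags, first_thr = detect(scores, second_thr=1.0, abnormal_percent=0.25)
```

  • Taking the maximum of the two thresholds means that a quiet window, where the percentage-based third threshold would be very low, is still judged against the history-based second threshold, which is how the text describes suppressing false alarms.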
  • the abnormality type of the first service is determined according to the abnormality degrees of the multiple KPIs at each abnormal time point. The specific method may be: sorting the abnormality degrees of the multiple KPIs at each abnormal time point from high to low; and taking the abnormality types corresponding to the first H KPIs in the sorted order at each abnormal time point as the abnormality type of the first service at that time point, where H is an integer greater than or equal to 1.
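  • A minimal sketch of the abnormality degree and top-H ranking described above. The use of absolute residuals is an assumption (the text only says "difference"), the KPI names are hypothetical, and the mapping from a KPI to an abnormality type is not specified in this section, so the sketch stops at the ranked KPIs.

```python
import numpy as np

def abnormality_degrees(residual_row):
    """Share of each KPI's absolute residual in the sum over all KPIs."""
    abs_res = np.abs(np.asarray(residual_row, dtype=float))
    total = abs_res.sum()
    if total == 0:
        return np.zeros_like(abs_res)  # no residual at all: nothing to attribute
    return abs_res / total

def top_h_kpis(residual_row, kpi_names, h):
    """Return the H KPIs with the highest abnormality degree, highest first."""
    degrees = abnormality_degrees(residual_row)
    order = np.argsort(-degrees, kind="stable")[:h]
    return [(kpi_names[i], float(degrees[i])) for i in order]

# Hypothetical residuals of three KPIs at one abnormal time point.
names = ["register_success_rate", "register_attempts", "register_error_code_count"]
row = [1.0, -6.0, 3.0]
ranked = top_h_kpis(row, names, h=2)
```

  • Here `register_attempts` accounts for 60% of the total residual and `register_error_code_count` for 30%, so with H = 2 those two KPIs would drive the abnormality type.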
  • the present application also provides an abnormality detection device, which has the function of realizing the above-mentioned first aspect or each possible design example of the first aspect.
  • the function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules or units corresponding to the above-mentioned functions.
  • the structure of the abnormality detection device may include multiple processing units, such as a first processing unit, a second processing unit, and a third processing unit. These units can perform the first aspect or the first aspect described above.
  • the structure of the anomaly detection device includes a memory and a processor, and the processor is configured to support the anomaly detection device to execute the foregoing first aspect or various possible design examples of the first aspect.
  • the memory is coupled with the processor, and stores the program instructions and data necessary for the abnormality detection device.
  • an embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores program instructions, and when the program instructions run on a computer, the computer executes the method of the first aspect of the embodiments of the present application and any of its possible designs.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer.
  • computer-readable media may include non-transitory computer-readable media, random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the embodiments of the present application provide a computer program product including computer program code or instructions, which, when run on a computer, enables the computer to implement any one of the possible design methods in the first aspect.
  • the present application also provides a chip, which is coupled with a memory and is used to read and execute program instructions stored in the memory to implement the method of any one of the possible designs in the first aspect.
  • Figure 1 is a schematic diagram of a 5G network change scenario provided by this application.
  • FIG. 2 is a flowchart of an abnormality detection method provided by this application.
  • FIG. 3 is a schematic diagram of a KPI grouping provided by this application.
  • FIG. 4 is a schematic diagram of a KPI according to business classification provided by this application.
  • FIG. 5 is a schematic diagram of offline training and online detection provided by this application.
  • FIG. 6 is a schematic diagram of a traditional neural network provided by this application.
  • FIG. 7 is a schematic diagram of a generative neural network provided by this application.
  • FIG. 8 is a schematic diagram of a recurrent neural network provided by this application.
  • FIG. 9 is a schematic diagram of an LSTM network provided by this application.
  • Fig. 10 is a schematic diagram of a composite LSTM neural network provided by this application.
  • FIG. 11 is a schematic diagram of the relationship between the current time point and the previous n time points provided by this application.
  • FIG. 12 is a schematic diagram of data in a current time window and all data in previous n consecutive time windows provided by this application;
  • FIG. 13 is a schematic diagram of an input and output view of a neural network provided by this application.
  • FIG. 14 is a schematic diagram of the input and output of a Composite LSTM neural network provided by this application.
  • FIG. 15 is an example of a composite LSTM network information view provided by this application.
  • FIG. 16 is a schematic diagram of data before and after a network change operation provided by this application.
  • Figure 17 is a schematic diagram of the effect of using Composite LSTM to deal with the KPI zero drop and deformation problems in network change scenarios provided by this application;
  • Figure 18 is a test effect diagram based on multiple types of KPIs provided by this application.
  • FIG. 19 is a schematic diagram of a flow after obtaining the first matrix provided by this application.
  • Figure 20 is a schematic diagram of the principle of Iforest provided by this application.
  • FIG. 21 is a schematic diagram of a threshold provided by this application.
  • FIG. 22 is a schematic diagram of training data, detection data, and the KPI comprehensive abnormal value at each time point provided by this application.
  • FIG. 23 is a schematic diagram of a ranking of the abnormality degrees of multiple KPIs at abnormal time points provided by this application.
  • FIG. 24 is a schematic diagram of the flow of an abnormality detection method provided by this application.
  • FIG. 25 is a schematic structural diagram of an abnormality detection device provided by this application.
  • FIG. 26 is a structural diagram of an abnormality detection device provided by this application.
  • the embodiments of the present application provide an anomaly detection method and device, which are used to propose an anomaly detection method suitable for network change scenarios, so as to realize faster detection of network anomalies in network change scenarios.
  • the method and device described in the present application are based on the same technical concept. Since the method and the device solve the problem on similar principles, the implementations of the device and the method can be referred to each other, and repeated content is not described again.
  • the key performance indicator (KPI) in the network change scenario has the following characteristics: (1) The success rate and business volume-related KPIs drop to zero at the time of the change operation. Because network change operations (such as network element upgrades) have operations such as resetting or kicking out users, the success rate and business volume-related KPIs are dropped to zero. (2) KPIs related to business volume climb slowly after falling to zero. KPIs are deformed at this stage. For example, after operations such as resetting or kicking out a user, the user will gradually access the network (for example, the kicked out user will rejoin the upgraded network element), and the business volume will slowly climb.
  • the existing anomaly detection algorithms all support daily monitoring scenarios (that is, scenarios where there is no KPI zero drop and climbing deformation), and currently there is no anomaly detection algorithm specifically suitable for network change scenarios. Due to the zero-drop and deformation characteristics of KPIs in the changed scenario, the existing anomaly detection algorithms in daily monitoring scenarios will generate a large number of false positives and false negatives.
  • single-indicator anomaly detection is usually performed, but in practice observing a single indicator is rarely sufficient for detecting service anomalies.
  • for example, KPIs occasionally jitter, and single-indicator anomaly detection then raises false alarms. As another example, a local small-scale anomaly may affect a KPI, but because the system is resilient and robust, the service quickly returns to normal and no anomaly needs to be reported; single-indicator anomaly detection again raises a false alarm.
  • as a further example, the system is sometimes abnormal but this is not clearly reflected in the observed KPI, or is reflected only after a delay, so single-indicator anomaly detection misses the anomaly or reports it late.
  • L1 layer, main indicators: success rate category, etc.
  • L2 layer, secondary indicators: number of attempts, etc.
  • L3 layer, negative indicators: failure error code category, pointing to specific failure reasons.
  • traditional single-indicator detection generally monitors only the key L1 main indicators, but because the user base is relatively large, anomalies are not obvious in them. Single-indicator anomaly detection therefore suffers from the missed, false, and delayed reports described above.
  • this application proposes an anomaly detection method suitable for network change scenarios to solve the problems of KPI zero drop and deformation in the network change scenario, and at the same time to address the defects of single-indicator anomaly detection mentioned above, so that network anomalies are found sooner in the network change scenario.
  • At least one refers to one or more, and “multiple” refers to two or more than two.
  • “And/or” describes the association relationship of the associated object, indicating that there can be three relationships, for example, A and/or B, which can mean: A alone exists, A and B exist at the same time, and B exists alone, where A, B can be singular or plural.
  • the character “/” generally indicates that the associated objects before and after are in an “or” relationship.
  • "at least one of the following (items)" or a similar expression refers to any combination of these items, including any combination of a single item or a plurality of items.
  • for example, at least one of a, b, or c can mean: a, b, c, a and b, a and c, b and c, or a, b and c, where each of a, b, and c can be single or multiple.
  • the anomaly detection method provided in the embodiments of the present application is applicable to a network where there is a network change operation, for example, a 5G network or a future communication network, such as 6G.
  • the anomaly detection method is applied to many network elements in the network.
  • the network change operation (upgrade, cutover, capacity expansion, etc.) involves network elements included in the core network, such as the home subscriber server (HSS), the unified service node (USN), the unified policy and charging controller (UPCC), the general voice service server (advanced telephony server, ATS), the call session control function (CSCF) module, and the unified gateway.
  • the 5G network change scenario shown in FIG. 1 may also include other network elements, which will not be shown here.
  • the names of the network elements in Figure 1 are just examples. In future communications, such as 6G, they may be called by other names, or the network elements involved in this application may be replaced by other entities or devices with the same functions, which is not limited in this application. This is not repeated in the following description.
  • the anomaly detection method provided by the embodiments of the present application can be applied to the network elements shown or not shown in FIG. 1 above, and can also be applied to chips or chipsets involved in the network elements.
  • the anomaly detection method provided by the present application is described by taking the execution subject as the anomaly detection device as an example, and the specific process of the method may include:
  • Step 201: The anomaly detection device determines a first matrix according to the first values of the multiple KPIs of the first service and the first neural network model. The first matrix includes the differences between the predicted values of the multiple KPIs at N time points and the first values of the multiple KPIs; the predicted values of the multiple KPIs are obtained based on the first neural network model; the first neural network model is determined based on the historical values of the multiple KPIs of the first service; the first service is any one of multiple services; and N is an integer greater than or equal to 1.
  • the multiple KPIs are related indicators of the first business, that is, related to the first business.
  • the multiple KPIs include all indicators of the first service, for example, L1 layer main indicators (success rate category, etc.), L2 layer secondary indicators (number of attempts, etc.), L3 layer negative indicators (failure error code category, pointing to the specific reason for the failure), and so on.
  • before the anomaly detection device determines the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model, it needs to group the KPIs by service. Specifically, the anomaly detection device classifies multiple KPIs by service to obtain the KPIs corresponding to each of multiple services, and selects the KPIs corresponding to any one of those services as the multiple KPIs of the first service. It should be understood that any one of the multiple services can be used as the first service; that is, the anomaly detection process of the first service in this application is representative of the anomaly detection process of every service, each of which can follow the same process, and anomaly detection for different services is independent.
  • the KPI corresponding to each service may exist in the form of a KPI grouping table, and the multiple services may include registration-related services (for example, call services), access-related services, and so on.
  • in the offline training process of the first neural network model, historical KPIs are grouped (classified) by service to obtain a KPI grouping table of multiple services, as shown in (a) in Figure 3; in the online detection (that is, real-time anomaly detection) process, real-time KPIs are grouped by service to obtain a KPI grouping table of multiple services, for example, as shown in (b) in Figure 3.
  • the KPIs of different services are separated from each other, so that the anomaly detection tasks of different services do not interfere with each other. If all KPIs of the entire network were put together for detection, service-level anomalies could easily be drowned out unless the entire network were abnormal. In addition, the granularity of anomaly detection can be controlled at the service level in this way, which facilitates anomaly localization. From an algorithm perspective, different KPIs serve as features in the anomaly detection task, and too many KPIs trigger the "curse of dimensionality".
  • the "curse of dimensionality" refers to the fact that, as the number of feature dimensions increases, more irrelevant feature dimensions are introduced, which degrades the performance of data analysis or anomaly detection; that is, measures that are effective in a low-dimensional feature space (such as Euclidean distance) are significantly weakened in a high-dimensional feature space. Therefore, by separating KPIs by service, the number of KPIs can be controlled and algorithm performance is better preserved.
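  • The service-based grouping can be sketched as a simple lookup from KPI to service. The KPI names and service labels below are hypothetical and chosen for illustration only; the application's KPI grouping table is shown in its Figure 3, which is not reproduced here.

```python
from collections import defaultdict

def group_kpis_by_service(kpi_to_service):
    """Return {service: [kpi, ...]} so each service can be detected independently."""
    groups = defaultdict(list)
    for kpi, service in kpi_to_service.items():
        groups[service].append(kpi)
    return dict(groups)

# Hypothetical KPI names and service labels.
kpi_to_service = {
    "register_success_rate": "registration",      # L1 main indicator
    "register_attempts": "registration",          # L2 secondary indicator
    "register_error_code_count": "registration",  # L3 negative indicator
    "call_success_rate": "call",
    "call_attempts": "call",
}
kpi_groups = group_kpis_by_service(kpi_to_service)
```

  • Each entry of `kpi_groups` would then get its own model and its own detection run, which keeps the feature count per model small.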
  • traditional single-indicator anomaly detection often monitors only the main indicator.
  • in the early stage of a network anomaly, the anomaly is often not obvious in the main indicator but already shows in the secondary or negative indicators.
  • in this application, the scope of indicator monitoring within the same service is expanded (to include secondary indicators, negative indicators, etc.), so that anomalies can be detected earlier.
  • Figure 4 shows a schematic diagram of KPIs classified by business.
  • the CSCF network element is used as an example for illustration, and the KPI is classified according to the business in the training process or the real-time anomaly detection process.
  • the first neural network model is determined based on the historical values of the multiple KPIs of the first service; that is, the first neural network model is obtained by training on the historical values of the multiple KPIs of the first service. This is the offline training process of the first neural network model, for example, the offline training process shown in FIG. 5.
  • the process of determining the first matrix through the first values of the multiple KPIs of the first service and the first neural network model may be an online detection process as shown in FIG. 5.
  • the characteristics of multiple KPIs are learned through the first neural network model; during detection, multiple KPIs are predicted at the same time, and the predicted values of the multiple KPIs are compared with the actual values (here, the first values) to calculate the first matrix of the multiple KPIs.
  • the first matrix in this application may be referred to as a residual matrix.
  • the first neural network model may be a composite long short-term memory (LSTM) neural network.
  • the composite LSTM neural network may be constructed by combining the Encoder-Decoder framework and the LSTM recurrent neural network, and combines the characteristics of the Encoder-Decoder framework and the LSTM recurrent neural network.
  • DNN deep neural network
  • CNN convolutional neural network
  • this application may use a generative neural network, that is, a neural network trained so that its output reconstructs its input, such as the generative neural network shown in FIG. 7.
  • the number of neurons in the hidden layer is generally less than that of the input/output layer, so the input data is compressed in the hidden layer and the main features are effectively extracted.
  • autoencoder based on Encoder-decoder framework, etc.
  • this application can use a recurrent neural network, that is, the hidden layer can be updated over time, such as the recurrent neural network shown in FIG. 8.
  • the above-mentioned generative neural network is an unsupervised mode.
  • the recurrent neural network is a supervised form.
  • the most common recurrent neural network is a recurrent neural network (RNN).
  • this application chooses a generative neural network when building the first neural network model and borrows the Encoder-Decoder framework.
  • because KPIs have obvious time-series characteristics, a recurrent neural network is suitable, and the LSTM network, which has a longer temporal memory than the RNN, can also be borrowed.
  • the LSTM network realizes the relevance of memory for a longer time by determining which information is forgotten or stored, as shown in the schematic diagram of the LSTM network in Figure 9.
  • the composite LSTM neural network built by this application integrates the Encoder-Decoder framework and the LSTM network.
  • the composite LSTM neural network may be as shown in Figure 10, which may specifically include:
  • Reconstruction part: assists the neural network in automatically extracting KPI waveform and correlation features;
  • Prediction part: performs multi-KPI prediction based on the extracted features, from which the multi-KPI residual matrix is finally calculated;
  • Prediction (preparation) part: does not incorporate the Encoder-Decoder framework, and outputs a prediction and a multi-KPI residual matrix that do not take KPI correlation into account. The multi-KPI residual matrix can additionally be used as input by other algorithms; for requirements where the residuals of different KPIs should not affect each other (correlation disregarded), such as incident aggregation and grading positioning, this output can be used instead.
  • the core goal of the Composite LSTM neural network is to predict multiple KPIs based on the learned KPI features (waveform + correlation), and then compare the predicted and actual values of the multiple KPIs to generate a residual matrix of the multiple KPIs.
  • the traditional forecasting method will establish the relationship between time points, that is, all the data between the current time point and the previous n time points, as shown in Figure 11.
  • because KPIs in the network change scenario have the characteristics of zero drop and deformation, a more robust prediction method needs to be designed.
  • the Composite LSTM neural network itself is already more robust than traditional methods.
  • the present application improves the prediction mechanism to establish the relationship between time windows, that is, between the data in the current time window and all the data in the previous n consecutive time windows, as shown in FIG. 12. The traditional method relates the current point to the previous n points, whereas the method of this application relates the data of the current window to the data of the previous n windows. Because the redundancy is larger, the influence of a zero drop is diluted, so the method of this application can substantially reduce the impact of a KPI dropping to zero at a single time point.
  • the prediction mechanism based on time window is applied to the prediction of multiple KPIs, and the input and output views of the neural network can be obtained, as shown in FIG. 13.
  • the relationship between the data in the (Lookback) b time windows {..., Ti-b} and the data in the (Lookforward) f time windows {Ti, ..., Ti+f}, where Ti is the current time window.
  • the neural network needs to be improved accordingly.
  • the input and output of the Composite LSTM neural network can be converted into tensor form, as shown in Figure 14.
  • Figure 14 shows a three-dimensional matrix (sample, time, feature), where samples correspond to time points in a time window, time corresponds to different time windows, and features correspond to different KPIs.
  • Lookforward is generally set to 1, and lookback can be adjusted as needed.
  • Lookforward can also be set to other values, which is not limited in this application. For ease of description, this application only takes Lookforward set to 1 as an example for description.
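  • As an illustration of the (sample, time, feature) layout described above, the following minimal Python sketch builds one input/output pair from a flat KPI series. The function name `build_window_tensors`, the nested-list representation, and the indexing convention (window w covers time points w*N to w*N + N - 1) are assumptions made for this example and are not taken from this application.

```python
from typing import List, Tuple

Tensor3D = List[List[List[float]]]  # indexed as [sample][time][feature]


def build_window_tensors(
    series: List[List[float]],  # series[t][k]: value of KPI k at time point t
    points_per_window: int,     # N time points in each time window
    current_window: int,        # index i of the current window Ti
    lookback: int,              # b previous windows used as input
    lookforward: int = 1,       # f windows to predict (1 in the description)
) -> Tuple[Tensor3D, Tensor3D]:
    """Return (inputs, targets), both laid out as (sample, time, feature)."""
    if current_window < lookback:
        raise ValueError("not enough history for the requested lookback")
    last_needed = (current_window + lookforward) * points_per_window
    if last_needed > len(series):
        raise ValueError("series too short for the requested lookforward")

    def window_slice(first_window: int, count: int) -> Tensor3D:
        # sample axis: position inside a window; time axis: which window
        return [
            [
                list(series[(first_window + w) * points_per_window + s])
                for w in range(count)
            ]
            for s in range(points_per_window)
        ]

    inputs = window_slice(current_window - lookback, lookback)
    targets = window_slice(current_window, lookforward)
    return inputs, targets


# Toy data: 4 windows of 3 points each, 2 KPIs; each value encodes (t, k).
toy_series = [[float(10 * t + k) for k in range(2)] for t in range(12)]
x, y = build_window_tensors(toy_series, points_per_window=3,
                            current_window=3, lookback=2)
```

  • In this toy case `x` has shape (3, 2, 2) and `y` has shape (3, 1, 2): three positions per window, two lookback windows (or one predicted window), and two KPIs.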
  • an example of a composite LSTM network information view obtained based on the above method may be as shown in FIG. 15.
  • Figure 15 also shows the input part, reconstruction part and prediction part of Composite LSTM.
  • the abnormality detection device determines the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model.
  • the specific method may be: the abnormality detection device generates a second matrix based on the first values of the multiple KPIs, and inputs the second matrix into the first neural network model to obtain the predicted values of the multiple KPIs at N time points; the abnormality detection device then determines the differences between the predicted values of the multiple KPIs and the first values of the multiple KPIs at the N time points, and generates the first matrix. The second matrix includes the second values of the multiple KPIs at N time points in each of the M acquisition windows before the current acquisition window (that is, the current time window); M is an integer greater than or equal to 1.
  • the second matrix is input in the form of the input of the Composite LSTM neural network in Figure 14, and the predicted values of the multiple KPIs at N time points are obtained and output in the form of the output of the Composite LSTM neural network in Figure 14.
  • data preprocessing may be performed on the second matrix to complete operations such as normalization and data format adaptation.
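  • The preprocessing and residual steps can be sketched as follows. This is a minimal sketch under stated assumptions: min-max scaling stands in for the unspecified normalization, the residual is taken as the absolute difference between predicted and actual values, and the names `min_max_normalize` and `residual_matrix` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows are time points, columns are KPIs


def min_max_normalize(matrix: Matrix) -> Matrix:
    """Scale every KPI column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0
         for v, lo, span in zip(row, lows, spans)]
        for row in matrix
    ]


def residual_matrix(predicted: Matrix, actual: Matrix) -> Matrix:
    """Element-wise |predicted - actual| for N time points x A KPIs."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must cover the same time points")
    return [
        [abs(p - a) for p, a in zip(p_row, a_row)]
        for p_row, a_row in zip(predicted, actual)
    ]


# Made-up values: KPI 0 drops to zero at the last time point, KPI 1 is flat.
actual = [[10.0, 0.5], [12.0, 0.5], [0.0, 0.5]]
predicted = [[10.5, 0.5], [11.0, 0.5], [11.5, 0.5]]
residuals = residual_matrix(predicted, actual)
normalized = min_max_normalize(actual)
```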
  • the training data (that is, the historical values of the multiple KPIs) generally consists of at least 3 days of normal data preceding the network change operation.
  • typically, the normal data of a full 7-day week is used, for example, as shown in Figure 16.
  • training is robust and has a certain fault tolerance, so a small number of abnormal data points is allowed in the training data. However, extensive, long-lasting data anomalies will seriously degrade training.
  • the network change operation starts, that is, the abnormality detection device starts to perform the abnormality detection process, as shown in FIG. 16, for example.
  • anomaly detection is performed on the historical data together with the latest data, that is, on the data from at least 3 days before the change (usually a full 7-day week) up to the time point of the latest data. Whenever new data arrives and anomaly detection completes, only the result at the time point of the latest data is reported. It should be noted that detection results within one hour after the network change operation are generally unreliable and are therefore not reported, for the following reason: KPIs may drop to zero after the network change operation, causing false alarms; the typical sampling frequency is one point every 15 minutes.
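  • The reporting rule above can be sketched as a small filter. The function name `result_to_report`, the half-open interval used for "within one hour", and the example timestamp are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

SUPPRESSION_WINDOW = timedelta(hours=1)  # "within one hour" in the text


def result_to_report(latest_time: datetime,
                     latest_is_abnormal: bool,
                     change_time: datetime) -> Optional[bool]:
    """Return the flag for the newest point, or None while it is suppressed."""
    # Assumption: the unreliable period is [change_time, change_time + 1 h).
    if change_time <= latest_time < change_time + SUPPRESSION_WINDOW:
        return None
    return latest_is_abnormal


change = datetime(2021, 1, 1, 2, 0)  # arbitrary example change time
# With 15-minute sampling, the points at +0, +15, +30 and +45 minutes are held back.
reports = [
    result_to_report(change + timedelta(minutes=15 * i), True, change)
    for i in range(6)
]
```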
  • Figure 17 shows a schematic diagram of the effect of using Composite LSTM to deal with the KPI zero drop and deformation problems in a network change scenario.
  • Figure 17 only tests the KPI of a single indicator, showing the processing effect after the network change.
  • the abnormality caused can be manually suppressed because the time point is known.
  • comparing the forecast with the actual data, it can be seen that the zero drop at the time of the network change does not affect the forecast of subsequent data.
  • the data can still be predicted very well, and the residual value remains low.
  • Composite LSTM is more sensitive to sudden changes than climbing, and has better robustness to slight deformation (such as climbing after zero drop).
  • FIG. 18 shows a test effect diagram based on multiple types of KPIs.
  • multiple different types of KPIs do not need to be classified and can be processed by the Composite LSTM algorithm at the same time.
  • Composite LSTM predicts multiple KPIs at the same time based on the learned waveform and correlation characteristics, and the deviation between actual data and predicted data is the residual value.
  • the two lines identifying actual data and predicted data basically overlap, and the bottom line identifies the residual value corresponding to the KPI.
  • the line corresponding to the residual value shows the square of the residual value.
  • when the correlation among multiple KPIs (for example, rising and falling together) is broken, the residual value also rises.
  • the second (from top to bottom) periodic KPI does not rise and fall at the same time as the other KPIs.
  • its correlation is broken, so it has a higher residual value.
  • when a large share of the KPIs have high residual values, this probably points to a service abnormality; when only one or a few indicators have high residuals, the KPIs are probably jittering occasionally, and the false alarm is suppressed after comprehensive judgment (single-indicator anomaly detection would raise a false alarm in this case).
  • the Composite LSTM neural network outputs a multi-KPI residual matrix for the subsequent abnormality judgment; the matrix can also be used as input to other algorithms (projects currently using it include anomaly detection based on lifelong learning and incident aggregation and grading).
  • Step 202: The abnormality detection device determines, according to the first matrix, the abnormal results respectively corresponding to the N time points, where the abnormal result corresponding to any time point indicates whether the first service is abnormal at that time point.
  • the first matrix (that is, the residual matrix; the differences in the first matrix are the residual values) may be processed based on the isolation forest (Iforest) algorithm to obtain a 1/0 abnormal result for each time point, for example, as shown in Figure 19.
  • "1" indicates that the first service is abnormal (or not abnormal) at the corresponding time point
  • "0" indicates that the first service is not abnormal (or abnormal) at the corresponding time point.
  • the residual matrix (that is, the first matrix) can be regarded as multi-dimensional data (samples, features), with different KPIs as features, and different time points as samples. Because the residuals of multiple KPIs have no obvious timing characteristics, the timing characteristics may not be considered. Therefore, the multi-index KPI abnormality detection can be understood as: based on the residuals (features) of the multi-KPI, the time point (sample) of the abnormality is comprehensively judged.
  • Iforest is a more appropriate algorithm.
  • the principle of Iforest is illustrated in the schematic diagram of Fig. 20: a binary search tree is constructed, and each time the feature space is divided, a split value is chosen at random between the maximum value and the minimum value. Global anomalies are separated and isolated earlier when the feature space is cut, so a global anomaly point lies closer to the root node of the tree and has a shallower depth. Local abnormal points, however, are as difficult to separate as normal points; they lie far from the root node and have a greater depth. Iforest constructs multiple trees to form a forest and uses the aggregated depth of a point across the trees as its outlier score.
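  • The depth argument above can be illustrated with a toy isolation procedure. This is a simplified sketch, not the implementation used in this application or in scikit-learn: it draws fresh random splits for each queried point instead of building shared trees, and it does no subsampling or depth normalization. The names `path_length` and `mean_depth` and the data are hypothetical.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def path_length(point: Point, data: List[Point], rng: random.Random,
                depth: int = 0, max_depth: int = 12) -> int:
    """Depth at which `point` is isolated by random axis-aligned splits."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:
        return depth  # simplification: stop if the chosen feature cannot split
    split = rng.uniform(low, high)
    # Keep only the side of the split that contains `point`.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_depth(point: Point, data: List[Point],
               n_trees: int = 200, seed: int = 0) -> float:
    """Average isolation depth of `point` over n_trees random partitions."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# 20 tightly clustered residual rows plus one global anomaly (made-up data).
normal = [[0.10 + 0.01 * i, 0.20 + 0.005 * i] for i in range(20)]
outlier = [5.0, 4.0]
rows = normal + [outlier]
outlier_depth = mean_depth(outlier, rows)
inlier_depth = mean_depth(normal[10], rows)
```

  • On this data the anomalous row is usually isolated by the first split, so its mean depth is much smaller than that of a row inside the cluster.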
  • Iforest calculates the threshold based on prior knowledge (abnormal percentage), which means that some samples (time points) will always be detected as abnormal.
  • algorithm code shown below: self._threshold_ = np.percentile(self.decision_function(X), 100. * self._contamination)
  • the Iforest code in the machine learning algorithm library (Scikit-learn) calculates the threshold based on the contamination parameter and the integrated residual value (decision_function(X) returns the residual vector).
  • That is, it assumes that a given percentage of the detected data is contaminated (that is, abnormal) and then calculates a threshold from this percentage. Therefore, even during a normal network change operation, some samples are mistakenly detected as abnormal, which causes false alarms.
  • the threshold calculated by Iforest is T_iforest
  • k is the parameter for calculating the lower limit of the threshold
  • k controls the value of the lower limit of the threshold.
  • the improved threshold can be calculated as: Max(T_iforest, Q3+k*IQR).
  • the application programming interface (API) of the improved Iforest algorithm
  • contamination controls the percentage of abnormalities; n_estimators is the number of trees to be constructed; split_time is the time point of the network change operation, which is used to select the training data; k controls the lower limit of the threshold: the larger k is, the higher the lower limit.
  • the threshold calculated by the traditional Iforest causes many false alarms, whereas the improved threshold is relatively higher and suppresses those false alarms.
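  • The improved threshold Max(T_iforest, Q3+k*IQR) can be sketched in a few lines. The sketch assumes that Q3 and IQR are the third quartile and the interquartile range of the comprehensive abnormal values computed on non-abnormal historical data, and it uses k = 1.5 (the conventional box-plot value) as a default; the function name `improved_threshold` and the sample scores are hypothetical.

```python
from statistics import quantiles
from typing import Sequence


def improved_threshold(t_iforest: float,
                       normal_scores: Sequence[float],
                       k: float = 1.5) -> float:
    """Return max(T_iforest, Q3 + k * IQR)."""
    # Quartiles of the scores observed on normal (pre-change) data.
    q1, _, q3 = quantiles(normal_scores, n=4, method="inclusive")
    lower_limit = q3 + k * (q3 - q1)  # Q3 + k * IQR
    return max(t_iforest, lower_limit)


history = [1, 2, 3, 4, 5, 6, 7, 8]           # made-up normal scores
raised = improved_threshold(0.5, history)     # lower limit wins
unchanged = improved_threshold(20.0, history) # Iforest threshold wins
```

  • Raising k raises the lower limit, which matches the description of the k parameter: a larger k suppresses more of the alarms that the contamination-based threshold alone would produce.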
  • the abnormality detection device determines the abnormal results corresponding to the N time points according to the first matrix.
  • the specific method may be (that is, the processing method based on the Iforest algorithm): the abnormality detection device determines, based on the differences corresponding to the multiple KPIs at each time point in the first matrix, a KPI comprehensive abnormal value corresponding to each time point, and determines whether the first service is abnormal at each time point by comparing the KPI comprehensive abnormal value corresponding to that time point with the first threshold. When the KPI comprehensive abnormal value at a time point is greater than the first threshold, the abnormality detection device determines that the first service is abnormal at that time point; when the KPI comprehensive abnormal value at a time point is less than or equal to the first threshold, the abnormality detection device determines that the first service is not abnormal at that time point. The first threshold is the maximum of the second threshold and the third threshold; the second threshold is obtained based on the non-abnormal historical values of the multiple KPIs of the first service.
  • the first threshold is the above-mentioned improved threshold
  • the second threshold is the above Q3+k*IQR
  • the third threshold is T_iforest.
  • the third threshold is determined based on the first value of the multiple KPIs at the N time points and a preset abnormality percentage.
  • the specific method is: the abnormality detection device sorts the N KPI comprehensive abnormal values corresponding to the N time points from large to small to obtain the sorted KPI comprehensive abnormal values, where the KPI comprehensive abnormal value corresponding to each time point is determined from the differences corresponding to the multiple KPIs at that time point in the first matrix, which is obtained based on the first values of the multiple KPIs at the N time points; the abnormality detection device then determines, according to the preset abnormality percentage, the target KPI comprehensive abnormal value corresponding to the abnormality percentage among the sorted KPI comprehensive abnormal values, and uses the determined target KPI comprehensive abnormal value as the third threshold.
  • for example, if the comprehensive abnormal values are 1, 2, 3, 4, 5 and the abnormality percentage is 20%, one point is abnormal.
  • the third threshold here is 4, and the points greater than 4 are abnormal.
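  • The worked example above can be reproduced with a short sketch. The rounding rule for the number of abnormal points and the names `third_threshold` and `abnormal_flags` are assumptions; the sketch follows the sorted-list description in the text rather than the interpolating percentile used by scikit-learn.

```python
from typing import List, Sequence


def third_threshold(scores: Sequence[float], abnormal_fraction: float) -> float:
    """Pick the threshold so that about abnormal_fraction of points exceed it."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)          # large to small
    n_abnormal = int(round(len(ranked) * abnormal_fraction))
    n_abnormal = min(n_abnormal, len(ranked) - 1)  # keep a valid index
    # The first value that is NOT counted as abnormal becomes the threshold.
    return ranked[n_abnormal]


def abnormal_flags(scores: Sequence[float], threshold: float) -> List[int]:
    """1 if the comprehensive abnormal value exceeds the threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]


scores = [1, 2, 3, 4, 5]
t3 = third_threshold(scores, 0.20)
flags = abnormal_flags(scores, t3)
```

  • With ties among the comprehensive abnormal values, the number of flagged points can differ from the requested percentage; the first threshold described earlier would then be obtained as the maximum of this value and the second threshold.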
  • Step 203: The abnormality detection device determines, according to the abnormal results and the first matrix, the abnormality degree of each of the multiple KPIs at each abnormal time point, where the abnormality degree of any KPI at an abnormal time point is the percentage that the difference corresponding to that KPI accounts for in the sum of the differences corresponding to the multiple KPIs.
  • the abnormality detection device executes the process of step 203, which may be the KPI abnormality degree calculation process shown in FIG. 19; obtaining the abnormality degree of each KPI at each abnormal time point helps operation and maintenance personnel troubleshoot the problem.
  • the abnormal time point refers to the time point when the first service is abnormal.
  • Step 204: The abnormality detection device determines the abnormality type of the first service at each abnormal time point according to the abnormality degrees of the multiple KPIs at that abnormal time point.
  • the abnormality detection device determines the abnormality type of the first service according to the abnormality degrees of the multiple KPIs at each abnormal time point.
  • the specific method may be: the abnormality detection device sorts the abnormality degrees of the multiple KPIs at each abnormal time point from high to low to obtain the sorted abnormality degrees of the multiple KPIs at each time point, and uses the abnormality type corresponding to the abnormality degrees of the first H KPIs at each abnormal time point as the abnormality type of the first service at that abnormal time point; H is an integer greater than or equal to 1.
  • the abnormality type of the first service is determined, that is, the abnormality is further located.
  • sorting by abnormality degree makes it easier for operation and maintenance personnel to determine the abnormality type of the service. For example, if L3-layer indicators such as the number of authentication failures rank high in abnormality degree, the anomaly probably points to the authentication process.
  • the abnormality detection device may directly output the abnormality degree ranking results of multiple KPIs, as shown in FIG. 19.
  • the abnormality detection device calculates the KPI abnormality degree ranking at each abnormal time point
  • the residual values of A KPIs at the abnormal time point t are denoted {x1, x2, ..., xA}
  • the abnormality degrees (percentages) of the A KPIs are {x1/Σxi, x2/Σxi, ..., xA/Σxi}
  • Σxi is the sum of all residual values at the current abnormal time point t; finally, the algorithm sorts the above KPI abnormality degrees and outputs them.
  • the sorting at the corresponding abnormal time points can be as shown in Figure 23. Among them, only the top 4 KPIs at 3 abnormal time points are shown in the figure. Based on this, it can be determined that this abnormal fault is mainly related to indicators and services related to the ATS T side call-through rate.
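  • The abnormality degree calculation and top-H ranking of steps 203 and 204 can be sketched as follows. The sketch assumes non-negative residual values; the function name `abnormality_ranking` and the KPI names are hypothetical and only serve the example.

```python
from typing import Dict, List, Tuple


def abnormality_ranking(residuals: Dict[str, float],
                        top_h: int = 1) -> List[Tuple[str, float]]:
    """Return the top-H KPIs by share x_j / sum(x_i) at one abnormal time point."""
    total = sum(residuals.values())
    if total <= 0:
        return []  # no residual mass to attribute
    degrees = {name: value / total for name, value in residuals.items()}
    ranked = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_h]


# Residual values of three KPIs at one abnormal time point (made-up numbers).
at_time_t = {"auth_failures": 6.0, "call_setup_delay": 3.0, "throughput": 1.0}
top2 = abnormality_ranking(at_time_t, top_h=2)
```

  • Here `top2` lists `auth_failures` first with a share of 0.6, followed by `call_setup_delay` with 0.3; mapping the leading KPIs to an abnormality type is left to the operator or to a lookup table that this sketch does not model.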
  • the flow of the abnormality detection method of the present application may be as shown in FIG. 24, which may include an offline training process and an online real-time detection process.
  • multiple KPIs are combined for anomaly detection to solve the false alarms and missed detections of single-indicator anomaly detection.
  • Classifying KPIs by business scenario and performing anomaly detection on each class separately refines the service granularity at which anomalies are detected.
  • the neural network outputs a multi-KPI residual matrix, which can be used for the abnormality judgment of this application, and can also be used for other algorithm input (for example, incident aggregation and grading).
  • this application uses the improved Iforest algorithm to output comprehensive abnormality judgments based on time points.
  • the KPI abnormality calculation module is added to output multiple KPI abnormality rankings.
  • Using the anomaly detection method provided by the embodiments of the present application can solve the problems of KPI zero drop and deformation in the network change scenario and, at the same time, address the defects of the various single-indicator anomaly detection methods mentioned above, so that network abnormalities are discovered faster in the network change scenario and operation and maintenance personnel can find a network abnormality during the change period as soon as possible and stop the loss in time.
  • an embodiment of the present application also provides an abnormality detection device, which is used to implement the abnormality detection method provided by the embodiment shown in FIG. 2.
  • the abnormality detection device 2500 includes a first processing unit 2501, a second processing unit 2502, and a third processing unit 2503, where:
  • the first processing unit 2501 is configured to determine a first matrix according to the first values of multiple key performance indicators (KPIs) of a first service and a first neural network model, where the first matrix includes the differences between the predicted values of the multiple KPIs at N time points and the first values of the multiple KPIs; the predicted values of the multiple KPIs are obtained based on the first neural network model; the first neural network model is determined based on the historical values of the multiple KPIs of the first service; the first service is any one of multiple services; N is an integer greater than or equal to 1;
  • the second processing unit 2502 is configured to determine, according to the first matrix, the abnormal results respectively corresponding to the N time points, where the abnormal result corresponding to any time point indicates whether the first service is abnormal at that time point;
  • the third processing unit 2503 is configured to determine, according to the abnormal results and the first matrix, the abnormality degree of each of the multiple KPIs at each abnormal time point, where the abnormality degree of any KPI at an abnormal time point is the percentage that the difference corresponding to that KPI accounts for in the sum of the differences corresponding to the multiple KPIs; and to determine the abnormality type of the first service at each abnormal time point according to the abnormality degrees of the multiple KPIs at that abnormal time point.
  • the abnormality detection device 2500 may further include a fourth processing unit, configured to: before the first processing unit determines the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model, classify multiple KPIs by service to obtain the KPIs corresponding to multiple services; and select, from the KPIs corresponding to the multiple services, the KPIs corresponding to any one service as the multiple KPIs of the first service.
  • when the first processing unit 2501 determines the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model, it is specifically configured to: generate a second matrix based on the first values of the multiple KPIs, where the second matrix includes the second values of the multiple KPIs at N time points in each of the M acquisition windows before the current acquisition window; input the second matrix into the first neural network model to obtain the predicted values of the multiple KPIs at N time points; and determine the differences between the predicted values of the multiple KPIs at the N time points and the first values of the multiple KPIs to generate the first matrix; M is an integer greater than or equal to 1.
  • when the second processing unit 2502 determines the abnormal results respectively corresponding to the N time points according to the first matrix, it is specifically configured to: determine, based on the differences corresponding to the multiple KPIs at each time point in the first matrix, a KPI comprehensive abnormal value corresponding to each time point; determine whether the first service is abnormal at each time point by comparing the KPI comprehensive abnormal value corresponding to that time point with the first threshold; when the KPI comprehensive abnormal value at a time point is greater than the first threshold, determine that the first service is abnormal at that time point; and when the KPI comprehensive abnormal value at a time point is less than or equal to the first threshold, determine that the first service is not abnormal at that time point; where the first threshold is the maximum of the second threshold and the third threshold; the second threshold is obtained based on the non-abnormal historical values of the multiple KPIs of the first service; and the third threshold is determined based on the first values of the multiple KPIs at the N time points and a preset abnormality percentage.
  • when the second processing unit 2502 determines the third threshold based on the first values of the multiple KPIs at the N time points and the preset abnormality percentage, it is specifically configured to: sort the N KPI comprehensive abnormal values corresponding to the N time points from large to small to obtain the sorted KPI comprehensive abnormal values, where the KPI comprehensive abnormal value corresponding to each time point is determined from the differences corresponding to the multiple KPIs at that time point in the first matrix, which is obtained based on the first values of the multiple KPIs at the N time points; determine, according to the preset abnormality percentage, the target KPI comprehensive abnormal value corresponding to the abnormality percentage among the sorted KPI comprehensive abnormal values; and use the determined target KPI comprehensive abnormal value as the third threshold.
  • when the third processing unit 2503 determines the abnormality type of the first service according to the abnormality degrees of the multiple KPIs at each abnormal time point, it is specifically configured to: sort the abnormality degrees of the multiple KPIs at each abnormal time point from high to low to obtain the sorted abnormality degrees of the multiple KPIs at each time point; and use the abnormality type corresponding to the abnormality degrees of the first H KPIs at each abnormal time point as the abnormality type of the first service at that abnormal time point; H is an integer greater than or equal to 1.
  • Using the anomaly detection device provided by the embodiment of the application can solve the problems of KPI zero drop and deformation in the network change scenario and realize rapid detection of network anomalies in the network change scenario, so that operation and maintenance personnel can find a network abnormality during the change period as soon as possible and stop the loss in time.
  • the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
  • the embodiments of the present application also provide an abnormality detection device, which is used to implement the abnormality detection method shown in FIG. 2.
  • the abnormality detection device 2600 may include: a processor 2601 and a memory 2602, where:
  • the processor 2601 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
  • the processor 2601 may further include a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the processor 2601 and the memory 2602 are connected to each other.
  • the processor 2601 and the memory 2602 are connected to each other through a bus 2603;
  • the bus 2603 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, and so on.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used to represent in FIG. 26, but it does not mean that there is only one bus or one type of bus.
  • the memory 2602 is used to store programs and the like.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory 2602 may include RAM, or may also include non-volatile memory, such as one or more disk memories.
  • the processor 2601 executes the application program stored in the memory 2602 to realize the above-mentioned functions, thereby realizing the function of the abnormality detection device 2600.
  • the processor 2601 is configured to couple with the memory 2602, call program instructions in the memory 2602, and perform the following operations to implement the abnormality detection method provided by the embodiment of the present application:
  • the first matrix is determined according to the first values of multiple key performance indicators (KPIs) of a first service and a first neural network model, where the first matrix includes the differences between the predicted values of the multiple KPIs at N time points and the first values of the multiple KPIs; the predicted values of the multiple KPIs are obtained based on the first neural network model; the first neural network model is determined based on the historical values of the multiple KPIs of the first service; the first service is any one of multiple services; N is an integer greater than or equal to 1;
  • the abnormality degree of each of the multiple KPIs at each abnormal time point is determined according to the abnormal results and the first matrix, where the abnormality degree of any KPI at an abnormal time point is the percentage that the difference corresponding to that KPI accounts for in the sum of the differences corresponding to the multiple KPIs;
  • the abnormality type of the first service at each abnormal time point is determined according to the abnormality degree of the multiple KPIs at each abnormal time point.
  • before determining the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model, the processor 2601 is further configured to: classify multiple KPIs by service to obtain the KPIs corresponding to multiple services; and select the KPIs corresponding to any one of the multiple services as the multiple KPIs of the first service.
  • when the processor 2601 determines the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model, it is specifically configured to: generate a second matrix based on the first values of the multiple KPIs, where the second matrix includes the second values of the multiple KPIs at N time points in each of the M acquisition windows before the current acquisition window; input the second matrix into the first neural network model to obtain the predicted values of the multiple KPIs at N time points; and determine the differences between the predicted values of the multiple KPIs at the N time points and the first values of the multiple KPIs to generate the first matrix; M is an integer greater than or equal to 1.
  • when the processor 2601 determines the abnormal results respectively corresponding to the N time points according to the first matrix, it is specifically configured to: determine, based on the differences corresponding to the multiple KPIs at each time point in the first matrix, a KPI comprehensive abnormal value corresponding to each time point; determine whether the first service is abnormal at each time point by comparing the KPI comprehensive abnormal value corresponding to that time point with the first threshold; when the KPI comprehensive abnormal value at a time point is greater than the first threshold, determine that the first service is abnormal at that time point; and when the KPI comprehensive abnormal value at a time point is less than or equal to the first threshold, determine that the first service is not abnormal at that time point; where the first threshold is the maximum of the second threshold and the third threshold; the second threshold is obtained based on the non-abnormal historical values of the multiple KPIs of the first service; and the third threshold is determined based on the first values of the multiple KPIs at the N time points and the preset abnormality percentage.
  • when the processor 2601 determines the third threshold based on the first values of the multiple KPIs at the N time points and the preset abnormality percentage, the processor 2601 is specifically configured to: sort the N KPI comprehensive abnormal values corresponding to the N time points from large to small to obtain the sorted KPI comprehensive abnormal values, where the KPI comprehensive abnormal value corresponding to each time point is determined from the differences corresponding to the multiple KPIs at that time point in the first matrix, which is obtained based on the first values of the multiple KPIs at the N time points; determine, according to the preset abnormality percentage, the target KPI comprehensive abnormal value corresponding to the abnormality percentage among the sorted KPI comprehensive abnormal values; and use the determined target KPI comprehensive abnormal value as the third threshold.
  • the processor 2601 determines the abnormality type of the first service according to the abnormality degree of the multiple KPIs at each abnormal time point, it is specifically configured to: Sort the abnormalities of multiple KPIs at each abnormal time point from high to low, and obtain the abnormalities of the multiple sorted KPIs at each time point; correspond to the abnormalities of the first H KPIs at each abnormal time point
  • the abnormal type of is used as the abnormal type of the first service at each abnormal time point; H is an integer greater than or equal to 1.
  • Using the anomaly detection device provided by the embodiments of this application can solve the problems of KPI zero-drop and deformation in network change scenarios and allows network anomalies to be detected quickly in such scenarios, so that operation and maintenance personnel can find network anomalies during the change period as soon as possible and stop losses in time.
  • The embodiments of the present application also provide a computer-readable storage medium used to store a computer program; when the computer program is executed by a computer, the computer can implement any of the anomaly detection methods provided by the foregoing method embodiments.
  • The embodiments of the present application also provide a computer program product used to store a computer program; when the computer program is executed by a computer, the computer can implement any of the anomaly detection methods provided by the foregoing method embodiments.
  • An embodiment of the present application also provides a chip, including a processor and a communication interface, where the processor is coupled with a memory and is used to call a program in the memory so that the chip implements any of the anomaly detection methods provided by the foregoing method embodiments.
  • Those skilled in the art should understand that the embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, where the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

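An anomaly detection method and device, used to detect network anomalies quickly in network change scenarios. The method is: determine a first matrix according to first values of multiple KPIs of a first service and a first neural network model, where the first matrix includes the differences between the predicted values of the multiple KPIs and the first values of the multiple KPIs at N time points, and the predicted values of the multiple KPIs are obtained based on the first neural network model; determine, according to the first matrix, the abnormal results corresponding to the N time points, where the abnormal result corresponding to any time point is whether the first service is abnormal at that time point; determine, according to the abnormal results and the first matrix, the abnormality degrees of the multiple KPIs at each abnormal time point, where the abnormality degree of any KPI at an abnormal time point is the percentage of the difference corresponding to that KPI in the sum of the differences corresponding to the multiple KPIs; and determine the abnormality type of the first service at each abnormal time point according to the abnormality degrees of the multiple KPIs at that time point.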

Description

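An anomaly detection method and device
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202010331814.9, entitled "An anomaly detection method and device" (一种异常检测方法及装置), filed with the Chinese Patent Office on April 24, 2020, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of communication technologies, and in particular to an anomaly detection method and device.
Background
During network operation, network change operations are very frequent and involve many network elements. For example, network change operations may include upgrades, cutovers, and capacity expansion. During a network change, many core network elements are involved. In fifth generation (5th Generation, 5G) networks in particular, ensuring that network changes proceed smoothly and safely becomes even more important. Therefore, in network change scenarios, a suitable anomaly detection function is needed so that, within a period of time after the change operation (the observation period), network anomalies are found as early as possible and maintenance is performed, either to ensure that the change succeeds or to stop losses in time (for example, by rolling back).
At present, existing anomaly detection algorithms all support day-to-day network scenarios; there is no anomaly detection method specifically suited to network change scenarios.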
发明内容
本申请提供一种异常检测方法及装置,用以提出一种适合网络变更场景的异常检测方法,来实现在网络变更场景较快发现网络异常。
第一方面,本申请提供了一种异常检测方法,该方法包括:根据第一业务的多个关键绩效指标(key performance indicator,KPI)的第一值和第一神经网络模型确定第一矩阵,所述第一矩阵中包括在N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值;并根据所述第一矩阵确定所述N个时间点分别对应的异常结果,任一个时间点对应的异常结果为所述任一个时间点上所述第一业务是否异常;之后,根据所述异常结果和所述第一矩阵确定每个异常时间点上所述多个KPI的异常度,并根据所述每个异常时间点上所述多个KPI的异常度确定每个异常时间点上所述第一业务的异常类型;其中,所述多个KPI的预测值为基于所述第一神经网络模型得到的;所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定;所述第一业务为多个业务中的任一个业务;N为大于或者等于1的整数;每个异常时间点上任一个KPI的异常度为所述任一个KPI对应的差值占所述多个KPI对应的差值的和值的百分比。
通过上述方法,可以解决网络变更场景下KPI掉零和形变的问题,实现在网络变更场景较快发现网络异常,以便于运维人员可以尽快发现变更期网络异常,及时止损。
在一种可能的设计中,在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵之前,将多个KPI按照业务进行分类,得到多个业务分别对应的KPI,并在多个业务分别对应的KPI中选择任一个业务对应的KPI作为所述第一业务的多个KPI。
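With the above method, the anomaly detection of different services does not interfere with one another, and the granularity of anomaly detection is kept at the service level, which makes anomaly localization easier and more accurate.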
在一种可能的设计中,根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵,具体方法可以为:基于所述多个KPI的第一值生成第二矩阵,所述第二矩阵中包括当前采集窗口之前的M个采集窗口中每个采集窗口中N个时间点上多个KPI的第二值;然后将所述第二矩阵输入所述第一神经网络模型,得到N个时间点上所述多个KPI的预测值;最后确定N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值,生成所述第一矩阵;其中,M为大于或者等于1的整数。
通过上述方法可以准确地得到所述第一矩阵,也即得到所述多个KPI的实际数据与预测数据的残差,以便后续准确地进行异常定位。
在一种可能的设计中,根据所述第一矩阵确定所述N个时间点分别对应的异常结果,具体方法可以为:基于所述第一矩阵中每个时间点的上多个KPI对应的差值确定每个时间点对应的一个KPI综合异常值;将每个时间点对应的KPI综合异常值与第一阈值确定所述每个时间点上所述第一业务是否异常;当一个时间点上的KPI综合异常值大于所述第一阈值时则确定在所述时间点上所述第一业务异常;当一个时间点上的KPI综合异常值小于或等于所述第一阈值时则确定在所述时间点上所述第一业务未异常;其中,所述第一阈值为第二阈值和第三阈值中的最大值;所述第二阈值为基于所述第一业务的所述多个KPI的未异常的历史值得到;所述第三阈值基于所述N个时间点的多个KPI的第一值与预设的异常百分比确定。
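With the above method, it can be accurately determined at which time points the first service is abnormal and at which it is not, so that the specific anomaly can subsequently be located at the abnormal time points.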
在一种可能的设计中,所述第三阈值基于所述N个时间点上的多个KPI的第一值与预设的异常百分比确定,具体方法可以为:将所述N个时间点对应的N个KPI综合异常值从大到小排序,得到排序后的KPI综合异常值;根据所述预设的异常百分比,在所述排序后的KPI综合异常值中确定所述异常百分比对应的目标KPI综合异常值;将确定的所述目标KPI综合异常值作为所述第三阈值;其中,每个时间点对应的综合异常值是基于由所述N个时间点上的多个KPI的第一值得到的所述第一矩阵中的每个时间点的上多个KPI对应的差值确定的。
通过上述方法可以准确地得到第三阈值,以便在进行每个时间点上的异常判断时灵活结合第三阈值和第二阈值确定第一阈值,来抑制误报。
在一种可能的设计中,根据所述每个异常时间点上所述多个KPI的异常度确定所述第一业务的异常类型,具体方法可以为:将每个异常时间点上的多个KPI的异常度从高到低进行排序,得到每个时间点上多个排序后的KPI的异常度;将每个异常时间点上前H个KPI的异常度对应的异常类型作为每个异常时间点上所述第一业务的异常类型;H为大于或者等于1的整数。
通过上述方法,可以帮助运维人员快速定位异常的问题所在。
第二方面,本申请还提供了一种异常检测装置,所述异常检测装置具有实现上述第一方面或第一方面的各个可能的设计示例中的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块或单元。
在一个可能的设计中,所述异常检测装置的结构中可以包括多个处理单元,例如第一处理单元、第二处理单元和第三处理单元等,这些单元可以执行上述第一方面或第一方面的各个可能的设计示例中的相应功能,具体参见方法示例中的详细描述,此处不做赘述。
在一个可能的设计中,所述异常检测装置的结构中包括存储器和处理器,所述处理器被配置为支持所述异常检测装置执行上述第一方面或第一方面的各个可能的设计示例中的相应的功能。所述存储器与所述处理器耦合,其保存所述异常检测装置必要的程序指令和数据。
第三方面,本申请实施例提供的一种计算机可读存储介质,该计算机可读存储介质存储有程序指令,当程序指令在计算机上运行时,使得计算机执行本申请实施例第一方面及其任一可能的设计。示例性的,计算机可读存储介质可以是计算机能够存取的任何可用介质。以此为例但不限于:计算机可读介质可以包括非瞬态计算机可读介质、随机存取存储器(random-access memory,RAM)、只读存储器(read-only memory,ROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)、CD-ROM或其他光盘存储、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质。
第四方面,本申请实施例提供一种包括计算机程序代码或指令的计算机程序产品,当其在计算机上运行时,使得计算机可以实现上述第一方面中的任意一种可能的设计中的方法。
第五方面,本申请还提供了一种芯片,所述芯片与存储器耦合,用于读取并执行所述存储器中存储的程序指令,以实现上述第一方面中的任意一种可能的设计中的方法。
上述第二方面至第五方面中的各个方面以及各个方面可能达到的技术效果请参照上述针对第一方面中的各种可能方案可以达到的技术效果说明,这里不再重复赘述。
附图说明
图1为本申请提供的一种5G网络变更场景的示意图;
图2为本申请提供的一种异常检测方法的流程图;
图3为本申请提供的一种KPI分组的示意图;
图4为本申请提供的一种按照业务分类的KPI的示意图;
图5为本申请提供的一种离线训练和在线检测的示意图;
图6为本申请提供的一种传统神经网络的示意图;
图7为本申请提供的一种生成神经网络的示意图;
图8为本申请提供的一种递归神经网络的示意图;
图9为本申请提供的一种LSTM网络的示意图;
图10为本申请提供的一种composite LSTM神经网络的示意图;
图11为本申请提供的一种当前时间点和之前n个时间点之间的关系的示意图;
图12为本申请提供的一种当前时间窗内的数据和之前n个连续时间窗内的所有数据的示意图;
图13为本申请提供的一种神经网络的输入输出视图的示意图;
图14为本申请提供的一种Composite LSTM神经网络的输入输出的示意图;
图15为本申请提供的一种Composite LSTM的网络信息视图的示例;
图16为本申请提供的一种网络变更操作前后数据的示意图;
图17为本申请提供的一种采用Composite LSTM应对网络变更场景KPI掉零和形变问题的效果示意图;
图18为本申请提供的一种基于多个类型的KPI的测试效果图;
图19为本申请提供的一种得到第一矩阵之后的流程示意图;
图20为本申请提供的一种Iforest的原理示意图;
图21为本申请提供的一种阈值示意图;
图22为本申请提供的一种训练数据,检测数据和每个时间点上的KPI综合异常值的示意图;
图23为本申请提供的一种异常时间点上的多个KPI的异常度的排序的示意图;
图24为本申请提供的一种异常检测方法的流程的示意图;
图25为本申请提供的一种异常检测装置的结构示意图;
图26为本申请提供的一种异常检测装置的结构图。
具体实施方式
下面将结合附图对本申请作进一步地详细描述。
本申请实施例提供一种异常检测方法及装置,用以提出一种适合网络变更场景的异常检测方法,来实现在网络变更场景较快发现网络异常。其中,本申请所述方法和装置基于同一技术构思,由于方法及装置解决问题的原理相似,因此装置与方法的实施可以相互参见,重复之处不再赘述。
在网络运营过程中,相比日常监控场景,网络变更场景下的关键绩效指标(key performance indicator,KPI)有如下特点:(1)成功率和业务量相关KPI在变更操作时刻掉零。因为网络变更操作(例如网元升级)存在复位或者踢出用户等操作,因此导致成功率和业务量相关KPI掉零。(2)业务量相关KPI在掉零之后,缓慢爬坡,此阶段KPI有形变。例如,在复位或者踢出用户等操作之后,用户会开始逐渐接入网络(如被踢出的用户重新加入升级的网元),业务量会缓慢爬坡。
然而现有的异常检测算法均为支撑日常监控场景(即无KPI掉零和爬坡形变的场景),当前没专门适合网络变更场景的异常检测算法。由于变更场景下KPI的掉零和形变的特点,现有日常监控场景的异常检测算法会产生大量误报和漏报。
并且,现有技术中通常为单指标异常检测,然而实际上只观测单指标难以支撑检测业务异常。例如,在业务正常的情况下,KPI会偶尔抖动,单指标异常检测会造成误报;又例如,即使出现局部小范围异常,KPI受到影响,但是因为系统韧性较好,鲁棒性较高,业务很快恢复正常,也无需异常上报,然而单指标异常检测误报;又例如,有时系统出现异常,但是在观测的KPI上没有明显体现,或者延迟体现,单指标异常检测会漏报或延迟上报。具体的,同一个业务内,相关的指标包括L1层主指标(成功率类等),L2层副指标(尝试次数等),L3层负向指标(失败错误码类,指向具体失败原因)。传统的单指标一般只监控关键的L1主指标,但是用户基数比较大,显示不明显,存在上述单指标异常检测会漏报、误报或者延迟上报的问题。
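Based on the above problems, this application proposes an anomaly detection method suited to network change scenarios, which solves the problems of KPI zero-drop and deformation in network change scenarios and also addresses the shortcomings of single-indicator anomaly detection described above, so that network anomalies are found quickly in network change scenarios.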
在本申请的描述中,“第一”、“第二”等词汇,仅用于区分描述的目的,而不能理解为指示或暗示相对重要性,也不能理解为指示或暗示顺序。
应理解,本申请实施例中“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A、B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一(项)个”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a、b或c中的至少一项(个),可以表示:a,b,c,a和b,a和c,b和c,或a、b和c,其中a、b、c可以是单个,也可以是多个。
为了更加清晰地描述本申请实施例的技术方案,下面结合附图,对本申请实施例提供的异常检测方法及装置进行详细说明。
本申请实施例提供的异常检测方法适用于存在网络变更操作的网络中,例如,5G网络或者在未来通信的网络,如6G中。具体的,异常检测方法应用于网络中的众多网元,例如图1所示的5G网络变更场景中,网络变更操作升级、割接、扩容等会涉及到核心网包括的网元中的归属用户服务器(home subscriber server,HSS)、统一服务节点(unified service node,USN)、统一策略计费控制器(unified policy and charging controller,UPCC)、通用语音业务服务器(advance telephony server,ATS)、呼叫会话控制功能模块(call session control function,CSCF)、统一网关(unified gateway,UGW)等。需要说明的是,图1所示的5G网络变更场景中,还可以包括其他网元,此处不再一一示出。图1中网元的名称仅仅作为示例,在未来通信中,如6G中,还可以称为其它名称,或者在未来通信中,如6G中,本申请涉及的网元还可以通过其它具有相同功能的实体或者设备等来替代,本申请对此均不作限定。这里做统一说明,后续不再赘述。
本申请实施例提供的异常检测方法可以应用于上述图1中示出或未示出的网元,也可以应用于涉及网元中的芯片或芯片组。参阅图2所示,以执行主体为异常检测装置为例说明本申请提供的异常检测方法,所述方法的具体流程可以包括:
步骤201:异常检测装置根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵,所述第一矩阵中包括在N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值;所述多个KPI的预测值为基于所述第一神经网络模型得到的;所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定;所述第一业务为多个业务中的任一个业务;N为大于或者等于1的整数。
其中,所述多个KPI为所述第一业务的相关指标,也即与所述第一业务相关。所述多个KPI包括所述第一业务的所有指标,例如,L1层主指标(成功率类等),L2层副指标(尝试次数等),L3层负向指标(失败错误码类,指向具体失败原因)等等。
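In an optional implementation, before determining the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model, the anomaly detection device needs to group KPIs by service. Specifically, the anomaly detection device classifies multiple KPIs by service to obtain the KPIs corresponding to each of multiple services, and selects the KPIs corresponding to any one of those services as the multiple KPIs of the first service. It should be understood that any one of the multiple services can serve as the first service; that is, the anomaly detection procedure for the first service in this application is representative of the procedure for all services, all services can be checked according to the procedure for the first service, and the anomaly detection of the different services is mutually independent.
Specifically, the KPIs corresponding to each service may exist in the form of a KPI grouping table, and the multiple services may include registration-related services (for example, call services), access-related services, and so on. In the offline training process of the first neural network model, historical KPIs are grouped (classified) by service to obtain the KPI grouping tables of multiple services, as shown in (a) of Figure 3; in the online detection (that is, real-time anomaly detection) process, real-time KPIs are grouped by service to obtain the KPI grouping tables of multiple services, as shown in (b) of Figure 3.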
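The service-based grouping described here can be sketched as follows. This is a minimal illustration only: the function name `group_kpis_by_service` and the KPI and service names in the usage note are hypothetical and do not come from the application.

```python
from collections import defaultdict


def group_kpis_by_service(kpi_to_service):
    """Build a KPI grouping table: service name -> list of KPI names.

    `kpi_to_service` maps each KPI name to the service it belongs to.
    Both the mapping and the names are illustrative assumptions.
    """
    groups = defaultdict(list)
    for kpi, service in kpi_to_service.items():
        groups[service].append(kpi)
    return dict(groups)
```

For instance, passing `{"register_success_rate": "registration", "register_attempts": "registration", "access_success_rate": "access"}` yields one table entry per service, and each entry can then be fed to its own detection pipeline.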
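With the above method, the KPIs of different services are separated from each other so that the anomaly detection tasks of different services do not interfere with one another. If all KPIs of the whole network were checked together, service-level anomalies would easily be drowned out unless a network-wide anomaly occurred. Separation also keeps the granularity of anomaly detection at the service level, which makes anomaly localization easier. In addition, from an algorithmic point of view, different KPIs act as features in the anomaly detection task. Too many KPIs trigger the "curse of dimensionality": as the feature dimension grows, more irrelevant feature dimensions are introduced and the performance of data analysis or anomaly detection degrades, that is, effects that are visible in a low-dimensional feature space (such as Euclidean distance) weaken noticeably in a high-dimensional feature space. Separating KPIs by service therefore limits the number of KPIs and helps preserve algorithm performance.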
进一步地,传统的单指标异常检测往往只监控主指标。在网络异常出现的初期,在主指标上往往体现不显著,然而在副指标或负向指标上已经显示出来。例如,网络出现异常,失败次数增加,但是由于用户基数较大,成功率指标不会出现明显下降。本申请实施例中在相同业务内扩大指标监控范围(包含副指标,负向指标等),可以更早检测出异常。
例如,图4示出了一种按照业务分类的KPI的示意图。其中,图4中仅以CSCF网元为例说明,在训练过程或实时异常检测过程均按照业务对KPI分类。
其中,所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定,也即所述第一神经网络模型是基于所述第一业务的所述多个KPI的历史值训练得到的,即对第一神经网络模型的离线训练过程,例如图5所示的离线训练过程。其中,通过所述第一业务的所述多个KPI的第一值和第一神经网络模型确定第一矩阵的过程可以为如图5所示的在线检测过程。具体的,通过第一神经网络模型学习多个KPI的特征(波形+相关性),在检测时多KPI同时进行预测,将得到的多个KPI的预测值对比实际值(这里即第一值),计算多KPI的第一矩阵。示例性的,本申请中的第一矩阵可以称之为残差矩阵。
在一种可选的实施方式中,所述第一神经网络模型可以是合成的(composite)长短期记忆神经网络(long short term memory,LSTM)神经网络。其中,所述composite LSTM神经网络可以是结合编码-解码(Encoder-Decoder)框架和LSTM递归神经网络搭建的,融合了Encoder-Decoder框架和LSTM递归神经网络的特点。
具体的,传统神经网络一般为有监督的形式,如深度神经网络(deep neural network,DNN),卷积神经网络(convolutional neural network,CNN)等,即输出需要标签进行训练,例如图6所示的传统神经网络。但实际中,标签往往是很难大量获得的,比如图像识别和异常检测场景等。对此,本申请可以采用生成神经网络,即输出重建输入的形式训练神经网络,如图7所示的生成神经网络。其中隐藏层的神经元数一般少于输入/输出层,因此输入数据在隐藏层进行压缩,并有效提取出主要特征。如基于Encoder-decoder框架的自编码(autoencoder)等。
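When processing time series data such as KPIs, correlation over time needs to be considered. For this, this application can use a recurrent neural network, in which the hidden layer is updated over time, such as the recurrent neural network shown in Figure 8. The generative neural network above is unsupervised, whereas the recurrent neural network here is supervised; the most common one is the standard recurrent neural network (recurrent neural network, RNN).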
在本申请涉及的网络变更场景的异常检测中,往往没有异常标注,因此需要无监督算法。所以本申请在搭建第一神经网络模型时选择生成神经网络,借用Encoder-Decoder框架。因为KPI有明显的时序特征,适用于递归神经网络,同时还可以借用比RNN有更长时序记忆的LSTM网络。LSTM网络通过决定哪些信息遗忘或者储存等功能实现记忆较长时间的相关性,如图9所示的LSTM网络的示意图。基于此,本申请搭建的composite LSTM神经网络融合了Encoder-Decoder框架和LSTM网络,具体的,composite LSTM神经网络可以如图10所示,具体可以包括:
重建部分:辅助神经网络自动提取KPI波形和关联性特征;
预测部分:基于提取的特征进行多KPI预测,最终计算多KPI残差矩阵;
预测(备)部分:没有融合Encoder-decoder框架,用于输出不考虑KPI关联性的预测输出和多KPI残差矩阵。因为多KPI残差矩阵可额外被其他算法作为输入利用。对于希望KPI之间残差互相不影响(不考虑关联性)的需求,如事件(incident)聚合定级定位,可采用此输出替代。
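Further, during real-time anomaly detection, the core goal of the Composite LSTM neural network is to predict multiple KPIs simultaneously based on the learned KPI features (waveform + correlation), and then compare the predicted values of the multiple KPIs with the actual values to generate the multi-KPI residual matrix. Traditional prediction methods establish a relationship between time points, that is, between the current time point and all data in the preceding n time points, as shown in Figure 11. Because KPIs in network change scenarios exhibit zero-drop and deformation, this application needs a more robust prediction method. The Composite LSTM neural network is already more robust than traditional methods. On this basis, this application changes the prediction mechanism to establish a relationship between time windows, that is, between the data in the current time window and all data in the preceding n consecutive time windows, as shown in Figure 12. The traditional method relates the current point to the preceding n points, whereas the method of this application relates the current window of data to the preceding n windows of data; because there is more redundancy, the influence of a single time point at which a KPI drops to zero is reduced significantly.
This application applies the time-window-based prediction mechanism to the prediction of multiple KPIs, which gives the input/output view of the neural network shown in Figure 13. As Figure 13 shows, the multiple KPIs are collected as a single two-dimensional data block (time points, KPIs) by a window of time length L, and a relationship is then established between the data in the preceding Lookback=b time windows {Ti-1, ..., Ti-b} and the data in the following Lookforward=f time windows {Ti, ..., Ti+f}, where Ti is the current time window. This makes the prediction of multiple KPIs more robust and also lets the neural network fully learn the correlations among the multiple KPIs.
In this application, to fit the time-window-based prediction mechanism, the neural network needs corresponding changes. First, the input and output of the Composite LSTM neural network can be converted to tensor form, as shown in Figure 14. Figure 14 shows a three-dimensional array (sample, time, feature), where samples correspond to the time points within a time window, time corresponds to the different time windows, and features correspond to the different KPIs. In this application, Lookforward is generally set to 1 and Lookback can be adjusted as needed. Lookforward can of course be set to other values, which this application does not limit; for ease of description, this application uses Lookforward=1 as the example.
For example, the network information view of a Composite LSTM obtained with the above method may be as shown in Figure 15. In the example of Figure 15, there are 17 KPIs and the sampling frequency is 5 minutes per point; based on the sampling frequency, the data of the current window is related to the windows in the preceding 4 hours, which gives Lookback=12*4=48 and Lookforward=1. All hidden layers are set to 50 neurons, and the hidden layers are connected to the input and output layers by LSTM networks. Figure 15 also shows the input part, reconstruction part, and prediction part of the Composite LSTM.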
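As a rough illustration of the (sample, time, feature) layout, the sketch below builds the lookback input for one current window from a list of per-time-point KPI rows. It is an assumption-laden sketch rather than the application's implementation: the function name, its parameters, and in particular the choice that consecutive windows are shifted by one sampling point (`stride=1`, inferred from Lookback=12*4=48 covering 4 hours at 5 minutes per point) are not stated in the source.

```python
def build_lookback_tensor(series, window_start, window_len, lookback, stride=1):
    """Return a nested list shaped (sample, time, feature).

    series       -- list of rows, one row of KPI values per time point
    window_start -- index of the first time point of the current window Ti
    window_len   -- number of time points L in one window
    lookback     -- number of earlier windows b to include
    stride       -- shift between consecutive windows (assumed to be 1)
    """
    earliest = window_start - lookback * stride
    if earliest < 0 or window_start + window_len > len(series):
        raise ValueError("not enough data for the requested windows")
    tensor = []
    for sample in range(window_len):          # position inside the window
        steps = []
        for back in range(lookback, 0, -1):   # oldest window first
            steps.append(list(series[window_start - back * stride + sample]))
        tensor.append(steps)
    return tensor
```

With 17 KPIs, `lookback=48` and a window of L points, the result has shape (L, 48, 17), which matches the tensor dimensions described for Figure 14 under the stated stride assumption.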
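Based on the above, when this application uses the first neural network model obtained above for real-time anomaly detection, the anomaly detection device determines the first matrix according to the first values of the multiple KPIs of the first service and the first neural network model as follows: the device generates a second matrix based on the first values of the multiple KPIs and inputs the second matrix into the first neural network model to obtain the predicted values of the multiple KPIs at N time points; the device then determines the differences between the predicted values and the first values of the multiple KPIs at the N time points to generate the first matrix. The second matrix includes the second values of the multiple KPIs at the N time points in each of the M acquisition windows preceding the current acquisition window (that is, the current time window), where M is an integer greater than or equal to 1.
The second matrix is input in the form of the Composite LSTM input in Figure 14, and the predicted values of the multiple KPIs at the N time points are output in the form of the Composite LSTM output in Figure 14. In an optional implementation, before the second matrix is input into the first neural network model, it can be preprocessed, for example by normalization and data format adaptation.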
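The last step, forming the first (residual) matrix from predictions and actual values, can be sketched as below. The function name is hypothetical, and the sign convention (predicted minus actual) is an assumption; the source only says "difference".

```python
def residual_matrix(predicted, actual):
    """Element-wise difference between predicted and actual KPI values.

    Both arguments are lists of rows: one row per time point (N rows),
    one column per KPI. Returns a matrix of the same shape.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must cover the same time points")
    matrix = []
    for pred_row, act_row in zip(predicted, actual):
        if len(pred_row) != len(act_row):
            raise ValueError("each time point must have the same KPIs")
        matrix.append([p - a for p, a in zip(pred_row, act_row)])
    return matrix
```

Each row of the result is what the later steps treat as one sample, with the per-KPI residuals as its features.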
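It should be noted that in the training of the first neural network model, the training data (the historical values of the multiple KPIs) generally uses more than 3 days of normal data from before the network change operation; optionally, one week (7 days) of normal data is used, as shown in Figure 16. The training is robust and has some fault tolerance, so a small number of abnormal data points in the training data is acceptable. However, large areas of data that are abnormal for a long time will seriously affect training. In the anomaly detection process, the anomaly detection device starts detection when the network change operation starts, as shown in Figure 16. After the change operation starts, each time a new data point arrives, detection runs over the historical data together with the latest data, that is, over all data from more than 3 days (generally 7 days) before the change up to the latest data time point. Each time new data arrives and detection completes, only the result at the time point of the latest data is reported. Note that detection results within roughly one hour after the change operation are generally unreliable and are therefore not reported, for the following reasons: KPIs may drop to zero after the change operation, causing false alarms; the sampling frequency is typically 15 minutes per point, so only 4 points are sampled within one hour, which is too little data; and data may not be collected for a short time after the change operation, resulting in missing data. When data is missing, zeros are filled in or values are interpolated when reporting data to keep the data continuous in time.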
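The reporting rule in this paragraph (report only the newest time point, and nothing during roughly the first hour after the change) could be expressed as in the sketch below. The function name, the argument names, and the treatment of exactly one hour as reportable are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def should_report(point_time, latest_time, change_time,
                  quiet_period=timedelta(hours=1)):
    """Decide whether the detection result at `point_time` is reported.

    Only the newest data point is reported, and results that fall inside
    the quiet period right after the network change are held back.
    """
    if point_time != latest_time:
        return False  # older points are re-checked but not reported again
    return point_time - change_time >= quiet_period
```

A caller would evaluate this once per arriving data point, after the full detection pass over the history and the new point has finished.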
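Based on the above, Figure 17 shows the effect of using the Composite LSTM to handle KPI zero-drop and deformation in a network change scenario. For ease of presentation, Figure 17 tests only a single KPI and shows the result after the network change. First, the KPI drops to zero at the moment of the change; because the time of this anomaly is known, it can be suppressed manually. Then, comparing predicted and actual data shows that the zero-drop at the moment of the change does not affect the prediction of subsequent data. Even while the KPI is ramping up and deforming, the data is still predicted well and the residual stays low. It can also be observed that the Composite LSTM is more sensitive to abrupt changes than to ramp-up, and is fairly robust to mild deformation (such as ramp-up after a zero-drop).
By way of example, Figure 18 shows test results based on multiple types of KPI. KPIs of different types do not need to be classified and can be processed by the Composite LSTM algorithm at the same time. Based on the learned waveform and correlation features, the Composite LSTM predicts multiple KPIs simultaneously, and the deviation between actual and predicted data is the residual. In the panel for each KPI in Figure 18, the two lines for actual and predicted data largely coincide, and the bottom line shows the residual of that KPI; for readability, the residual line is plotted as the square of the residual. If the correlation among multiple KPIs (such as rising and falling together) is broken, the residual also rises; in Figure 18, the second periodic KPI (from the top) does not rise and fall with the other KPIs, so its correlation is broken and a high residual appears. When a large share of KPIs have high residuals at the same moment, this most likely indicates a service anomaly; when only one or very few indicators have high residuals, it is probably just occasional KPI jitter, and the false alarm is suppressed after the combined judgment (single-indicator anomaly detection would raise a false alarm here). The Composite LSTM neural network outputs the multi-KPI residual matrix for the subsequent anomaly judgment; it can also be used as input to other algorithms (projects already using it include lifelong-learning-based anomaly detection and incident aggregation and grading).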
步骤202:所述异常检测装置根据所述第一矩阵确定所述N个时间点分别对应的异常结果,任一个时间点对应的异常结果为所述任一个时间点上所述第一业务是否异常。
示例性的,所述异常检测装置执行步骤202时,可以基于孤立森林算法(isolation forest,Iforest)对第一矩阵(也即残差矩阵,第一矩阵中的差值也即残差(值))进行处理,得到基于时间点的1/0异常结果,例如图19所示。其中可选的,“1”表示对应的时间点上所述第一业务异常(或未异常),“0”表示对应的时间点上所述第一业务未异常(或异常)。
图19中,残差矩阵(也即第一矩阵)可以看作多维数据(样本,特征),不同KPI为特征,不同时间点为样本。因为多个KPI的残差已经没有明显的时序特征,所以可以不考虑时序特性。因此,对多指标的KPI异常检测的可以理解为:基于多KPI的残差(特征),综合判断异常的时间点(样本)。
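It should be noted that this application uses the Iforest algorithm for the following reasons: when a large number of KPIs are abnormal at the same time, the service is most likely abnormal, so an algorithm suited to detecting global outliers is needed; false alarms caused by jitter in one or a few KPIs need to be suppressed, so an algorithm insensitive to local outliers is needed; and many KPIs are checked at once, so an algorithm suited to high-dimensional data is needed. In summary, Iforest is a relatively suitable algorithm.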
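Specifically, the basic principle of Iforest is illustrated in Figure 20: a binary search tree is built, and each time the feature space is split, a value is chosen at random between the maximum and the minimum as the split point. Global outliers are isolated earlier when the feature space is cut, so in the tree they are close to the root and have shallow depth. Local outliers, like normal points, are hard to isolate; they are far from the root and have greater depth. Iforest builds multiple trees to form a forest and combines the depths of a point across trees into its anomaly value.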
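The isolation idea can be illustrated with a small pure-Python sketch. It is deliberately simplified and is not the Scikit-learn implementation: it uses no subsampling, it omits the usual leaf-size correction term, and all function names and default values (`max_depth=8`, `n_trees=100`) are assumptions chosen for the example.

```python
import random


def build_tree(rows, rng, depth=0, max_depth=8):
    """Split rows on a random feature at a random value; None marks a leaf."""
    if depth >= max_depth or len(rows) <= 1:
        return None
    feature = rng.randrange(len(rows[0]))
    values = [row[feature] for row in rows]
    low, high = min(values), max(values)
    if low == high:
        return None  # this feature cannot separate the remaining rows
    split = rng.uniform(low, high)
    left = [row for row in rows if row[feature] < split]
    right = [row for row in rows if row[feature] >= split]
    return (feature, split,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def path_length(tree, row, depth=0):
    """Depth at which `row` lands in a leaf of `tree`."""
    if tree is None:
        return depth
    feature, split, left, right = tree
    child = left if row[feature] < split else right
    return path_length(child, row, depth + 1)


def build_forest(rows, n_trees=100, seed=0):
    """Build several random trees over the same rows."""
    rng = random.Random(seed)
    return [build_tree(rows, rng) for _ in range(n_trees)]


def average_depth(forest, row):
    """Mean isolation depth over the forest; smaller means more anomalous."""
    return sum(path_length(tree, row) for tree in forest) / len(forest)
```

Applied to a tight cluster of points plus one point far away from it, the distant point is cut off near the root in almost every tree, so its average depth is much smaller than that of a point inside the cluster.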
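However, traditional Iforest has the following problem: it computes the threshold from prior knowledge (the abnormality percentage), which means some samples (time points) will always be detected as abnormal. In the following line of code: self._threshold_=np.percentile(self.decision_function(X),100.*self._contamination), the Iforest code in the machine learning library (Scikit-learn) computes the threshold from the contamination parameter and the composite residual values (decision_function(X) returns the residual vector). In other words, it assumes that a given percentage of the data under test is contaminated (that is, abnormal) and computes a threshold from that percentage. Consequently, even for a normal network change operation, some samples are wrongly detected as abnormal, causing false alarms.
To address this, this application improves the Iforest threshold as follows: since the training data is known to be (or can be guaranteed to be) normal data, a lower bound for the threshold can be computed from the comprehensive abnormal values of the training data to suppress false alarms. The interquartile range (IQR) of the comprehensive abnormal values of the training data can be used. IQR is a relatively robust statistic, similar to the median rather than the mean. Let the first quartile of the training data be Q1 and the third quartile be Q3; the interquartile range is then IQR=Q3-Q1. Let the threshold computed by Iforest be T_iforest and let k be the parameter that controls how high the lower bound of the threshold is. The improved threshold is then: Max(T_iforest, Q3+k*IQR). The application programming interface (API) of the improved Iforest algorithm can be as follows:
to_overall_anomalies_iForest(data=None,contamination=0.1,n_estimators=100,split_time=None,k=5);
where contamination controls the abnormality percentage; n_estimators is the number of trees built; split_time is the time of the network change operation and is used to extract the training data; and k controls the lower bound of the threshold: the larger k is, the higher the lower bound.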
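The improved threshold Max(T_iforest, Q3+k*IQR) can be sketched with the standard library alone. The sketch assumes that a higher comprehensive abnormal value means "more abnormal", which is the convention used when the value is compared with the first threshold in this document; Scikit-learn's decision_function uses the opposite sign. The function name is hypothetical, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from other quartile definitions.

```python
import statistics


def improved_threshold(t_iforest, training_scores, k=5.0):
    """Return max(T_iforest, Q3 + k * IQR).

    t_iforest       -- threshold derived from the contamination percentage
    training_scores -- comprehensive abnormal values of the (normal) training data
    k               -- controls how high the lower bound sits
    """
    q1, _median, q3 = statistics.quantiles(training_scores, n=4)
    iqr = q3 - q1
    lower_bound = q3 + k * iqr
    return max(t_iforest, lower_bound)
```

When the percentage-based threshold falls inside the normal range of the training data, the lower bound takes over and suppresses the false alarms; when it already lies above that range, it is used unchanged.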
示例性的,图21示出的阈值示意图中,可以观察到传统Iforest计算出来的阈值会造成很多误报,改进后的阈值会相对更高,误报也被抑制掉了。
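Based on the above, the anomaly detection device determines the abnormal results respectively corresponding to the N time points according to the first matrix as follows (that is, the processing based on the Iforest algorithm): the device determines one comprehensive KPI abnormal value for each time point based on the differences corresponding to the multiple KPIs at that time point in the first matrix, and compares the comprehensive KPI abnormal value of each time point with a first threshold to determine whether the first service is abnormal at that time point; when the comprehensive KPI abnormal value at a time point is greater than the first threshold, the device determines that the first service is abnormal at that time point; when it is less than or equal to the first threshold, the device determines that the first service is not abnormal at that time point. The first threshold is the larger of a second threshold and a third threshold; the second threshold is obtained based on non-abnormal historical values of the multiple KPIs of the first service; and the third threshold is determined based on the first values of the multiple KPIs at the N time points and a preset abnormality percentage. The non-abnormal historical values of the multiple KPIs of the first service are normal data from before the network change operation.
The first threshold is the improved threshold described above, the second threshold is Q3+k*IQR above, and the third threshold is T_iforest.
In an optional implementation, the third threshold is determined based on the first values of the multiple KPIs at the N time points and the preset abnormality percentage as follows: the anomaly detection device sorts the N comprehensive KPI abnormal values corresponding to the N time points from largest to smallest to obtain the sorted comprehensive KPI abnormal values, where the comprehensive abnormal value of each time point is determined from the differences corresponding to the multiple KPIs at that time point in the first matrix, and the first matrix is obtained from the first values of the multiple KPIs at the N time points; the device then determines, according to the preset abnormality percentage, the target comprehensive KPI abnormal value corresponding to that percentage among the sorted values, and uses it as the third threshold. For example, if the comprehensive abnormal values are 1, 2, 3, 4, 5 and the abnormality percentage is 20%, then 1 point is abnormal, so the third threshold here is 4 and points greater than 4 are abnormal.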
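The worked example (values 1 to 5, 20% abnormal, threshold 4) can be reproduced with the short sketch below. The function names are hypothetical, and taking "the largest value that is not flagged" as the target value is one reading of the text that matches the example; the source does not spell out the indexing rule.

```python
def third_threshold(scores, abnormal_percentage):
    """Pick the threshold so that about `abnormal_percentage` of scores exceed it."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores, reverse=True)           # largest first
    # Small epsilon guards against floating-point round-down (e.g. 0.3 * 10).
    n_abnormal = int(len(ordered) * abnormal_percentage + 1e-9)
    index = min(n_abnormal, len(ordered) - 1)        # stay inside the list
    return ordered[index]


def flag_abnormal(scores, first_threshold):
    """True where the comprehensive abnormal value exceeds the first threshold."""
    return [score > first_threshold for score in scores]
```

For the scores 1, 2, 3, 4, 5 and a percentage of 0.2, `third_threshold` returns 4 and `flag_abnormal` marks only the value 5 as abnormal, as in the example in the text.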
步骤203:所述异常检测装置根据所述异常结果和所述第一矩阵确定每个异常时间点上所述多个KPI的异常度,每个异常时间点上任一个KPI的异常度为所述任一个KPI对应的差值占所述多个KPI对应的差值的和值的百分比。
其中,所述异常检测装置执行步骤203的过程,可以如图19所示的KPI异常度计算过程,这样得到每个异常时间点上的KPI的异常度可以方便运维人员排查问题。具体的,异常时间点是指所述第一业务异常的时间点。
步骤204:所述异常检测装置根据所述每个异常时间点上所述多个KPI的异常度确定每个异常时间点上所述第一业务的异常类型。
具体的,所述异常检测装置根据所述每个异常时间点上所述多个KPI的异常度确定所述第一业务的异常类型,具体方法可以为:所述异常检测装置将每个异常时间点上的多个KPI的异常度从高到低进行排序,得到每个时间点上多个排序后的KPI的异常度,并将每个异常时间点上前H个KPI的异常度对应的异常类型作为每个异常时间点上所述第一业务的异常类型;H为大于或者等于1的整数。
具体的,确定所述第一业务的异常类型,也即进一步定位异常所在。通过异常度排序的方法可以使运维人员更容易确定业务的异常类型。比如,L3层的指标,如鉴权失败次数,异常度排序较高,则本次异常大概率指向鉴权流程。
在一种示例性的实施方式中,在步骤203中,所述异常检测装置进行KPI异常度计算之后,可以直接输出多个KPI的异常度排序结果,如图19所示。
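By way of example, when the anomaly detection device computes the ranking of KPI abnormality degrees at each abnormal time point, let the residuals of the A KPIs at abnormal time point t be {x1,x2,…,xA}; the abnormality degrees (percentages) of the A KPIs are then {x1/∑xi,x2/∑xi,…,xA/∑xi}, where ∑xi is the sum of all residuals at the current abnormal time point t, and the algorithm finally sorts these KPI abnormality degrees and outputs them.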
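The abnormality degree x_j/∑x_i and the top-H ranking used for localization can be sketched as follows. The function names and the KPI names in the usage note are hypothetical; taking absolute values of the residuals is an added assumption so that the shares stay non-negative (the figures in the source plot squared residuals).

```python
def abnormality_degrees(residuals):
    """Share of each KPI's residual in the total residual at one time point.

    `residuals` maps KPI name -> residual at an abnormal time point.
    """
    total = sum(abs(value) for value in residuals.values())
    if total == 0:
        return {name: 0.0 for name in residuals}
    return {name: abs(value) / total for name, value in residuals.items()}


def top_h(degrees, h=1):
    """Return the H KPIs with the highest abnormality degree, highest first."""
    ranked = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    return ranked[:h]
```

For residuals such as `{"auth_failures": 3.0, "success_rate": 1.0}`, the first KPI accounts for 75% of the total and ranks first, which points the operator at the authentication procedure.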
示例性的,基于上述描述,如图22所示的训练数据,检测数据和每个时间点上的KPI综合异常值,可以看到单个/少量KPI的异常在综合异常值上体现不明显,因此被抑制;在网络变更操作之后,出现异常,综合异常值相应变高。进一步地,基于Iforest得到的基于时间点的异常判断可以如下表1所示:
表1
2019/8/13 8:30 FALSE(未异常)
2019/8/13 8:35 FALSE
2019/8/13 8:40 FALSE
2019/8/13 8:45 FALSE
2019/8/13 8:50 FALSE
2019/8/13 8:55 TRUE(异常)
2019/8/13 9:00 TRUE
2019/8/13 9:05 TRUE
2019/8/13 9:10 TRUE
2019/8/13 9:15 TRUE
2019/8/13 9:20 TRUE
经过计算多个KPI的异常度,在相应异常时间点上的排序可以如图23所示。其中,在图中仅示出了3个异常时间点排名前4个的KPI。基于此,可以确定此异常故障主要是在ATS T侧接通率相关指标和业务。
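Based on the above embodiments, in a specific exemplary embodiment, the flow of the anomaly detection method of this application may be as shown in Figure 24 and may include an offline training process and an online real-time detection process. This application performs anomaly detection over multiple KPIs together, which addresses the false alarms and missed alarms of single-indicator anomaly detection. KPIs are classified by service scenario and checked separately, which narrows the service granularity at which anomalies are detected. A deep neural network learns the features (waveform + correlation) of multiple historical KPIs, which addresses KPI deformation in change scenarios. The neural network outputs the multi-KPI residual matrix, which can be used both for the anomaly judgment of this application and as input to other algorithms (for example, incident aggregation and grading). Finally, the improved Iforest algorithm outputs the combined per-time-point anomaly judgment, and, to help operation and maintenance personnel locate problems, a KPI abnormality degree calculation module is added to output the ranking of multi-KPI abnormality degrees.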
采用本申请实施例提供的异常检测方法,可以解决网络变更场景下KPI掉零和形变的问题,同时针对以上提到的各种单指标异常检测存在的缺陷,来实现在网络变更场景较快发现网络异常,以便于运维人员可以尽快发现变更期网络异常,及时止损。
基于上述实施例,本申请实施例还提供了一种异常检测装置,用于实现如图2所示的实施例提供的异常检测方法。参阅图25所示,所述异常检测装置2500中包括第一处理单元2501、第二处理单元2502和第三处理单元2503,其中:
所述第一处理单元2501用于根据第一业务的多个关键绩效指标KPI的第一值和第一神经网络模型确定第一矩阵,所述第一矩阵中包括在N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值;所述多个KPI的预测值为基于所述第一神经网络模型得到的;所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定;所述第一业务为多个业务中的任一个业务;N为大于或者等于1的整数;
所述第二处理单元2502用于根据所述第一矩阵确定所述N个时间点分别对应的异常结果,任一个时间点对应的异常结果为所述任一个时间点上所述第一业务是否异常;
所述第三处理单元2503用于根据所述异常结果和所述第一矩阵确定每个异常时间点上所述多个KPI的异常度,每个异常时间点上任一个KPI的异常度为所述任一个KPI对应的差值占所述多个KPI对应的差值的和值的百分比;以及,根据所述每个异常时间点上所述多个KPI的异常度确定每个异常时间点上所述第一业务的异常类型。
在一种可选的实施方式中,所述异常检测装置2500还可以包括:第四处理单元用于在所述第一处理单元根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵之前,将多个KPI按照业务进行分类,得到多个业务分别对应的KPI;在多个业务分别对应的KPI中选择任一个业务对应的KPI作为所述第一业务的多个KPI。
在一种具体的实施方式中,所述第一处理单元2501在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵时,具体用于:基于所述多个KPI的第一值生成第二矩阵,所述第二矩阵中包括当前采集窗口之前的M个采集窗口中每个采集窗口中N个时间点上多个KPI的第二值;将所述第二矩阵输入所述第一神经网络模型,得到N个时间点上所述多个KPI的预测值;确定N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值,生成所述第一矩阵;M为大于或者等于1的整数。
示例性的,所述第二处理单元2502在根据所述第一矩阵确定所述N个时间点分别对应的异常结果时,具体用于:基于所述第一矩阵中每个时间点的上多个KPI对应的差值确定每个时间点对应的一个KPI综合异常值;将每个时间点对应的KPI综合异常值与第一阈值确定所述每个时间点上所述第一业务是否异常;当一个时间点上的KPI综合异常值大于所述第一阈值时则确定在所述时间点上所述第一业务异常;当一个时间点上的KPI综合异常值小于或等于所述第一阈值时则确定在所述时间点上所述第一业务未异常;其中,所述第一阈值为第二阈值和第三阈值中的最大值;所述第二阈值为基于所述第一业务的所述多个KPI的未异常的历史值得到;所述第三阈值基于所述N个时间点的多个KPI的第一值与预设的异常百分比确定。
具体的,所述第二处理单元2502在基于所述N个时间点上的多个KPI的第一值与预设的异常百分比确定所述第三阈值时,具体用于:将所述N个时间点对应的N个KPI综合异常值从大到小排序,得到排序后的KPI综合异常值;其中,每个时间点对应的综合异常值是基于由所述N个时间点上的多个KPI的第一值得到的所述第一矩阵中的每个时间点的上多个KPI对应的差值确定的;根据所述预设的异常百分比,在所述排序后的KPI综合异常值中确定所述异常百分比对应的目标KPI综合异常值;将确定的所述目标KPI综合异常值作为所述第三阈值。
在一种可选的实施方式中,所述第三处理单元2503在根据所述每个异常时间点上所述多个KPI的异常度确定所述第一业务的异常类型时,具体用于:将每个异常时间点上的多个KPI的异常度从高到低进行排序,得到每个时间点上多个排序后的KPI的异常度;将每个异常时间点上前H个KPI的异常度对应的异常类型作为每个异常时间点上所述第一业务的异常类型;H为大于或者等于1的整数。
采用本申请实施例提供的异常检测装置,可以解决网络变更场景下KPI掉零和形变的问题,实现在网络变更场景较快发现网络异常,以便于运维人员可以尽快发现变更期网络异常,及时止损。
需要说明的是,本申请实施例中对单元的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。在本申请的实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
基于以上实施例,本申请实施例还提供了一种异常检测装置,所述异常检测装置,用于实现图2所示的异常检测方法。参阅图26所示,所述异常检测装置2600可以包括:处理器2601和存储器2602,其中:
所述处理器2601可以是中央处理器(central processing unit,CPU),网络处理器(network processor,NP)或者CPU和NP的组合。所述处理器2601还可以进一步包括硬件芯片。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。
其中,所述处理器2601和所述存储器2602之间相互连接。可选的,所述处理器2601和所述存储器2602通过总线2603相互连接;所述总线2603可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图26中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
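In an optional implementation, the memory 2602 is used to store programs and the like. Specifically, a program may include program code, and the program code includes computer operation instructions. The memory 2602 may include RAM and may also include non-volatile memory, for example one or more disk memories. The processor 2601 executes the application program stored in the memory 2602 to implement the above functions, thereby implementing the functions of the anomaly detection device 2600.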
具体的,所述处理器2601用于与所述存储器2602耦合,调用所述存储器2602中的程序指令,执行以下操作以实现本申请实施例提供的异常检测方法:
根据第一业务的多个关键绩效指标KPI的第一值和第一神经网络模型确定第一矩阵,所述第一矩阵中包括在N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值;所述多个KPI的预测值为基于所述第一神经网络模型得到的;所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定;所述第一业务为多个业务中的任一个业务;N为大于或者等于1的整数;
根据所述第一矩阵确定所述N个时间点分别对应的异常结果,任一个时间点对应的异常结果为所述任一个时间点上所述第一业务是否异常;
根据所述异常结果和所述第一矩阵确定每个异常时间点上所述多个KPI的异常度,每个异常时间点上任一个KPI的异常度为所述任一个KPI对应的差值占所述多个KPI对应的差值的和值的百分比;
根据所述每个异常时间点上所述多个KPI的异常度确定每个异常时间点上所述第一业务的异常类型。
在一种可选的实施方式中,所述处理器2601在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵之前,还用于:将多个KPI按照业务进行分类,得到多个业务分别对应的KPI;在多个业务分别对应的KPI中选择任一个业务对应的KPI作为所述第一业务的多个KPI。
具体的,所述处理器2601在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵时,具体用于:基于所述多个KPI的第一值生成第二矩阵,所述第二矩阵中包括当前采集窗口之前的M个采集窗口中每个采集窗口中N个时间点上多个KPI的第二值;将所述第二矩阵输入所述第一神经网络模型,得到N个时间点上所述多个KPI的预测值;确定N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值,生成所述第一矩阵;M为大于或者等于1的整数。
示例性的,所述处理器2601在根据所述第一矩阵确定所述N个时间点分别对应的异常结果时,具体用于:基于所述第一矩阵中每个时间点的上多个KPI对应的差值确定每个时间点对应的一个KPI综合异常值;将每个时间点对应的KPI综合异常值与第一阈值确定所述每个时间点上所述第一业务是否异常;当一个时间点上的KPI综合异常值大于所述第一阈值时则确定在所述时间点上所述第一业务异常;当一个时间点上的KPI综合异常值小于或等于所述第一阈值时则确定在所述时间点上所述第一业务未异常;其中,所述第一阈值为第二阈值和第三阈值中的最大值;所述第二阈值为基于所述第一业务的所述多个KPI的未异常的历史值得到;所述第三阈值基于所述N个时间点的多个KPI的第一值与预设的异常百分比确定。
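Specifically, when the processor 2601 determines the third threshold based on the first values of the multiple KPIs at the N time points and the preset abnormality percentage, it is specifically configured to: sort the N comprehensive KPI abnormal values corresponding to the N time points from largest to smallest to obtain the sorted comprehensive KPI abnormal values, where the comprehensive abnormal value of each time point is determined from the differences corresponding to the multiple KPIs at that time point in the first matrix, and the first matrix is obtained from the first values of the multiple KPIs at the N time points; determine, according to the preset abnormality percentage, the target comprehensive KPI abnormal value corresponding to that percentage among the sorted values; and use the determined target comprehensive KPI abnormal value as the third threshold.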
在一种可选的实施方式中,所述处理器2601在根据所述每个异常时间点上所述多个KPI的异常度确定所述第一业务的异常类型时,具体用于:将每个异常时间点上的多个KPI的异常度从高到低进行排序,得到每个时间点上多个排序后的KPI的异常度;将每个异常时间点上前H个KPI的异常度对应的异常类型作为每个异常时间点上所述第一业务的异常类型;H为大于或者等于1的整数。
采用本申请实施例提供的异常检测装置,可以解决网络变更场景下KPI掉零和形变的问题,实现在网络变更场景较快发现网络异常,以便于运维人员可以尽快发现变更期网络异常,及时止损。
基于以上实施例,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质用于存储计算机程序,该计算机程序被计算机执行时,所述计算机可以实现上述方法实施例提供的任一种异常检测方法。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品用于存储计算机程序,该计算机程序被计算机执行时,所述计算机可以实现上述方法实施例提供的任一种异常检测方法。
本申请实施例还提供一种芯片,包括处理器和通信接口,所述处理器与存储器耦合,用于调用所述存储器中的程序使得所述芯片实现上述方法实施例提供的任一种异常检测方法。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的保护范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (20)

  1. 一种异常检测方法,其特征在于,应用于网络变更场景,包括:
    根据第一业务的多个关键绩效指标KPI的第一值和第一神经网络模型确定第一矩阵,所述第一矩阵中包括在N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值;所述多个KPI的预测值为基于所述第一神经网络模型得到的;所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定;所述第一业务为多个业务中的任一个业务;N为大于或者等于1的整数;
    根据所述第一矩阵确定所述N个时间点分别对应的异常结果,任一个时间点对应的异常结果为所述任一个时间点上所述第一业务是否异常;
    根据所述异常结果和所述第一矩阵确定每个异常时间点上所述多个KPI的异常度,每个异常时间点上任一个KPI的异常度为所述任一个KPI对应的差值占所述多个KPI对应的差值的和值的百分比;
    根据所述每个异常时间点上所述多个KPI的异常度确定每个异常时间点上所述第一业务的异常类型。
  2. 如权利要求1所述的方法,其特征在于,在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵之前,所述方法还包括:
    将多个KPI按照业务进行分类,得到多个业务分别对应的KPI;
    在多个业务分别对应的KPI中选择任一个业务对应的KPI作为所述第一业务的多个KPI。
  3. 如权利要求1或2所述的方法,其特征在于,根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵,包括:
    基于所述多个KPI的第一值生成第二矩阵,所述第二矩阵中包括当前采集窗口之前的M个采集窗口中每个采集窗口中N个时间点上多个KPI的第二值;M为大于或者等于1的整数;
    将所述第二矩阵输入所述第一神经网络模型,得到N个时间点上所述多个KPI的预测值;
    确定N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值,生成所述第一矩阵。
  4. 如权利要求1-3任一项所述的方法,其特征在于,根据所述第一矩阵确定所述N个时间点分别对应的异常结果,包括:
    基于所述第一矩阵中每个时间点的上多个KPI对应的差值确定每个时间点对应的一个KPI综合异常值;
    将每个时间点对应的KPI综合异常值与第一阈值确定所述每个时间点上所述第一业务是否异常;其中,所述第一阈值为第二阈值和第三阈值中的最大值;所述第二阈值为基于所述第一业务的所述多个KPI的未异常的历史值得到;所述第三阈值基于所述N个时间点的多个KPI的第一值与预设的异常百分比确定;
    当一个时间点上的KPI综合异常值大于所述第一阈值时则确定在所述时间点上所述第一业务异常;
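    when the comprehensive KPI abnormal value at a time point is less than or equal to the first threshold, determining that the first service is not abnormal at that time point.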
  5. 如权利要求4所述的方法,其特征在于,所述第三阈值基于所述N个时间点上的多个KPI的第一值与预设的异常百分比确定,包括:
    将所述N个时间点对应的N个KPI综合异常值从大到小排序,得到排序后的KPI综合异常值;其中,每个时间点对应的综合异常值是基于由所述N个时间点上的多个KPI的第一值得到的所述第一矩阵中的每个时间点的上多个KPI对应的差值确定的;
    根据所述预设的异常百分比,在所述排序后的KPI综合异常值中确定所述异常百分比对应的目标KPI综合异常值;
    将确定的所述目标KPI综合异常值作为所述第三阈值。
  6. 如权利要求1-5任一项所述的方法,其特征在于,根据所述每个异常时间点上所述多个KPI的异常度确定所述第一业务的异常类型,包括:
    将每个异常时间点上的多个KPI的异常度从高到低进行排序,得到每个时间点上多个排序后的KPI的异常度;
    将每个异常时间点上前H个KPI的异常度对应的异常类型作为每个异常时间点上所述第一业务的异常类型;H为大于或者等于1的整数。
  7. 一种异常检测装置,其特征在于,应用于网络变更场景,包括:
    第一处理单元,用于根据第一业务的多个关键绩效指标KPI的第一值和第一神经网络模型确定第一矩阵,所述第一矩阵中包括在N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值;所述多个KPI的预测值为基于所述第一神经网络模型得到的;所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定;所述第一业务为多个业务中的任一个业务;N为大于或者等于1的整数;
    第二处理单元,用于根据所述第一矩阵确定所述N个时间点分别对应的异常结果,任一个时间点对应的异常结果为所述任一个时间点上所述第一业务是否异常;
    第三处理单元,用于根据所述异常结果和所述第一矩阵确定每个异常时间点上所述多个KPI的异常度,每个异常时间点上任一个KPI的异常度为所述任一个KPI对应的差值占所述多个KPI对应的差值的和值的百分比;以及
    根据所述每个异常时间点上所述多个KPI的异常度确定每个异常时间点上所述第一业务的异常类型。
  8. 如权利要求7所述的装置,其特征在于,还包括:
    第四处理单元,用于在所述第一处理单元根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵之前,将多个KPI按照业务进行分类,得到多个业务分别对应的KPI;在多个业务分别对应的KPI中选择任一个业务对应的KPI作为所述第一业务的多个KPI。
  9. 如权利要求7或8所述的装置,其特征在于,所述第一处理单元,在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵时,具体用于:
    基于所述多个KPI的第一值生成第二矩阵,所述第二矩阵中包括当前采集窗口之前的M个采集窗口中每个采集窗口中N个时间点上多个KPI的第二值;M为大于或者等于1的整数;
    将所述第二矩阵输入所述第一神经网络模型,得到N个时间点上所述多个KPI的预测值;
    确定N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值,生成所述第一矩阵。
  10. 如权利要求7-9任一项所述的装置,其特征在于,所述第二处理单元,在根据所述第一矩阵确定所述N个时间点分别对应的异常结果时,具体用于:
    基于所述第一矩阵中每个时间点的上多个KPI对应的差值确定每个时间点对应的一个KPI综合异常值;
    将每个时间点对应的KPI综合异常值与第一阈值确定所述每个时间点上所述第一业务是否异常;其中,所述第一阈值为第二阈值和第三阈值中的最大值;所述第二阈值为基于所述第一业务的所述多个KPI的未异常的历史值得到;所述第三阈值基于所述N个时间点的多个KPI的第一值与预设的异常百分比确定;
    当一个时间点上的KPI综合异常值大于所述第一阈值时则确定在所述时间点上所述第一业务异常;
    当一个时间点上的KPI综合异常值小于或等于所述第一阈值时则确定在所述时间点上所述第一业务未异常。
  11. 如权利要求10所述的装置,其特征在于,所述第二处理单元,在基于所述N个时间点上的多个KPI的第一值与预设的异常百分比确定所述第三阈值时,具体用于:
    将所述N个时间点对应的N个KPI综合异常值从大到小排序,得到排序后的KPI综合异常值;其中,每个时间点对应的综合异常值是基于由所述N个时间点上的多个KPI的第一值得到的所述第一矩阵中的每个时间点的上多个KPI对应的差值确定的;
    根据所述预设的异常百分比,在所述排序后的KPI综合异常值中确定所述异常百分比对应的目标KPI综合异常值;
    将确定的所述目标KPI综合异常值作为所述第三阈值。
  12. 如权利要求7-11任一项所述的装置,其特征在于,所述第三处理单元,在根据所述每个异常时间点上所述多个KPI的异常度确定所述第一业务的异常类型时,具体用于:
    将每个异常时间点上的多个KPI的异常度从高到低进行排序,得到每个时间点上多个排序后的KPI的异常度;
    将每个异常时间点上前H个KPI的异常度对应的异常类型作为每个异常时间点上所述第一业务的异常类型;H为大于或者等于1的整数。
  13. 一种异常检测装置,其特征在于,应用于网络变更场景,包括:
    存储器,用于存储程序指令;
    处理器,用于与所述存储器耦合,调用所述存储器中的程序指令,执行以下操作:
    根据第一业务的多个关键绩效指标KPI的第一值和第一神经网络模型确定第一矩阵,所述第一矩阵中包括在N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值;所述多个KPI的预测值为基于所述第一神经网络模型得到的;所述第一神经网络模型基于所述第一业务的所述多个KPI的历史值确定;所述第一业务为多个业务中的任一个业务;N为大于或者等于1的整数;
    根据所述第一矩阵确定所述N个时间点分别对应的异常结果,任一个时间点对应的异常结果为所述任一个时间点上所述第一业务是否异常;
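    determining, according to the abnormal results and the first matrix, the abnormality degrees of the multiple KPIs at each abnormal time point, where the abnormality degree of any KPI at an abnormal time point is the percentage of the difference corresponding to that KPI in the sum of the differences corresponding to the multiple KPIs;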
    根据所述每个异常时间点上所述多个KPI的异常度确定每个异常时间点上所述第一业务的异常类型。
  14. 如权利要求13所述的装置,其特征在于,所述处理器,在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵之前,还用于:
    将多个KPI按照业务进行分类,得到多个业务分别对应的KPI;
    在多个业务分别对应的KPI中选择任一个业务对应的KPI作为所述第一业务的多个KPI。
  15. 如权利要求13或14所述的装置,其特征在于,所述处理器,在根据第一业务的多个KPI的第一值和第一神经网络模型确定第一矩阵时,具体用于:
    基于所述多个KPI的第一值生成第二矩阵,所述第二矩阵中包括当前采集窗口之前的M个采集窗口中每个采集窗口中N个时间点上多个KPI的第二值;M为大于或者等于1的整数;
    将所述第二矩阵输入所述第一神经网络模型,得到N个时间点上所述多个KPI的预测值;
    确定N个时间点上所述多个KPI的预测值与所述多个KPI的第一值的差值,生成所述第一矩阵。
  16. 如权利要求13-15任一项所述的装置,其特征在于,所述处理器,在根据所述第一矩阵确定所述N个时间点分别对应的异常结果时,具体用于:
    基于所述第一矩阵中每个时间点的上多个KPI对应的差值确定每个时间点对应的一个KPI综合异常值;
    将每个时间点对应的KPI综合异常值与第一阈值确定所述每个时间点上所述第一业务是否异常;其中,所述第一阈值为第二阈值和第三阈值中的最大值;所述第二阈值为基于所述第一业务的所述多个KPI的未异常的历史值得到;所述第三阈值基于所述N个时间点的多个KPI的第一值与预设的异常百分比确定;
    当一个时间点上的KPI综合异常值大于所述第一阈值时则确定在所述时间点上所述第一业务异常;
    当一个时间点上的KPI综合异常值小于或等于所述第一阈值时则确定在所述时间点上所述第一业务未异常。
  17. 如权利要求16所述的装置,其特征在于,所述处理器,在基于所述N个时间点上的多个KPI的第一值与预设的异常百分比确定所述第三阈值时,具体用于:
    将所述N个时间点对应的N个KPI综合异常值从大到小排序,得到排序后的KPI综合异常值;其中,每个时间点对应的综合异常值是基于由所述N个时间点上的多个KPI的第一值得到的所述第一矩阵中的每个时间点的上多个KPI对应的差值确定的;
    根据所述预设的异常百分比,在所述排序后的KPI综合异常值中确定所述异常百分比对应的目标KPI综合异常值;
    将确定的所述目标KPI综合异常值作为所述第三阈值。
  18. 如权利要求13-17任一项所述的装置,其特征在于,所述处理器,在根据所述每个异常时间点上所述多个KPI的异常度确定所述第一业务的异常类型时,具体用于:
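    sorting the abnormality degrees of the multiple KPIs at each abnormal time point from high to low to obtain the sorted abnormality degrees of the multiple KPIs at each time point;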
    将每个异常时间点上前H个KPI的异常度对应的异常类型作为每个异常时间点上所述第一业务的异常类型;H为大于或者等于1的整数。
  19. 一种计算机可读存储介质,其特征在于,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1-6任意一项所述的方法。
  20. 一种包含指令的计算机程序产品,其特征在于,当其在计算机上运行时,使得计算机执行如权利要求1-6任意一项所述的方法。
PCT/CN2021/087603 2020-04-24 2021-04-15 一种异常检测方法及装置 WO2021213247A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010331814.9A CN113556258B (zh) 2020-04-24 2020-04-24 一种异常检测方法及装置
CN202010331814.9 2020-04-24

Publications (1)

Publication Number Publication Date
WO2021213247A1 true WO2021213247A1 (zh) 2021-10-28

Family

ID=78101242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087603 WO2021213247A1 (zh) 2020-04-24 2021-04-15 一种异常检测方法及装置

Country Status (2)

Country Link
CN (1) CN113556258B (zh)
WO (1) WO2021213247A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567542A (zh) * 2022-02-16 2022-05-31 烽火通信科技股份有限公司 Hop-by-hop service detection method, apparatus, device and storage medium for hard-pipe private lines
CN115118631A (zh) * 2022-06-27 2022-09-27 平安银行股份有限公司 Link anomaly handling method, apparatus, electronic device and storage medium
CN115599657A (zh) * 2022-12-15 2023-01-13 浪潮通信信息系统有限公司(Cn) Method for determining anomalies in software facilities
CN116566859A (zh) * 2023-06-28 2023-08-08 广州极电通信技术有限公司 Network anomaly detection method and system for switches
WO2023160406A1 (en) * 2022-02-23 2023-08-31 International Business Machines Corporation Neural network inference quantization
CN117435191A (zh) * 2023-12-20 2024-01-23 杭银消费金融股份有限公司 Program processing method and apparatus based on customization requirements
WO2024042307A1 (en) * 2022-08-24 2024-02-29 Vodafone Group Services Limited Computer implemented methods, systems and program instructions for detecting anomalies in a core network of a telecommunications network
CN117692163A (zh) * 2023-10-31 2024-03-12 青岛文达通科技股份有限公司 Smart city data processing method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328095A (zh) * 2021-12-21 2022-04-12 深圳前海微众银行股份有限公司 Task anomaly alarm method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190059008A1 (en) * 2017-08-18 2019-02-21 T-Mobile Usa, Inc. Data intelligence in fault detection in a wireless communication network
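CN109787846A (zh) * 2019-03-27 2019-05-21 湖北大学 Method and system for monitoring and predicting 5G network quality-of-service anomalies
CN110278121A (zh) * 2018-03-15 2019-09-24 中兴通讯股份有限公司 Method, apparatus, device and storage medium for detecting network performance anomalies
CN110635952A (zh) * 2019-10-14 2019-12-31 中兴通讯股份有限公司 Fault root cause analysis method and system for a communication system, and computer storage medium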

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090281845A1 (en) * 2008-05-06 2009-11-12 International Business Machines Corporation Method and apparatus of constructing and exploring kpi networks
US9961571B2 (en) * 2015-09-24 2018-05-01 Futurewei Technologies, Inc. System and method for a multi view learning approach to anomaly detection and root cause analysis
CN110380888B (zh) * 2019-05-29 2021-02-23 华为技术有限公司 Network anomaly detection method and apparatus
CN110995508B (zh) * 2019-12-23 2022-11-11 中国人民解放军国防科技大学 Adaptive unsupervised online network anomaly detection method based on abrupt KPI changes

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567542A (zh) * 2022-02-16 2022-05-31 烽火通信科技股份有限公司 Hop-by-hop service detection method, apparatus, device and storage medium for hard-pipe private lines
CN114567542B (zh) * 2022-02-16 2023-09-15 烽火通信科技股份有限公司 Hop-by-hop service detection method, apparatus, device and storage medium for hard-pipe private lines
WO2023160406A1 (en) * 2022-02-23 2023-08-31 International Business Machines Corporation Neural network inference quantization
CN115118631B (zh) * 2022-06-27 2023-09-01 平安银行股份有限公司 Link anomaly handling method, apparatus, electronic device and storage medium
CN115118631A (zh) * 2022-06-27 2022-09-27 平安银行股份有限公司 Link anomaly handling method, apparatus, electronic device and storage medium
WO2024042307A1 (en) * 2022-08-24 2024-02-29 Vodafone Group Services Limited Computer implemented methods, systems and program instructions for detecting anomalies in a core network of a telecommunications network
CN115599657B (zh) * 2022-12-15 2023-03-17 浪潮通信信息系统有限公司 Method for determining anomalies in software facilities
CN115599657A (zh) * 2022-12-15 2023-01-13 浪潮通信信息系统有限公司(Cn) Method for determining anomalies in software facilities
CN116566859A (zh) * 2023-06-28 2023-08-08 广州极电通信技术有限公司 Network anomaly detection method and system for switches
CN116566859B (zh) * 2023-06-28 2023-10-31 广州极电通信技术有限公司 Network anomaly detection method and system for switches
CN117692163A (zh) * 2023-10-31 2024-03-12 青岛文达通科技股份有限公司 Smart city data processing method
CN117692163B (zh) * 2023-10-31 2024-06-04 青岛文达通科技股份有限公司 Smart city data processing method
CN117435191A (zh) * 2023-12-20 2024-01-23 杭银消费金融股份有限公司 Program processing method and apparatus based on customization requirements
CN117435191B (zh) * 2023-12-20 2024-03-26 杭银消费金融股份有限公司 Program processing method and apparatus based on customization requirements

Also Published As

Publication number Publication date
CN113556258B (zh) 2022-12-27
CN113556258A (zh) 2021-10-26

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 21792279; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 EP: PCT application non-entry in European phase
    Ref document number: 21792279; Country of ref document: EP; Kind code of ref document: A1