WO2023241484A1 - Abnormal event processing method, electronic device and storage medium - Google Patents

Abnormal event processing method, electronic device and storage medium

Info

Publication number: WO2023241484A1
Application number: PCT/CN2023/099448
Authority: WO (WIPO (PCT))
Prior art keywords: aggregation, abnormal, event, target, time
Other languages: English (en), French (fr)
Inventors: 姜磊, 罗秋野, 文秀林, 孟照星
Original Assignee: 中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023241484A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications

Definitions

  • This application relates to but is not limited to the field of communication technology, and in particular to an abnormal event processing method, an electronic device, and a storage medium.
  • Embodiments of the present application provide an abnormal event processing method, electronic device, and storage medium.
  • embodiments of the present application provide a method for processing abnormal events.
  • the method includes: acquiring multiple abnormal events at a target location within a preset time period.
  • The abnormal events include at least one of alarms, key performance indicator exceptions, and operation logs; determining an aggregation point among the abnormal events; and performing aggregation according to the aggregation point and the abnormal events to obtain an aggregation result.
  • an embodiment of the present application provides an electronic device, including: a memory and a processor.
  • the memory stores a computer program.
  • When the processor executes the computer program, the abnormal event processing method according to any one of the embodiments of the first aspect of the present application is implemented.
  • Embodiments of the present application provide a computer-readable storage medium; the storage medium stores a program, and the program is executed by a processor to implement the abnormal event processing method according to any one of the embodiments of the first aspect of the present application.
  • Figure 1 is a schematic flowchart of an abnormal event processing method provided by an embodiment of the present application.
  • Figure 2 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 3 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 4 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 5 is a schematic diagram of a target cache area provided by an embodiment of the present application.
  • Figure 6 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 7 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 8 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 9 is a schematic diagram of forward and backward bidirectional aggregation using communication exceptions as aggregation points provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of forward and backward bidirectional aggregation using network unreachable alarms as aggregation points provided by an embodiment of the present application;
  • Figure 11 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 12 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 13 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 14 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 15 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present application.
  • Figure 16 is a schematic diagram of an electronic device provided by an embodiment of the present application.
  • Aggregation analysis refers to aggregating relevant streaming data and then analyzing to identify the root cause of the fault.
  • Aggregation has two dimensions: time and space. This is also the so-called spatio-temporal aggregation.
  • Aggregation in the spatial dimension can use topological resource correlation, such as data related to the same network element, the same link, or the same computer room; aggregation in the time dimension means that related data are aggregated within a certain time range. Unlike aggregation in the spatial dimension, aggregation in the time dimension is relatively difficult, mainly because the time range is not easy to determine.
  • For example, one rule may be that, for the same network element, if an optical module received-optical-power abnormality alarm and an RRU link down alarm occur within the same 10-minute window, the two alarms are considered aggregatable; another rule may be that, for the same network element, within a 15-minute window, the RRU link disconnection alarm and the distributed unit (Distributed Unit, DU) cell outage alarm are considered aggregatable.
  • With such rules, the time window in the time dimension is difficult to determine: the required window can neither be determined by the maximum window among the rules (15 minutes in this example) nor by the sum of the windows of all related rules.
  • Moreover, there is a key dependency: the time dimension relies entirely on backward judgment. That is, if the occurrence of alarm A causes the occurrence of alarm B, then alarm A occurs before alarm B, and the occurrence of alarm A is used to predict the time at which alarm B will occur.
  • In practice, however, the reported occurrence times may differ, and alarm B may even appear before alarm A, so multiple data sources cannot be aggregated in this way. If only alarms and key performance indicator anomalies are considered, it is relatively easy to determine when the abnormality occurred, because there is a clear abnormal data source: if an alarm causes a key performance indicator to deteriorate or become abnormal, the alarm occurs first and the anomaly is aggregated backward from it. But if a certain operation triggers a related alarm, the log of that operation precedes the alarm, which is inconvenient for backward aggregation; and since a log does not clearly indicate an abnormality by itself, it cannot be perceived immediately.
  • Therefore, the solutions in the related technologies have a technical flaw: if an alarm is triggered by an operation that precedes it, such aggregation cannot clarify the root cause of the fault, so the aggregation capability is low, resulting in a low level of fault operation and maintenance.
  • embodiments of the present application provide an abnormal event processing method, electronic device, and storage medium, which can realize two-way aggregation, improve the aggregation capability of data sources, and improve the level of fault operation and maintenance.
  • the embodiment of the present application provides an abnormal event processing method.
  • the abnormal event processing method in the embodiment of the present application includes but is not limited to step S101 to step S103.
  • Step S101 Obtain multiple abnormal events at the target location within a preset time period.
  • the abnormal events include at least one of alarms, key performance indicator abnormalities, and operation logs.
  • Step S102 Determine the aggregation point in abnormal events.
  • Step S103 Aggregate based on aggregation points and abnormal events to obtain aggregation results.
  • the abnormal event processing method in the embodiment of the present application can be applied in communication equipment. By executing the abnormal event processing method, two-way aggregation can be achieved, the aggregation capability of data sources can be improved, and the level of fault operation and maintenance can be improved.
  • multiple abnormal events at the target location can be obtained within a preset time period, and the aggregation point can be determined among the obtained abnormal events.
  • The abnormal events include at least one of alarms, key performance indicator (KPI) exceptions, and operation logs.
  • In some embodiments, the abnormal events include one of alarms or key performance indicator exceptions, together with operation logs.
  • the abnormal events include alarms, key performance indicator exceptions and operation logs.
  • the embodiment of this application takes the above three as an example to illustrate.
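  • As a minimal illustrative sketch (not part of the original application), the three event types and a simple event record could be modeled as follows; the class and field names are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum

class EventType(Enum):
    ALARM = "alarm"                # e.g. network unreachable, RRU link down
    KPI_EXCEPTION = "kpi"          # key performance indicator abnormality
    OPERATION_LOG = "log"          # e.g. user login, configure routing

# eq=False keeps default identity hashing, so events can later be used as dict keys.
@dataclass(eq=False)
class AbnormalEvent:
    event_type: EventType
    name: str                      # e.g. "network unreachable alarm"
    location: str                  # target location: network element, link or computer room
    timestamp: float               # acquisition/occurrence time, seconds since epoch
    is_aggregation_point: bool = False
```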
  • the aggregation point is a point set according to the aggregation needs.
  • The type of aggregation point can be specified through configuration.
  • the embodiment of the present application performs aggregation based on aggregation points and abnormal events to obtain aggregation results.
  • The aggregation point is one or more of the many abnormal events. Since abnormal events are continuously acquired within the preset time period, the time of the aggregation point falls in the middle of that period. It can be understood that aggregation around the aggregation point can contain multiple abnormal events before and after the acquisition time of the aggregation point, and these events can be at least one of alarms, key performance indicator anomalies, and operation logs.
  • Therefore, even when an alarm or key performance indicator anomaly is triggered in advance by some operation that precedes the aggregation point, the data before and after the aggregation point can be aggregated.
  • The obtained aggregation results can be used to clarify the root cause of the fault and achieve two-way aggregation, which improves the aggregation capability of the data sources and the level of fault operation and maintenance.
  • the preset time period in the embodiment of this application can be set according to actual operation and maintenance needs.
  • the preset time period can be 20 minutes, 1 hour, 4 hours or longer.
  • the embodiment of the present application can start to obtain abnormal events as data sources, including obtaining at least one of alarms, key performance indicator exceptions and operation logs, to achieve the acquisition of multiple data sources.
  • The data sources are obtained within the preset time period.
  • In this way, the aggregation time can be clarified in the time dimension.
  • the target location in the embodiment of this application is a location in the spatial dimension.
  • the target location can be a network element, a computer room or a link.
  • the root cause of the fault of the network element, computer room or link can be obtained.
  • step S101 may also include but is not limited to step S201 and step S202.
  • Step S201 Create multiple time buckets in the target cache area according to the total duration of the preset time period.
  • the time buckets are composed of timestamp intervals.
  • the duration of each time bucket is the same and the time of two adjacent time buckets is continuous.
  • Step S202 Continuously acquire multiple abnormal events at the target location and cache them in the time bucket corresponding to the acquisition time of each abnormal event.
  • the embodiment of the present application realizes the caching method of abnormal events by setting time buckets.
  • the embodiment of the present application establishes multiple time buckets in the target cache area according to the total duration of the preset time period.
  • The time buckets are composed of timestamp intervals; the duration of each time bucket is the same, and the times of two adjacent time buckets are continuous.
  • The target cache area is the cache area corresponding to the target location.
  • One target location can correspond to multiple cache areas, or to a single cache area in a one-to-one manner.
  • Since the embodiment of this application aggregates in a two-way time dimension, each cache area caches abnormal events into time buckets according to their timestamps and a certain time interval. Therefore, an abnormal event does not need to be aggregated immediately after it is cached; the method waits until the time buckets have finished caching before preparing for aggregation.
  • the duration of each time bucket is the same and the times of two adjacent time buckets are continuous.
  • For example, for a 20-minute preset time period, the length of each time bucket can be set to 5 minutes, yielding 4 consecutive time buckets: the first time bucket caches events from minute 0 to minute 5, the second from minute 5 to minute 10, the third from minute 10 to minute 15, and the fourth from minute 15 to minute 20; longer preset time periods follow the same pattern.
  • the length of each time bucket can be set according to actual operation and maintenance needs, and there are no specific restrictions here.
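  • Building on the AbnormalEvent sketch above, the following is a minimal illustration (an assumption, not taken from the application) of creating consecutive time buckets for a cache area and routing each event into the bucket that covers its timestamp; the helper names are hypothetical.

```python
from typing import List

class TimeBucket:
    def __init__(self, start: float, end: float):
        self.start, self.end = start, end   # [start, end) timestamp interval
        self.events = []                    # AbnormalEvent instances cached in this bucket

def create_time_buckets(period_start: float, total_seconds: float,
                        bucket_seconds: float = 300.0) -> List[TimeBucket]:
    """Split the preset time period into equal, contiguous time buckets."""
    buckets, t = [], period_start
    while t < period_start + total_seconds:
        buckets.append(TimeBucket(t, t + bucket_seconds))
        t += bucket_seconds
    return buckets

def cache_event(buckets: List[TimeBucket], event: AbnormalEvent) -> bool:
    """Cache the event into the bucket whose interval covers its acquisition time."""
    for bucket in buckets:
        if bucket.start <= event.timestamp < bucket.end:
            bucket.events.append(event)
            return True
    return False   # outside the preset time period
```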
  • When to stop caching abnormal events can be controlled by setting the condition under which the time buckets in the target cache area no longer accept events.
  • step S202 may also include but is not limited to step S301 to step S303.
  • Step S301 Obtain convergence conditions for stopping caching of abnormal events within a preset time period.
  • Step S302 Continuously acquire multiple abnormal events at the target location starting from the start time of the preset time period, and cache them in the time bucket of the corresponding time according to the acquisition time of each abnormal event.
  • Step S303 When the cached abnormal events meet the convergence condition, stop caching the abnormal events.
  • the embodiment of the present application caches abnormal events at the target location by setting the time bucket mode, and controls it by setting the convergence conditions of the time bucket.
  • This embodiment of the present application obtains the convergence conditions for stopping the caching of abnormal events within a preset time period.
  • Multiple abnormal events at the target location are continuously obtained starting from the start time of the preset time period and are cached, in order of acquisition time, into the time bucket of the corresponding time. When the cached abnormal events meet the convergence conditions, caching of abnormal events stops.
  • When the convergence conditions are met, the target cache area has finished caching; that is, the cache area is closed and no longer receives other abnormal events.
  • the time bucket is encapsulated and prepared for aggregation.
  • the target cache area can also be cleared after the convergence conditions are met to wait for subsequent exception event caching.
  • Setting convergence conditions to control when to stop caching exception events eliminates the need to waste data collection time, improves aggregation efficiency and improves aggregation capabilities in the time dimension.
  • At this point, the cached data already contains bidirectional data around the aggregation point in the time dimension.
  • That is, the abnormal events before and after the aggregation point, with the aggregation point as the center, have all entered the target cache area, which improves the aggregation capability of the data so that two-way aggregation can be performed.
  • the convergence condition may include but is not limited to at least one of steps S401 to S403.
  • Step S401 The time of obtaining the abnormal event exceeds the end time of the preset time period.
  • Step S402 The number of cached abnormal events decreases across multiple consecutive time buckets, and the ratio of a later bucket's count to the previous bucket's count is less than the preset target decrement ratio.
  • Step S403 The number of abnormal events in the time bucket is less than the preset minimum threshold for the number of events in the bucket.
  • There can be multiple convergence conditions in the embodiment of the present application, so that it can be determined in the time dimension when data source collection is complete, improving aggregation efficiency and aggregation capability in the time dimension.
  • The convergence conditions can include at least one of step S401 to step S403; it can be understood that when any one of the above convergence conditions is met, data source collection is determined to be complete, and caching of abnormal events therefore stops.
  • determining whether the time for obtaining abnormal events exceeds the end time of the preset time period is one of the convergence conditions.
  • the preset time period has a start time and an end time.
  • The end time indicates that the cache time has expired, that is, the time interval from the time of the last aggregation point of this cache round to the end time has been reached.
  • This time interval is the maximum value of the preset time period; it limits excessive waiting, and this convergence condition implements forced termination of caching.
  • For example, the maximum aggregation time interval, that is, the preset time period, can be set to 60 minutes; after this time, no further messages are waited for.
  • Judging whether the decrement ratio of the number of cached abnormal events across multiple consecutive time buckets is less than the preset target decrement ratio is one of the convergence conditions.
  • For example, when the number of abnormal events in three consecutive time buckets decreases at a rate that falls below the target decrement ratio, it is judged that abnormal events no longer need to be cached; the target decrement ratio can, for instance, be set to 25%, and its value can be configured according to actual needs.
  • As shown in Figure 5, the time buckets in the target cache area cache abnormal events such as user login logs, routing configuration logs, routing restart logs, communication exceptions, network unreachable alarms, key performance indicator exceptions (KPI exceptions in the figure), service exceptions, and service restart alarms, among which the communication exceptions, network unreachable alarms, key performance indicator exceptions, and service restart alarms are the aggregation points.
  • The number of abnormal events in time bucket 4 is one third of that in time bucket 3, which is higher than the configured target decrement ratio (25%); therefore, caching cannot end yet and abnormal events continue to be received.
  • judging whether the number of abnormal events in a time bucket is less than the preset minimum threshold for the number of events in the bucket is one of the convergence conditions.
  • the number of abnormal events in a time bucket is less than the minimum threshold for the number of events in the bucket, that is, the minimum number of events in the bucket.
  • the minimum threshold of the number of events in the bucket can be set according to actual needs. When it is lower than the minimum threshold of the number of events in the bucket, it is judged that it is no longer necessary to cache abnormal events.
  • the embodiment of this application implements the end of caching after the events converge within a certain period of time.
  • As also shown in Figure 5, if time bucket 4 receives only 1 event, which is less than the minimum threshold of the number of events in a bucket (assumed to be 2), abnormal events no longer need to be received and caching stops; that is, time bucket 5 does not need to receive any more events.
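  • A minimal sketch of the three convergence checks, under one possible reading of the decrement-ratio condition and building on the earlier sketches (threshold values and helper names are assumptions):

```python
def should_stop_caching(buckets, now, period_end,
                        target_ratio=0.25, min_events=2, window=3):
    """Return True when any of the three convergence conditions is met."""
    # Condition 1: the cache time has expired (end of the preset time period reached).
    if now > period_end:
        return True
    filled = [b for b in buckets if b.events]
    # Condition 3: the latest non-empty time bucket holds fewer events than the minimum threshold.
    if filled and len(filled[-1].events) < min_events:
        return True
    # Condition 2: over `window` consecutive buckets, each bucket holds less than
    # target_ratio of the events of the previous bucket (marginal effect has diminished).
    if len(filled) >= window:
        recent = filled[-window:]
        if all(len(b.events) < target_ratio * len(a.events)
               for a, b in zip(recent, recent[1:])):
            return True
    return False
```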
  • step S201 may also include but is not limited to step S501 and step S502.
  • Step S501 Obtain the preset time period corresponding to each target position.
  • Step S502 Create target cache areas corresponding to each target location, and create multiple time buckets in the corresponding target cache areas according to the total duration of each preset time period.
  • the embodiment of the present application caches data sources according to different target locations.
  • Each different location can correspondingly set a preset time period required for its own cache.
  • This application obtains the preset time period corresponding to each target location, caches the data of each target location, and establishes a target cache area corresponding to each target location.
  • The target cache areas correspond one-to-one with the target locations; each target location has a corresponding target cache area, and multiple time buckets are established in the corresponding target cache area according to the total duration of each preset time period, so that all abnormal events of each target location are cached into the corresponding time buckets.
  • the target location in the embodiment of this application is a location in the spatial dimension.
  • the target location can be a network element, a link, or an equipment room.
  • Multiple target locations can include multiple network elements, links, and equipment rooms.
  • the aggregation results of each different target location can be obtained, so that the aggregation analysis of each target location can be performed.
  • Through the aggregation result of each target location, the root cause of the fault at that location can be analyzed; an overall aggregation result across multiple target locations can also be obtained, from which the root cause of faults spanning multiple target locations can be analyzed. This improves the aggregation capability and the level of fault operation and maintenance.
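  • Building on the earlier sketches, a per-location cache manager could look roughly like this (an illustrative assumption; class and method names are hypothetical):

```python
class CacheManager:
    """One cache area (a list of time buckets) per target location."""
    def __init__(self):
        self.periods = {}   # target location -> (period start, total seconds)
        self.areas = {}     # target location -> list of TimeBucket

    def register_location(self, location: str, period_start: float,
                          total_seconds: float, bucket_seconds: float = 300.0):
        # Each target location may use its own preset time period and bucket length.
        self.periods[location] = (period_start, total_seconds)
        self.areas[location] = create_time_buckets(period_start, total_seconds,
                                                   bucket_seconds)

    def cache(self, event: AbnormalEvent) -> bool:
        buckets = self.areas.get(event.location)
        return cache_event(buckets, event) if buckets else False
```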
  • step S102 may also include but is not limited to step S601 and step S602.
  • Step S601 Obtain the filtering conditions of the aggregation point.
  • Step S602 Determine the abnormal event that satisfies the filtering conditions among the multiple abnormal events as the aggregation point.
  • the embodiment of the present application can obtain the filtering conditions of the aggregation point, determine the aggregation point from the abnormal events, and determine which are major alarms or major key performance indicator abnormalities from the abnormal events through the filtering conditions.
  • A major alarm can be any of the alarms among the abnormal events, and a major key performance indicator abnormality can be any of the key performance indicator abnormalities among the abnormal events; for example, aggregation points may include base station out of service, cell out of service, and so on.
  • The aggregation point is the real center of operation and maintenance attention; aggregation centered on aggregation points that are major alarms or major key performance indicator anomalies suits actual operation and maintenance needs. Otherwise, aggregating a large number of ordinary alarms wastes a lot of time and results in a low level of operation and maintenance.
  • The alarm types and key performance indicator exception types used as aggregation points can be specified through configuration.
  • the filtering conditions can be customized according to actual operation and maintenance needs to determine major alarms or major key performance indicator anomalies.
  • The aggregation point is one or more of the many abnormal events. Since abnormal events are continuously acquired within the preset time period, the time of the aggregation point is located in the middle of the preset time period. It can be understood that aggregation around the aggregation point can include multiple abnormal events before and after the acquisition time of the aggregation point, and these events can be at least one of alarms, key performance indicator exceptions, and operation logs. Therefore, in the embodiment of the present application, even if an alarm or key performance indicator anomaly is triggered by a certain operation that precedes the aggregation point,
  • the data before and after the aggregation point can be aggregated.
  • the obtained aggregation results can be used to clarify the root cause of the fault and achieve two-way aggregation, which can improve the aggregation capabilities of the data source and improve the level of fault operation and maintenance.
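  • A minimal sketch of marking aggregation points via a configurable filtering condition (the set of major event names below is a hypothetical placeholder):

```python
# Hypothetical filtering condition: names of major alarms / major KPI anomalies.
MAJOR_EVENT_NAMES = {"base station out of service", "cell out of service",
                     "network unreachable alarm", "communication exception"}

def mark_aggregation_points(events):
    """Mark events satisfying the filtering condition as aggregation points and return them."""
    points = []
    for e in events:
        if e.event_type in (EventType.ALARM, EventType.KPI_EXCEPTION) \
                and e.name in MAJOR_EVENT_NAMES:
            e.is_aggregation_point = True
            points.append(e)
    return points
```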
  • step S103 may also include but is not limited to steps S701 to S703.
  • Step S701 Determine the first target event and the second target event among the abnormal events, where the first target event is characterized as a noise event at the aggregation point, and the second target event is characterized as a related event at the aggregation point.
  • Step S702 Clear the first target event and retain the second target event.
  • Step S703 Perform aggregation based on the aggregation point and the second target event to obtain an aggregation result.
  • the embodiment of the present application can denoise abnormal events, remove unnecessary events, and retain the abnormal events related to the aggregation point, so as to improve the aggregation capability.
  • The first target event and the second target event can be determined among the abnormal events.
  • the first target event is represented as a noise event at the aggregation point
  • the second target event is represented as an associated event at the aggregation point.
  • If noise events were aggregated with the aggregation point, the final aggregation result would contain too much data, including many abnormal events that are useless for fault root cause analysis.
  • Therefore, the events characterized as noise events of the aggregation point, that is, the first target events, and the events characterized as associated events of the aggregation point, that is, the second target events, are determined; the first target events are cleared and the second target events are retained.
  • Aggregation is then performed based on the aggregation point and the second target events to obtain the aggregation result, which improves the aggregation capability of the embodiment of the present application and the level of fault operation and maintenance.
  • Since the abnormal events in the embodiment of the present application include operation logs, and a large number of operation logs are produced during actual operation and maintenance, the abnormal events before and after the aggregation point, including the operation logs before and after it, can be obtained to produce the aggregation result, and the root cause of the fault can then be determined from the aggregation result by finding the operation logs that caused the abnormality at the aggregation point.
  • the embodiment of the present application ultimately ensures the aggregation capability and efficiency of the embodiment of the present application by clarifying the first target event and the second target event in the abnormal event, clearing the first target event and retaining the second target event.
  • For example, as shown in Figure 9, a communication exception can be aggregated bidirectionally, forward and backward: forward aggregation can aggregate abnormal events such as user login logs, routing configuration logs, and routing restart logs, while backward aggregation can aggregate abnormal events such as network unreachable alarms, key performance indicator anomalies, and service anomalies.
  • Similarly, as shown in Figure 10, a network unreachable alarm can be aggregated bidirectionally, forward and backward: forward aggregation can aggregate abnormal events such as user login logs, routing configuration logs, routing restart logs, and communication exceptions, while backward aggregation can aggregate abnormal events such as key performance indicator anomalies, service anomalies, and service restart alarms.
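  • As a simple illustration of two-way aggregation around an aggregation point (the symmetric time window is an assumed parameter, not specified by the application):

```python
def two_way_aggregate(events, point, window_seconds=600.0):
    """Collect associated events both before (forward) and after (backward)
    the aggregation point within a symmetric time window."""
    forward = [e for e in events
               if point.timestamp - window_seconds <= e.timestamp < point.timestamp]
    backward = [e for e in events
                if point.timestamp < e.timestamp <= point.timestamp + window_seconds]
    return {"aggregation_point": point, "forward": forward, "backward": backward}
```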
  • step S103 may also include but is not limited to step S801 and step S802.
  • Step S801 Aggregate aggregation points and abnormal events to obtain an aggregation package.
  • Step S802 Perform root cause identification on the aggregation package, and combine the abnormal events corresponding to each aggregation point to obtain the root cause identification result of the aggregation point.
  • root cause identification can be performed in the embodiment of the present application to obtain the root cause identification result.
  • An aggregation package is obtained by aggregating the aggregation points and the abnormal events; root cause identification is performed on the aggregation package, and the root cause identification result of each aggregation point is obtained by combining the abnormal events corresponding to that aggregation point.
  • Aggregation can be performed based on the aggregation point and the second target events to obtain the aggregation package; aggregating only the second target events yields an aggregation package with higher aggregation efficiency. Root cause identification is then performed on these useful abnormal events: the second target events associated with the aggregation point can be analyzed together with a knowledge base and other techniques to determine which abnormal event is the root cause event, thus improving the level of fault operation and maintenance.
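  • A minimal sketch of assembling an aggregation package from an aggregation point and its retained events before handing it to root cause identification (identify_root_cause is a hypothetical downstream function):

```python
def build_aggregation_package(point, second_target_events, location):
    """Combine an aggregation point with its retained (associated) events."""
    return {
        "location": location,                 # network element / link / computer room
        "aggregation_point": point,
        "associated_events": sorted(second_target_events, key=lambda e: e.timestamp),
    }

# Usage sketch:
# package = build_aggregation_package(point, kept_events, "NE-1")
# root_cause = identify_root_cause(package)   # hypothetical root cause analysis step
```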
  • step S701 may also include but is not limited to step S901 and step S902.
  • Step S901 Perform initialization processing on the abnormal events to obtain initial data, and input the initial data into the preset two-way aggregation model for probability calculation to obtain the noise probability values of each abnormal event and the corresponding aggregation point.
  • Step S902 Determine the first target event and the second target event among the abnormal events according to the noise probability value.
  • the embodiment of the present application determines the first target event and the second target event in the abnormal event by obtaining a preset two-way aggregation model.
  • The two-way aggregation model is a data processing model obtained through neural network training. In the embodiment of this application, the initial data is obtained by initializing the abnormal events; the initial data is input into the preset two-way aggregation model for probability calculation, and the noise probability value of each abnormal event with respect to the corresponding aggregation point is obtained.
  • the input of the two-way aggregation model needs to match the corresponding initial data so that the two-way aggregation model can perform data processing.
  • the noise probability value can represent the probability that the abnormal event is a noise event at the corresponding aggregation point.
  • Based on the probability represented by the noise probability value, it can be determined whether an abnormal event is a noise event of the corresponding aggregation point, thereby determining the first target events and the second target events.
  • Each abnormal event can undergo probability calculation through the two-way aggregation model to obtain its noise probability value for each aggregation point. This is because some abnormal events have a low probability for some aggregation points but a high probability for other aggregation points; calculating the probability of each abnormal event against each aggregation point therefore avoids removing abnormal events that have a high probability for some aggregation point and helps aggregate all aggregation points.
  • step S902 may also include but is not limited to steps S1001 to S1003.
  • Step S1001 Obtain the first probability threshold and the second probability threshold of each aggregation point.
  • Step S1002 Determine abnormal events corresponding to noise probability values lower than all first probability thresholds as first target events.
  • Step S1003 Determine an abnormal event corresponding to a noise probability value higher than any second probability threshold as a second target event.
  • the embodiment of the present application filters abnormal events by setting a low probability threshold and a high probability threshold.
  • the embodiment of the present application can obtain the first probability threshold and the second probability threshold of each aggregation point.
  • The first probability threshold is a low probability threshold used to filter out the first target events among the abnormal events; an abnormal event whose noise probability values are lower than all of the first probability thresholds is determined to be a first target event, and the first target events are low-probability events. The second probability threshold is a high probability threshold used to filter out the second target events; an abnormal event whose noise probability value is higher than any second probability threshold is determined to be a second target event, and the second target events are high-probability events.
  • Events below the low probability threshold, that is, the first probability threshold, are marked, and events above the high probability threshold are placed in a high probability list. Both the first probability threshold (the low probability threshold) and the second probability threshold (the high probability threshold) can be set in the interface or a configuration file and configured according to actual operation and maintenance needs. In the high probability list, the key is the aggregation point and the value is a list storing the abnormal events above the high probability threshold. Abnormal events below the low probability threshold are not immediately excluded when analyzing each aggregation point, because some abnormal events have a low probability for some aggregation points but a high probability for other aggregation points.
  • Therefore, when determining which abnormal events are first target events, the noise probability values of an abnormal event must be lower than the first probability thresholds of all aggregation points before it is determined to be a first target event; when judging second target events, the noise probability value of an abnormal event only needs to be higher than the second probability threshold of any aggregation point for it to be determined as a second target event.
  • For the first probability threshold: when a certain aggregation point is used to predict its context-related events, if the probability of an abnormal event is very low for all aggregation points, that is, lower than the first probability threshold (for example, set to 10%), the event can be removed as noise. For the second probability threshold: when a certain aggregation point is used to predict its context-related events, if the probability of an abnormal event is higher than the second probability threshold (for example, set to 75%), the correlation is considered very strong and the event can assist subsequent root cause analysis.
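  • A minimal sketch of this two-threshold classification (prob_fn stands for the probability computed by the two-way aggregation model; the 10%/75% defaults follow the examples above, and all names are assumptions):

```python
def classify_events(events, points, prob_fn, low=0.10, high=0.75):
    """prob_fn(event, point) returns the probability (the "noise probability value")
    that `event` belongs to the context of aggregation `point`.
    Returns (first_target_events, high_probability_lists)."""
    high_prob = {p: [] for p in points}   # key: aggregation point, value: event list
    first_target = []                     # noise events: low probability for ALL points
    for e in events:
        probs = {p: prob_fn(e, p) for p in points if p is not e}
        if not probs:
            continue                      # nothing to compare this event against
        for p, pr in probs.items():
            if pr > high:                 # strongly associated: a second target event
                high_prob[p].append(e)
        if all(pr < low for pr in probs.values()):
            first_target.append(e)        # noise for every aggregation point: to be cleared
    return first_target, high_prob
```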
  • step S901 may also include but is not limited to steps S1101 to S1103.
  • Step S1101 Perform one-hot encoding on the abnormal event to obtain initialized initial vector data.
  • Step S1102 Obtain a preset bidirectional aggregation model, where the bidirectional aggregation model is obtained through unsupervised training based on sample abnormal events in the acquired samples, sample target events characterized as noise events, and sample aggregation points.
  • Step S1103 Input the initial vector data into the preset two-way aggregation model for probability calculation, and obtain the noise probability values of each abnormal event and the corresponding aggregation point.
  • the two-way aggregation model can be established in advance based on the data in the sample.
  • Some events are noise events with respect to the aggregation point, such as routine operation logs or transient alarms that happen to fall in the same time window as the abnormal point. Their presence interferes with the aggregation analysis, so they are filtered out through artificial intelligence (AI) training and a probability threshold, which makes the aggregation analysis more accurate.
  • The continuous skip-gram (Skip-gram) model for word vectorization uses the center word to predict context words. The embodiment of this application uses the same probability principle: after vectorizing the abnormal events, the probabilities of abnormal events in the time periods before and after the aggregation point are obtained through the two-way aggregation model, and the abnormal events are denoised through the probability thresholds.
  • a trainer can be set up to load historical data.
  • the historical data can include sample abnormal events in the sample, sample target events characterized as noise events, and sample aggregation points.
  • In the training phase, these sample data are one-hot encoded and then trained in an unsupervised manner; when the context probability is largest, the loss function is smallest. In this way, abnormal events can be vectorized, and the bidirectional aggregation model needed by subsequent applications is obtained through training.
  • When performing probability calculations, the trained two-way aggregation model is first loaded, one-hot encoding is performed on the abnormal events to obtain the initialized initial vector data, and the initial vector data is input into the preset two-way aggregation model for probability calculation to obtain the noise probability value of each abnormal event with respect to the corresponding aggregation point.
  • Since the data input into the bidirectional aggregation model has already been expressed as initialization vectors through one-hot encoding, the bidirectional model obtained from training and release can be used directly to perform vector probability calculations on the abnormal events and obtain the noise probability value of each abnormal event with respect to the corresponding aggregation point.
  • the implementation of this application can use but is not limited to the Skip-gram model of Word2vec for training, including the construction of neural networks to obtain the required two-way aggregation model.
  • When the Skip-gram model is used, its loss function is minimized when the context probabilities are maximized. The correlation among different alarms, operation logs, and other abnormal events is learned through training, and the intermediate hidden layer obtained in this way is the final required model.
  • the steps of training to obtain the bidirectional aggregation model will not be described in detail in the embodiments of this application.
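  • Purely as an illustration of this kind of Skip-gram training (the application does not prescribe a specific library), a model could be trained with gensim's Word2Vec by treating each historical cache area's time-ordered event names as one "sentence"; the corpus contents and parameter values below are assumptions.

```python
from gensim.models import Word2Vec

# Each "sentence" is the time-ordered list of event names from one historical cache area.
corpus = [
    ["user login log", "configure routing log", "restart routing log",
     "communication exception", "network unreachable alarm", "KPI exception",
     "service exception", "service restart alarm"],
    # ... more historical sequences built from alarms, logs and KPI anomalies ...
]

# sg=1 selects the Skip-gram architecture: the center event predicts its context events.
model = Word2Vec(sentences=corpus, vector_size=64, window=5, sg=1,
                 min_count=1, epochs=50)

# An approximate context distribution for an aggregation point used as the center word:
candidates = model.predict_output_word(["network unreachable alarm"], topn=10)
```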
  • step S901 may also include but is not limited to step S1201 and step S1202.
  • Step S1201 Sort multiple aggregation points according to time and store them in an aggregation point list.
  • Step S1202 input the initial data into the preset two-way aggregation model, and perform probability calculations on abnormal events according to each aggregation point in the aggregation point list to obtain the noise probability value of each abnormal event and the corresponding aggregation point.
  • The embodiment of the present application stores aggregation points by establishing an aggregation point list. After the target cache area stops caching abnormal events, it is closed and no longer receives other abnormal events; the time buckets are encapsulated into an initial package in preparation for aggregation, and the cache area is then cleared to wait for subsequent event caching. It should be emphasized that, from the perspective of the aggregation point, the difference from conventional one-way backward aggregation is that when the target cache area is closed, it already contains the two-way events: the events before and after the aggregation point, with the aggregation point as the center, have all entered the cache area.
  • The two-way aggregation in the embodiment of the present application is based on the aggregation points, so the aggregation points of the target cache area are collected. If there is no aggregation point, the target cache area is directly recycled for the next round of caching. Since there may be multiple aggregation points in a target cache area, or even in a single bucket of the cache area, the aggregation points are collected first, sorted by time, and stored in the aggregation point list; the abnormal events with the earliest and latest occurrence times in the cache area are then given, and the location of the target cache area can also be given, such as a network element, computer room, or link.
  • The embodiment of the present application first obtains the aggregation point list, and then uses the two-way model to obtain the noise probability values of the other abnormal events in the data with respect to the aggregation points in the list. Events lower than the low probability threshold are marked, and events higher than the high probability threshold are placed in a high probability list, where the key is the aggregation point and the value is the list storing these high-probability abnormal events. Finally, after denoising, each aggregation point is appended with its high probability list, combined into an aggregation package, and sent to root cause identification.
  • the abnormal event processing method in the embodiment of the present application can be applied in an abnormal event processing device, referred to as a processing device, and the processing device may include:
  • Cache: receives external abnormal events into cache areas, assembles them into initial packages, and sends them to the packer;
  • Packer: receives the initial package, packages it into an encoding package, and sends it to the aggregator;
  • Aggregator: receives the encoding package, performs context probability prediction centered on the aggregation points, denoises, obtains the aggregation package, and sends it to root cause analysis;
  • Trainer: trains and publishes the bidirectional aggregation model; the cache, the packer, and the aggregator are communicatively connected with the trainer. The abnormal event processing method in the above embodiment, executed through this processing device, can include the following four steps:
  • Step 1 The trainer trains the bidirectional aggregation model to complete event vectorization.
  • the Skip-gram model of Word2vec uses the principle of predicting the probability of context words using the center word.
  • The embodiment of this application uses the same principle: the abnormal events are vectorized, the probabilities of abnormal events in the time periods before and after the aggregation point are obtained through the two-way aggregation model, and the abnormal events are denoised through the probability thresholds.
  • The trainer loads historical alarms, logs, key performance indicator anomalies, and troubleshooting manuals as a corpus for one-hot encoding; through unsupervised training, when the context probability is largest and the loss function is smallest, the abnormal events are vectorized, the bidirectional aggregation model is obtained, and the model is then published.
  • Step 2 The cache receives and caches streaming exception events.
  • Exception events are streaming input, so exception events need to be cached for a certain period of time.
  • the cache sets different cache areas according to different spatial dimensions, that is, different locations.
  • One cache area can only cache abnormal events in the same spatial dimension.
  • Each cache area caches abnormal events into time buckets according to their timestamps and a certain time interval; if an event is an aggregation point, it is marked.
  • a time bucket is a time interval, such as five minutes, and the abnormal events of these 5 minutes are cached in it.
  • the next time bucket caches the abnormal events of the next time interval, such as five minutes.
  • a time bucket caches a batch of exception events every 5 minutes.
  • the size of the bucket may be different at different times.
  • a cache area consists of one or more time buckets.
  • the key is when to end, that is, the cache is completed and can be aggregated.
  • the embodiment of this application uses three dimensions as convergence conditions to complete the cache of the last time bucket:
  • First, the cache time expires, that is, the time interval from the time of the last aggregation point of this cache round to the deadline is reached; this interval is the maximum time interval, it limits excessive waiting, and this approach forces caching to end;
  • Second, the number of abnormal events in three consecutive time buckets decreases at a rate that falls below the event-number decrement ratio, that is, a later bucket holds fewer events than a configurable proportion (for example, 25%) of the previous bucket; this approach ends caching once the marginal effect has diminished. As shown in Figure 5, the number of events in time bucket 4 is one third of that in time bucket 3, so caching cannot end yet and abnormal events continue to be received;
  • Third, the number of abnormal events in a time bucket is less than the minimum threshold of the number of events in a bucket, that is, the minimum bucket event count, which is configurable; this approach ends caching after the events have converged within a certain period of time. As also shown in Figure 5, time bucket 4 receives only 1 event, which is less than the minimum threshold (assumed to be 2), so no more events need to be received and time bucket 5 does not need to receive abnormal data.
  • When the cache area has finished caching, it is closed and no longer receives other abnormal events; the time buckets are encapsulated into an initial package and sent to the packer in preparation for aggregation, and the cache area is then cleared to wait for subsequent event caching.
  • Step 3 The packer packages the initial package.
  • The packer packages the initial package into an encoding package. Since the two-way aggregation in this application is based on aggregation points, the packer first collects the aggregation points of the cache area. If there is no aggregation point, the cache area is directly recycled for the next round of caching. Since there may be multiple aggregation points in a cache area, or even in a single bucket of the cache area, the aggregation points are collected first, sorted by time, and stored in an aggregation point list; the abnormal events with the earliest and latest occurrence times in the cache area are then given, and the location of the cache area is also given, such as a network element, computer room, or link. One-hot encoding is performed on the abnormal events in the cache area. After completing these operations, the packer finishes packaging and obtains the encoding package, which it sends to the aggregator for aggregation.
  • Step 4 Aggregation.
  • The aggregator loads the trained bidirectional aggregation model and performs forward and backward bidirectional denoising around the aggregation points in the encoding package to complete the aggregation.
  • The aggregator receives the encoding package sent by the packer. Since the events have already been one-hot encoded, the aggregator can directly use the bidirectional model obtained through training and release to vectorize the events and perform vector probability calculations on the abnormal events in the package.
  • The aggregator first obtains the list of aggregation points, and then uses the two-way model to obtain, for each aggregation point in the list, the probabilities of the other abnormal events in the package. Events lower than the low probability threshold (which can be set in the interface or a configuration file) are marked, and events higher than the high probability threshold (which may themselves also be aggregation points) are placed in a high probability list, with the key being the aggregation point and the value being the list storing these high-probability abnormal events.
  • Abnormal events below the low probability threshold are not immediately excluded when analyzing each aggregation point, because some abnormal events have a low probability for some aggregation points but a high probability for others. Instead, all abnormal events marked as low probability are reviewed and cleared only if their probabilities for all aggregation points are lower than the minimum probability threshold.
  • After the aggregator has analyzed all aggregation points, it performs a second check on the marked abnormal events below the low probability threshold to see whether they are low probability for every aggregation point; if not, they are retained, otherwise they are removed as noise. After denoising, the aggregator attaches the corresponding high probability list to each aggregation point in the encoding package, combines them into an aggregation package, and sends it to root cause identification. This round of aggregation is then complete.
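  • Tying the earlier sketches together, the flow for one closed cache area might look roughly as follows (illustrative glue code only; context_probability and the other helpers are hypothetical and build on the sketches above):

```python
def context_probability(model, point, event):
    """Hypothetical helper: how likely `event` appears in the context of `point`,
    read off the Skip-gram model's predicted output distribution."""
    ranked = dict(model.predict_output_word([point.name], topn=1000) or [])
    return ranked.get(event.name, 0.0)

def process_cache_area(location, buckets, model, now, period_end):
    """Illustrative end-to-end flow for one cache area."""
    if not should_stop_caching(buckets, now, period_end):
        return None                                    # keep caching
    events = [e for b in buckets for e in b.events]    # contents of the initial package
    points = sorted((e for e in events if e.is_aggregation_point),
                    key=lambda e: e.timestamp)         # aggregation point list
    if not points:
        return None                                    # recycle the cache area
    _, high_prob = classify_events(events, points,
                                   lambda e, p: context_probability(model, p, e))
    return [build_aggregation_package(p, high_prob[p], location) for p in points]
```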
  • Figure 16 shows an electronic device 100 provided by an embodiment of the present application.
  • the electronic device 100 includes: a processor 110, a memory 120, and a computer program stored on the memory 120 and executable on the processor 110.
  • When the computer program is run, it executes the above-mentioned abnormal event processing method.
  • the processor 110 and the memory 120 may be connected through a bus or other means.
  • the memory 120 can be used to store non-transitory software programs and non-transitory computer executable programs, such as the abnormal event processing method described in the embodiments of this application.
  • the processor 110 implements the above-mentioned exception event processing method by running non-transient software programs and instructions stored in the memory 120 .
  • The memory 120 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data involved in the above-mentioned abnormal event processing method. In addition, the memory 120 may include high-speed random access memory and may also include non-transitory memory, such as at least one storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 120 may include memory located remotely relative to the processor 110, and such remote memory may be connected to the electronic device 100 through a network. Examples of such networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the non-transitory software programs and instructions required to implement the above-mentioned exception event processing method are stored in the memory 120.
  • When executed by the processor 110, they perform the above-mentioned abnormal event processing method, for example, method steps S101 to S103 in Figure 1, method steps S201 to S202 in Figure 2, method steps S301 to S303 in Figure 3, method steps S401 to S403 in Figure 4, method steps S501 to S502 in Figure 6, method steps S601 to S602 in Figure 7, method steps S701 to S703 in Figure 8, method steps S801 to S802 in Figure 11, method steps S901 to S902 in Figure 12, method steps S1001 to S1003 in Figure 13, method steps S1101 to S1103 in Figure 14, and method steps S1201 to S1202 in Figure 15.
  • Embodiments of the present application also provide a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to execute the above-mentioned abnormal event processing method.
  • The computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are executed by one or more control processors to perform, for example, method steps S101 to S103 in Figure 1 and the method steps in Figures 2 to 15 described above.
  • the embodiments of the present application at least include the following beneficial effects: the abnormal event processing method, electronic device and storage medium in the embodiment of the present application can continuously obtain multiple abnormal events at the target location within a preset period of time by executing the abnormal event processing method.
  • the target location is a link, a network element or a computer room in space.
  • Abnormal events include at least one of alarms, key performance indicator exceptions, and operation logs, which enables the acquisition of multiple data sources. An aggregation point is then determined among the abnormal events; the aggregation point can be any one of the designated abnormal events.
  • The embodiment of the present application performs aggregation based on the aggregation point and the abnormal events to obtain an aggregation result for root cause analysis. Because the abnormal events are a plurality of events obtained within a period of time, the required abnormal events can be aggregated to the time node of the aggregation point according to the time node and location of the aggregation point, so that the embodiment of the present application can also perform forward aggregation to aggregate other events that may be the root cause of the fault. This achieves two-way aggregation, improves the aggregation capability of the data sources, and improves the level of fault operation and maintenance.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disc (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, storage device storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically include computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
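As a rough illustration of the three steps summarized in the bullets above (S101: obtain abnormal events at a target location within a preset period; S102: determine aggregation points; S103: aggregate around each point in both time directions), the following Python sketch is offered. The event fields and the `is_aggregation_point` rule are illustrative assumptions, not the claimed implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    ts: float        # acquisition timestamp in seconds
    location: str    # spatial dimension: network element, link or computer room
    kind: str        # "alarm", "kpi_anomaly" or "operation_log"
    name: str

def is_aggregation_point(ev: Event) -> bool:
    # Hypothetical screening rule: major alarms / major KPI anomalies act as aggregation points.
    return ev.name in {"cell_out_of_service", "network_unreachable_alarm", "kpi_anomaly"}

def process(events: List[Event], location: str, t_start: float, t_end: float):
    # S101: obtain multiple abnormal events of the target location within the preset period.
    window = [e for e in events if e.location == location and t_start <= e.ts <= t_end]
    # S102: determine aggregation points among the abnormal events.
    points = [e for e in window if is_aggregation_point(e)]
    # S103: aggregate according to each aggregation point and the abnormal events,
    # keeping events on both sides of the point (two-way aggregation).
    return [{"point": p, "context": [e for e in window if e is not p]} for p in points]
```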


Abstract

一种异常事件处理方法、电子设备及存储介质,异常事件处理方法包括:在预设时间段内获取目标位置的多个异常事件,异常事件包括告警、关键性能指标异常和操作日志中的至少一种(S101);在异常事件中确定聚合点(S102);根据聚合点和异常事件进行聚合,得到聚合结果(S103)。

Description

异常事件处理方法、电子设备及存储介质
相关申请的交叉引用
本申请基于申请号为202210678899.7、申请日为2022年06月16日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及但不限于通信技术领域,特别是涉及一种异常事件处理方法、电子设备及存储介质。
背景技术
随着移动通信技术的发展,网络复杂化、应用多样性、数据爆炸导致对智能运维的要求与日俱增。对相关的流式数据进行聚合后再进行分析是识别故障根因主要手段,然而,相关技术中,在对数据源进行聚合的时候,往往只能在告警先发生后,再以告警往后聚合,如果是告警之前事先由于某种操作引发了告警,这样的聚合则无法明确故障根因,因此聚合能力低,导致故障运维水平低下。
发明内容
本申请实施例提供了一种异常事件处理方法、电子设备及存储介质。
第一方面,本申请实施例提供了一种异常事件处理方法,所述方法包括:在预设时间段内获取目标位置的多个异常事件,所述异常事件包括告警、关键性能指标异常和操作日志中的至少一种;在所述异常事件中确定聚合点;根据所述聚合点和所述异常事件进行聚合,得到聚合结果。
第二方面,本申请实施例提供了一种电子设备,包括:存储器、处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现如本申请第一方面实施例中任意一项所述的异常事件处理方法。
第三方面,本申请实施例提供了一种计算机可读存储介质,所述存储介质存储有程序,所述程序被处理器执行实现如本申请第一方面实施例中任意一项所述的异常事件处理方法。
附图说明
图1是本申请一个实施例提供的异常事件处理方法的流程示意图;
图2是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图3是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图4是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图5是本申请一个实施例提供的目标缓存区的示意图;
图6是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图7是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图8是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图9是本申请一个实施例提供的以通信异常为聚合点进行前后双向聚合的示意图;
图10是本申请一个实施例提供的以网络不通告警为聚合点进行前后双向聚合的示意图;
图11是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图12是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图13是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图14是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图15是本申请另一个实施例提供的异常事件处理方法的流程示意图;
图16是本申请一个实施例提供的电子设备的示意图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的实施例仅用以解释本申请,并不用于限定本申请。
在本申请的描述中,需要理解的是,涉及到方位描述,例如上、下、前、后、左、右等指示的方位或位置关系为基于附图所示的方位或位置关系,仅是为了便于描述本申请和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本申请实施例的限制。
应了解,在本申请实施例的描述中,若干的含义为一个以上,多个(或多项)的含义是两个以上,大于、小于、超过等理解为不包括本数,以上、以下、以内等理解为包括本数。如果有描述到“第一”、“第二”等只是用于区分技术特征为目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量或者隐含指明所指示的技术特征的先后关系。
本申请实施例的描述中,除非另有明确的限定,设置、安装、连接等词语应做广义理解,所属技术领域技术人员可以结合技术方案的内容合理确定上述词语在本申请实施例中的含义。
随着5G新基建的不断推进和发展,网络复杂化,应用多样性,数据爆炸,运营商和设备商在自治网络(Autonomous Network)的“规建维优营”几个方面,对自动化和智能化的诉求与日俱增,其中,“维”即是运维,以故障处理为主,对故障的定位,从单一告警数据源的聚合分析定位演进到了多数据源,如日志、关键性能指标(Key Performance Indicator,KPI)和告警的聚合分析。
聚合分析,是指把相关的流式数据进行聚合后再进行分析识别故障根因。聚合有时间空间两个维度,这个也是所谓的时空聚合。在空间维度聚合,可以利用拓扑资源相关性,如同一个网元或者同一条链路、同一个机房的相关数据;在时间维度聚合,即相关数据在一定时间范围内进行聚合,跟空间维度的聚合不同,时间维度的聚合相对来说比较困难,主要是时间范围并不好确定。
如在一个空间维度下,在一定时间窗口内,A告警导致B告警,运营商和设备商根据历史数据的统计得来的经验形成了一些规则,下面以三条规则来举例说明,规则基本格式可以为:
1)第一条规则，同网元，5分钟窗口，射频拉远单元(Remote Radio Unit,RRU)链路误码率高告警和光模块接收光功率异常同时发生，则认为它们两个告警可聚合；
2)第二条规则，同网元，10分钟窗口，光模块接收光功率异常和RRU链路断告警同时发生，则认为它们两个告警可聚合；
3)第三条规则，同网元，15分钟窗口，RRU链路断告警和分布式单元(Distributed Unit,DU)小区退服告警，则认为它们两个告警可聚合。
以上面几条规则看,需要把时空维度的RRU链路误码率高告警、光模块接收光功率异常告警、RRU链路断告警和DU小区退服告警聚合在一起,最后找到根因是RRU链路误码率高导致的DU小区退服。
申请人发现,相关技术中,在时间维度上的时间不好确定,需要时间步长的设计思想,不能以几条规则的最大时间15分钟来确定,也不能以所有相关规则的时间加和来确定,不仅如此,这还有一个关键性依赖,即时间维度上完全依赖向后判定,即A告警发生导致B告警发生,那么A告警发生时间会在B告警发生前,那么是以A告警发生后预测B告警发生的时间。
申请人发现,这种情况下的聚合会存在很多问题,这种情况下,告警发生时间可能会有差异,甚至可能B告警还在A告警之前,并且,无法聚合多数据源的情况,因为如果只是告警和关键性能指标异常,相对来说,容易明确异常发生时刻,即有明确的异常数据源,告警引发关键性能指标劣化或异常,那么告警先发生,以告警往后聚合,但如果是某种操作引起了相关告警,例如某种操作的日志是在告警之前,这样不方便向后聚合,由于日志不方便明确异常,所以无法即时感知,往往是告警或者关键性能指标异常后,再往前回头找相关日志,如内存泄漏这些故障,是已经发现了内存泄漏或者发现泄漏趋势,再往前找相关日志,这样属于事后聚合。
因此,相关技术中的方案存在技术缺陷,如果是告警之前事先由于某种操作引发了告警,这样的聚合则无法明确故障根因,因此聚合能力低,导致故障运维水平低下。
基于此,本申请实施例提供了一种异常事件处理方法、电子设备及存储介质,能够实现双向聚合,提高数据源的聚合能力,提高故障运维水平。
下面进行详细说明。
本申请实施例提供了一种异常事件处理方法,参照图1所示,本申请实施例中的异常事件处理方法包括但不限于步骤S101至步骤S103。
步骤S101,在预设时间段内获取目标位置的多个异常事件,异常事件包括告警、关键性能指标异常和操作日志中的至少一种。
步骤S102,在异常事件中确定聚合点。
步骤S103,根据聚合点和异常事件进行聚合,得到聚合结果。
在一实施例中,本申请实施例中的异常事件处理方法可以应用在通信设备中,通过执行异常事件处理方法,能够实现双向聚合,提高数据源的聚合能力,提高故障运维水平。本申请实施例中可以在预设时间段内获取目标位置的多个异常事件,并在所获取的异常事件中确定聚合点,异常事件包括告警、关键性能指标(KPI)异常和操作日志中的至少一种。在一实施例中,异常事件包括告警或关键性能指标异常中的一种,并且还包括操作日志。又或者,在另一实施例中,异常事件包括告警、关键性能指标异常和操作日志。本申请实施例中以包含上述三者为例子进行说明,聚合点是根据聚合需要设定的点,聚合点的类型可以通过配置 指定。
本申请实施例根据聚合点和异常事件进行聚合,得到聚合结果,可以理解的是,聚合点是众多异常事件中的其中一个或多个,由于异常事件是在预设时间段内不断获取的,因此聚合点所处的时间位于预设时间段的中间,可以理解的是,以聚合点的聚合,在聚合点的获取时间前后,可以包含多个异常事件,这些前后时间的异常事件可以是告警、关键性能指标异常和操作日志中的至少一种,因此,本申请实施例中,可以在聚合点之前事先由于某种操作引发了告警或关键性能指标异常时,可以聚合到聚合点之前和之后的数据,得到的聚合结果可以用于明确故障根因,实现双向聚合,能够提高数据源的聚合能力,提高故障运维水平。
需要说明的是,本申请实施例中的预设时间段可以根据实际运维需要设置,例如,预设时间段可以是20分钟、1小时、4小时或者更长时间,从预设时间段的起始时间开始,本申请实施例就可以开始获取作为数据源的异常事件,包括获取告警、关键性能指标异常和操作日志中的至少一种,实现多数据源的获取,本申请实施例中预设时间段内获取数据源,通过设定预设时间段的时间长短可以在时间维度上明确聚合的时间。
需要说明的是,本申请实施例中的目标位置是空间维度上的一个位置,例如,目标位置可以为一个网元、一个机房或者是一个链路,通过最终得到的聚合结果,通过聚合分析后可以得到该网元、机房或者链路的故障根因。
参照图2所示,在一实施例中,上述步骤S101中还可以包括但不限于步骤S201和步骤S202。
步骤S201,根据预设时间段的总时长在目标缓存区中建立多个时间桶,其中,时间桶由时间戳区间构成,各个时间桶的时长相同且相邻的两个时间桶的时间连续。
步骤S202,连续获取目标位置的多个异常事件,并按照各个异常事件的获取时间缓存在对应时间的时间桶中。
在一实施例中,本申请实施例通过设置时间桶的实现对异常事件的缓存方式,本申请实施例通过根据预设时间段的总时长在目标缓存区中建立多个时间桶,时间桶由时间戳区间构成,各个时间桶的时长相同且相邻的两个时间桶的时间连续,并在连续获取目标位置的多个异常事件时,按照各个异常事件的获取时间缓存在对应时间的时间桶中,实现数据缓存,目标缓存区为与目标位置缓存对应的缓存区,一个目标位置可以对应多个缓存区,或对应一个一一对应的缓存区,本申请实施例用双向时间维度进行聚合,每个缓存区按照时间戳和一定时间区间当作时间桶的方式来缓存异常事件,因此,实现了在缓存异常事件后,并不要立即聚合,还需要等待一定时间,等到时间桶缓存完毕,才开始准备聚合。
需要说明的是,在目标缓存区中,各个时间桶的时长相同且相邻的两个时间桶的时间连续,例如,当一个目标缓存区在预设时间段为20分钟内获取异常事件时,可将每个时间桶的时长设定为5分钟,因此可以得到4个连续的时间桶,其中,第一个时间桶的时间从第0分钟缓存到第5分钟,第二个时间桶从第5分钟缓存到第10分钟,第三个时间桶从第10分钟缓存到第15分钟,第四个时间桶从第15分钟缓存到第20分钟,更长的预设时间段可以以此类推,每个时间桶的时长可以根据实际运维需要设置,在此不做具体限制。
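A minimal sketch of the bucket layout described above: the preset period is divided into contiguous, equal-length timestamp intervals, and each abnormal event is cached in the bucket covering its acquisition time. The 20-minute window and 5-minute bucket length follow the example in the text; everything else is an illustrative assumption.

```python
import bisect

def make_buckets(t_start: float, total_s: float, bucket_s: float):
    """Split the preset period into contiguous, equal-length time buckets."""
    n = int(total_s // bucket_s)
    edges = [t_start + i * bucket_s for i in range(n + 1)]   # timestamp interval boundaries
    return edges, [[] for _ in range(n)]

def cache_event(edges, buckets, ts: float, event) -> None:
    """Cache an abnormal event in the bucket whose timestamp interval contains its acquisition time."""
    i = bisect.bisect_right(edges, ts) - 1
    if 0 <= i < len(buckets):
        buckets[i].append(event)

edges, buckets = make_buckets(0.0, 20 * 60, 5 * 60)              # 20-minute window, 5-minute buckets
cache_event(edges, buckets, 7 * 60, ("alarm", "rru_link_down"))  # lands in the 5-10 minute bucket
```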
可以理解的是,本申请实施例中可以通过设定目标缓存区中时间桶不再缓存的条件,来控制停止缓存异常事件。
参照图3所示，在一实施例中，上述步骤S202中还可以包括但不限于步骤S301至步骤S303。
步骤S301,获取预设时间段内停止缓存异常事件的收敛条件。
步骤S302,从预设时间段的起始时间开始连续获取目标位置的多个异常事件,并按照各个异常事件的获取时间缓存在对应时间的时间桶中。
步骤S303,当缓存的异常事件满足收敛条件,停止缓存异常事件。
在一实施例中,针对目前在时间维度上聚合的时间不好确定的问题,本申请实施例通过设置时间桶的模式来缓存目标位置的异常事件,并通过设定时间桶的收敛条件来控制时间桶停止缓存的时间点,本申请实施例获取预设时间段内停止缓存异常事件的收敛条件,本申请实施例中在缓存异常事件时,从预设时间段的起始时间开始连续获取目标位置的多个异常事件,并按照各个异常事件的获取时间依次缓存在对应时间的时间桶中,并在缓存的异常事件满足收敛条件时,停止缓存异常事件,当满足收敛条件后,说明目标缓存区缓存完毕,即本缓存区关闭不再接收其它异常事件,把时间桶封装后准备聚合,还可以在收敛条件满足后清除目标缓存区,以待后面的异常事件缓存,本申请实施例中通过设置收敛条件来控制什么时候停止缓存异常事件,可以不用浪费数据收集的时间,在时间维度上提高聚合的效率,提高聚合能力。
需要说明的是,从聚合点角度来看,本申请实施例中在根据满足收敛条件关闭目标缓存区以停止缓存异常事件时,当前已经缓存的数据中,已经包含了聚合点时间维度上前后双向的异常事件,即聚合点为中心的前后异常事件均已进入目标缓存区中,提高数据的聚合能力,由此可以进行双向聚合。
参照图4所示,在一实施例中,收敛条件可以包括但不限于步骤S401至步骤S403中至少之一。
步骤S401,获取异常事件的时间超过预设时间段的结束时间。
步骤S402,连续多个时间桶之间缓存异常事件的数量递减速率小于预设的目标递减速率。
步骤S403,时间桶中异常事件的数量小于预设的桶内事件数量最小阈值。
在一实施例中,本申请实施例中的收敛条件可以有多个,可以在时间维度上判断什么时候数据源收集完成,以在时间维度上提高聚合效率和聚合能力,收敛条件可以包括步骤S401至步骤S403中的至少一个,可以理解的是,当满足上述步骤中的收敛条件中的其中一个时,即可判断数据源收集完成,因此停止缓存异常事件。
需要说明的是,判断获取异常事件的时间是否超过预设时间段的结束时间是收敛条件之一,预设时间段有一个起始时间和结束时间,当获取异常事件的时间超过预设时间段的结束时间,说明缓存时间截止,即本次缓存最后一个聚合点的时间到截止时间的时间区间,这个时间区间就是预设时间段的最大值,它限制了过长时间等待,通过本申请实施例的收敛条件实现强制结束缓存,在一实施例中,聚合时间区间最大值即预设时间段设定为60分钟,过了这个时间,就不再等待后续消息。
需要说明的是,判断连续多个时间桶之间缓存异常事件的数量递减速率是否小于预设的目标递减速率是收敛条件之一,当设定连续三个时间桶的异常事件数量以一定速率递减进行判断,低于事件次数递减比率,即低于目标递减速率,也就是后桶数量低于前桶数量一定百分比,当低于目标递减速率时,判断不再需要缓存异常事件,如当目标递减速率为25%,目标递减速率的数值可根据实际需要设置,本申请实施例中的收敛条件实现边际效应递减后结 束缓存,如图5所示,目标缓存区内各个时间桶中分别缓存了用户登录日志、配置路由日志、重启路由日志、通信异常、网络不通告警、关键性能指标异常(图中的KPI异常)、业务异常、业务重启告警等异常事件,其中,通信异常、网络不通告警、关键性能指标异常和业务重启告警为聚合点,在图5的示例中,时间桶4的异常事件数量是时间桶3的三分之一,高于设定的目标递减速率(25%),所以,当前还不能结束缓存,继续接收异常事件。
需要说明的是,判断时间桶中异常事件的数量是否小于预设的桶内事件数量最小阈值是收敛条件之一,时间桶的异常事件数量小于桶内事件数量最小阈值,即桶事件数最小值,桶内事件数量最小阈值可根据实际需要设置,当低于这个桶内事件数量最小阈值时,判断不再需要缓存异常事件,本申请实施例实现了以一定时间内事件收敛后结束缓存,同样以参考图5所示,时间桶4接收的事件只有1个,小于桶内事件数量最小阈值(假设是2),即不用再接收异常事件,停止缓存异常事件,即时间桶5不必再接收,最终完成数据源的收集。
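Read this way, the three convergence conditions reduce to simple checks on the acquisition time and the per-bucket event counts. The sketch below assumes the example values mentioned in the text (a hard end time, a 25% decrement ratio over consecutive buckets, a minimum of 2 events per bucket); it is one possible reading, not the claimed implementation.

```python
def should_stop_caching(now, window_end, bucket_counts,
                        decay_ratio=0.25, min_bucket_events=2, lookback=3):
    """Return True as soon as any of the three convergence conditions holds."""
    # Condition 1: the acquisition time has passed the end of the preset period (hard cutoff).
    if now > window_end:
        return True
    # Condition 2: over the last `lookback` consecutive buckets, each bucket holds less than
    # `decay_ratio` of the events of the bucket before it (diminishing marginal returns).
    if len(bucket_counts) >= lookback:
        tail = bucket_counts[-lookback:]
        if all(tail[i + 1] < decay_ratio * tail[i] for i in range(lookback - 1)):
            return True
    # Condition 3: the latest bucket holds fewer events than the minimum per-bucket threshold.
    if bucket_counts and bucket_counts[-1] < min_bucket_events:
        return True
    return False

# With the Figure 5 example: bucket 4 holds one third of bucket 3, above the 25% decrement
# ratio, so condition 2 does not fire; but bucket 4 has only 1 event, below the assumed
# minimum of 2, so condition 3 closes the buffer and bucket 5 is never opened.
```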
在一实施例中，目标位置有多个，参照图6所示，上述步骤S201中还可以包括但不限于步骤S501和步骤S502。
步骤S501,分别获取各个目标位置对应的预设时间段。
步骤S502,分别建立对应各个目标位置的目标缓存区,并根据各个预设时间段的总时长分别在对应的目标缓存区中建立多个时间桶。
在一实施例中,当目标位置有多个时,本申请实施例分别根据不同的目标位置进行数据源的缓存,每个不同的位置均可以对应设置一个自身缓存需要的预设时间段,本申请实施例分别获取各个目标位置对应的预设时间段,并对每个目标位置的数据进行缓存,分别建立对应各个目标位置的目标缓存区,在一实施例中,目标缓存区与目标位置一一对应,每个目标位置都有对应的一个目标缓存区,并根据各个预设时间段的总时长分别在对应的目标缓存区中建立多个时间桶,实现根据将各个目标位置的异常时间都缓存到对应的时间桶中。
本申请实施例中的目标位置为空间维度上的位置,例如,目标位置可以为一个网元、一条链路或者一个机房,多个目标位置可以包括多个网元、链路和机房,通过对不同的目标位置进行异常事件的收集,可以得到各个不同的目标位置的聚合结果,以便对各个目标位置进行聚合分析,可以理解的是,本申请实施例中可以得到各个目标位置的聚合结果,通过该聚合结果可以分析各个目标位置自身的故障根因,也可以得到多个目标位置整体的聚合结果,通过该聚合结果可以分析得到多个目标位置中的故障根因,提高了聚合能力,提高故障运维水平。
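One straightforward way to keep a separate target buffer area per target location, each with its own preset period and bucket grid, is a mapping keyed by the location. The `window_for` helper and the field names below are assumptions for illustration only.

```python
class LocationBuffer:
    """Per-location buffer area: a preset period split into equal, contiguous time buckets."""
    def __init__(self, t_start: float, total_s: float, bucket_s: float):
        self.t_start, self.bucket_s = t_start, bucket_s
        self.buckets = [[] for _ in range(int(total_s // bucket_s))]
        self.closed = False

    def cache(self, ts: float, event) -> None:
        i = int((ts - self.t_start) // self.bucket_s)
        if not self.closed and 0 <= i < len(self.buckets):
            self.buckets[i].append(event)

buffers = {}  # one target buffer area per target location (network element, link or computer room)

def on_event(location: str, ts: float, event, window_for) -> None:
    # `window_for(location)` is an assumed helper returning the (t_start, total_s, bucket_s)
    # configured for that particular target location.
    if location not in buffers:
        buffers[location] = LocationBuffer(*window_for(location))
    buffers[location].cache(ts, event)
```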
参照图7所示,在一实施例中,上述步骤S102中还可以包括但不限于步骤S601和步骤S602。
步骤S601,获取聚合点的筛选条件。
步骤S602，在多个异常事件中确定满足筛选条件的异常事件为聚合点。
在一实施例中,本申请实施例可以获取聚合点的筛选条件,从异常事件中确定聚合点,通过筛选条件可以从异常事件中确定哪些是重大告警或重大关键性能指标异常,重大告警可以是异常事件中的告警中的任意一个,重大关键性能指标异常可以是异常事件中关键性能指标异常中的任意一个,例如,聚合点如基站退服、小区退服等,聚合点是真正运维的中心,以重大告警或重大关键性能指标异常的聚合点为中心聚合适用于实际运维的需要,否则大量普通告警等的聚合会大量浪费时间,造成运维水平低下,聚合点的告警类型和关键性能指标 异常类型可以通过配置指定。
可以理解的是,本申请实施例中可以根据实际运维需要自定义设定筛选条件,以确定其中的重大告警或重大关键性能指标异常,聚合点是众多异常事件中的其中一个或多个,由于异常事件是在预设时间段内不断获取的,因此聚合点所处的时间位于预设时间段的中间,可以理解的是,以聚合点的聚合,在聚合点的获取时间前后,可以包含多个异常事件,这些前后时间的异常事件可以是告警、关键性能指标异常和操作日志中的至少一种,因此,本申请实施例中,可以在聚合点之前事先由于某种操作引发了告警或关键性能指标异常时,可以聚合到聚合点之前和之后的数据,得到的聚合结果可以用于明确故障根因,实现双向聚合,能够提高数据源的聚合能力,提高故障运维水平。
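The screening conditions for aggregation points (for example, treating base-station or cell out-of-service alarms and major KPI anomalies as aggregation points, with the types specified by configuration) can be expressed as a small lookup. The configuration values below are placeholders, not prescribed types.

```python
# Hypothetical configuration: which alarm types and KPI-anomaly types count as aggregation points.
AGGREGATION_POINT_TYPES = {
    "alarm": {"base_station_out_of_service", "cell_out_of_service", "network_unreachable_alarm"},
    "kpi_anomaly": {"kpi_anomaly"},
}

def satisfies_screening(event) -> bool:
    # `event` is assumed to expose `kind` and `name` attributes, as in the earlier sketch.
    return event.name in AGGREGATION_POINT_TYPES.get(event.kind, set())

def determine_aggregation_points(events):
    return [e for e in events if satisfies_screening(e)]
```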
需要说明的是,相关技术中,不会以某条操作日志作为起始点进行向后聚合,是因为日志太多太频繁,而且大多数操作日志只是为了记录并不是说明异常,所以对操作日志来说,由于操作日志不方便明确异常,所以无法即时感知,往往是告警或者关键性能指标异常后,再往前回头找相关操作日志,如内存泄漏这些故障,是已经发现了内存泄漏或者发现泄漏趋势,再往前找相关操作日志,因此导致聚合能力低下。
参照图8所示,在一实施例中,上述步骤S103中还可以包括但不限于步骤S701至步骤S703。
步骤S701,在异常事件中确定第一目标事件和第二目标事件,其中,第一目标事件表征为聚合点的噪音事件,第二目标事件表征为聚合点的关联事件。
步骤S702,清除第一目标事件并保留第二目标事件。
步骤S703,根据聚合点和第二目标事件进行聚合,得到聚合结果。
在一实施例中,本申请实施例可以去异常事件进行去噪,去除其中没必要的事件,保留与聚合点相关的异常事件,以便提高聚合能力,本申请实施例中可以在异常事件中确定第一异常事件和第二异常事件,第一目标事件表征为聚合点的噪音事件,第二目标事件表征为聚合点的关联事件,作为噪音事件,若与聚合点进行聚合,会使最终的聚合结果的数据量过大,并存在众多对故障根因分析无用的异常事件,因此,本申请实施例中可以确定表征为聚合点的噪音事件,即确定第一目标事件,并确定表征为聚合点的关联事件,即第二目标事件表征,清除第一目标事件并保留第二目标事件,最终可以根据聚合点和第二目标事件进行聚合,得到聚合结果,可以提高本申请实施例的聚合能力,提高故障运维水平。
需要说明的是,由于本申请实施例中的异常事件包含了操作日志,在实际运维的过程中,会存在大量的操作日志,本申请实施例中通过双向聚合,可以得到包含聚合点前后的异常事件以得到聚合结果,即可以得到聚合点前后的操作日志以得到聚合结果,最终可以根据聚合结果进行故障根因找到导致聚合点异常的操作日志等,为了解决操作日志过多且大量与聚合点无关的问题,本申请实施例通过明确异常事件中的第一目标事件和第二目标事件,清除第一目标事件并保留第二目标事件,最终保证了本申请实施例的聚合能力和效率。
以图5中收集的异常事件为例子,当以通信异常这个异常事件作为聚合点时,可以根据图9所示对通信异常进行前后双向聚合,向前可以聚合用户登录日志、配置路由日志和重启路由日志等异常事件,向后聚合可以聚合网络不通告警、关键性能指标异常、业务异常等异常事件,而当以网络不通告警这个异常事件作为聚合点时,可以根据图10所示对通信异常进行前后双向聚合,向前可以聚合用户登录日志、配置路由日志、重启路由日志和通信异常等 异常事件,向后聚合可以聚合关键性能指标异常、业务异常和业务重启告警等异常事件。
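The forward/backward split around an aggregation point shown in Figures 9 and 10 amounts to partitioning the buffered events by timestamp relative to the point. The event names below are illustrative English renderings of the Figure 9 example.

```python
def aggregate_bidirectionally(events, point):
    """Partition the buffered events into those before and after the aggregation point."""
    before = [e for e in events if e["ts"] < point["ts"]]   # e.g. login / routing / restart logs
    after  = [e for e in events if e["ts"] > point["ts"]]   # e.g. follow-on alarms and KPI anomalies
    return {"point": point, "before": before, "after": after}

events = [
    {"ts": 1, "name": "user_login_log"},            {"ts": 2, "name": "configure_route_log"},
    {"ts": 3, "name": "restart_route_log"},         {"ts": 4, "name": "communication_abnormal"},
    {"ts": 5, "name": "network_unreachable_alarm"}, {"ts": 6, "name": "kpi_anomaly"},
    {"ts": 7, "name": "service_abnormal"},          {"ts": 8, "name": "service_restart_alarm"},
]
# Taking "communication_abnormal" as the aggregation point, as in Figure 9:
package = aggregate_bidirectionally(events, events[3])
```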
参照图11所示,在一实施例中,上述步骤S103中还可以包括但不限于步骤S801和步骤S802。
步骤S801,将聚合点和异常事件进行聚合得到聚合包。
步骤S802,对聚合包进行根因识别,并结合各个聚合点对应的异常事件得到聚合点的根因识别结果。
在一实施例中,本申请实施例中可以进行根因识别,得到根因识别结果,本申请实施例中根据将聚合点和异常事件进行聚合得到聚合包,对聚合包进行根因识别,并结合各个聚合点对应的异常事件得到聚合点的根因识别结果,在另一实施例中,本申请实施例中根据聚合点和第二目标事件来进行聚合得到聚合包,通过对第二目标事件进行聚合,得到聚合效率更高的聚合包,把这些有用的异常事件进行根因识别,可以借用聚合点中的第二目标事件和知识库等技术分析哪个异常事件是根因事件,从而提高了故障运维水平。
参照图12所示,在一实施例中,上述步骤S701中还可以包括但不限于步骤S901和步骤S902。
步骤S901,对异常事件进行初始化处理得到初始数据,并将初始数据输入至预设的双向聚合模型中进行概率计算,分别得到各个异常事件与对应的聚合点的噪音概率值。
步骤S902,根据噪音概率值确定异常事件中的第一目标事件和第二目标事件。
在一实施例中,本申请实施例中通过获取预设的双向聚合模型,来确定异常事件中的第一目标事件和第二目标事件,双向聚合模型是一种通过神经网络模型训练得到的数据处理模型,本申请实施例中通过对异常事件进行初始化处理得到初始数据,并将初始数据输入至预设的双向聚合模型中进行概率计算,分别得到各个异常事件与对应的聚合点的噪音概率值,双向聚合模型的输入需要匹配对应的初始数据,以便双向聚合模型进行数据处理,噪音概率值可以表征该异常事件是对应的聚合点的噪音事件的概率大小,通过噪音概率值表征的概率大小就可以确定该异常事件是不是对应的聚合点的噪音事件,从而确定第一目标事件和第二目标事件。
可以理解的是,本申请实施例中的聚合点可以有多个,当聚合点为多个时,每个异常事件均可以通过双向聚合模型进行概率计算,得到针对各个聚合点的噪音概率值,这是由于,有些异常事件对某些聚合点是低概率,但对其它聚合点是高概率,因此将每个异常事件与各个聚合点进行概率计算,可以避免去除一些高概率的异常事件,有助于对所有的聚合点进行聚合。
参照图13所示,在一实施例中,上述步骤S902中还可以包括但不限于步骤S1001至步骤S1003。
步骤S1001,获取各个聚合点的第一概率阈值和第二概率阈值。
步骤S1002,将低于所有第一概率阈值的噪音概率值对应的异常事件确定为第一目标事件。
步骤S1003,将高于任意一个第二概率阈值的噪音概率值对应的异常事件确定为第二目标事件。
在一实施例中,本申请实施例通过设定低概率阈值和高概率阈值来对异常事件进行筛选,本申请实施例可以获取各个聚合点的第一概率阈值和第二概率阈值,第一概率阈值为低概率 阈值,用于是筛选得到异常事件中的第一目标事件,因此将低于所有第一概率阈值的噪音概率值对应的异常事件确定为第一目标事件,第一目标事件为低概率事件,第二概率阈值为高概率阈值,用于筛选得到第二目标事件,将高于任意一个第二概率阈值的噪音概率值对应的异常事件确定为第二目标事件,第二目标事件为高概率事件。
需要说明的是,本申请实施例中将低于低概率阈值的标记,作为低概率阈值的第一概率阈值可以在界面或配置文件设置,根据实际运维需要配置,将高于高概率阈值的异常事件放在一个高概率列表中,作为高概率阈值的第二概率阈值可以在界面或配置文件设置,根据实际运维需要配置,其键值为异常点,值为列表,列表中保存这些高于高概率阈值的异常事件,低于低概率阈值的异常事件在分析每个聚合点时暂时不忙排除,因为有些异常事件对某些聚合点是低概率但对其它聚合点是高概率,因此本申请实施例中在判断哪些异常事件为第一目标事件时,是要求对异常事件的噪音概率值低于所有的聚合点的第一概率阈值才确定为第一目标事件,而判断得到第二目标事件时,异常事件的噪音概率值只需要高于任意一个聚合点的第二概率阈值即可判断为第二目标事件。
在一实施例中,第一概率阈值,以某个聚合点来预测本聚合点的上下文关联事件时,如果某些异常事件的概率很低,对所有聚合点的概率都低于第一概率阈值,如设置成10%,则可以当作噪音去噪;第二概率阈值,以某个聚合点来预测本聚合点的上下文关联事件时,如果相关某些异常事件的概率高于第二概率阈值,如设置成75%,则可以认为相关性很强,可以协助后续根因分析。
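The two thresholds can be applied as follows: an abnormal event becomes a first target event (noise) only when its noise probability value is below the first threshold for every aggregation point, and a second target event when it exceeds the second threshold for any aggregation point. The 10% and 75% defaults follow the example values in the text; `prob` stands in for the bidirectional aggregation model and is an assumption.

```python
def classify_events(events, points, prob, low=0.10, high=0.75):
    """Split events into noise (first target) and associated (second target) events.

    `prob(point, event)` is an assumed callable returning the model's probability that
    `event` belongs to the context of `point`.
    """
    first_target, second_target = [], {}
    for e in events:
        scores = {p: prob(p, e) for p in points}
        if all(s < low for s in scores.values()):
            first_target.append(e)                         # noise for every aggregation point
        for p, s in scores.items():
            if s > high:
                second_target.setdefault(p, []).append(e)  # high-probability list per point
    return first_target, second_target
```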
参照图14所示,在一实施例中,上述步骤S901中还可以包括但不限于步骤S1101至步骤S1103。
步骤S1101,对异常事件进行独热编码,得到初始化后的初始向量数据。
步骤S1102,获取预设的双向聚合模型,其中,双向聚合模型根据获取样本中的样本异常事件、表征为噪音事件的样本目标事件、和样本聚合点,并通过无监督训练后得到。
步骤S1103,将初始向量数据输入至预设的双向聚合模型中进行概率计算,分别得到各个异常事件与对应的聚合点的噪音概率值。
在一实施例中,需要对异常事件进行初始化的向量转换后,才输入到预设的双向聚合模型中进行处理,得到所需要的噪音概率值。双向聚合模型可以预先根据样本中的数据建立得到,本申请实施例由于在聚合中,会有大量的异常事件,而其中对根因分析并非所有事件都有用,有一些事件对聚合点来说,是噪音事件,如一些日常操作的操作日志,闪断告警正好跟异常点在某一时间窗口,它们的存在干扰了聚合分析,因此通过人工智能(Artificial Intelligence,AI)训练,通过一定概率来过滤,能够让聚合分析更加准确。
如同在自然语言处理(Natural Language Processing,NLP)中,词语向量化(Word vecor(word embedding),Word2vec)的跳字模型(Continuous Skip-Gram Model,Skip-gram)模型,使用中心词预测上下文词语的概率的这个原理,本申请实施例使用同样的原理,把异常事件向量化后,通过双向聚合模型得到聚合点对前后时间段的异常事件的概率大小,通过概率阈值进行异常事件去噪。
本申请实施例中可以设置训练器,来加载历史数据,历史数据可以包括样本中的样本异常事件、表征为噪音事件的样本目标事件、和样本聚合点,在训练阶段,对样本中的这些数据进行独热编码(One-hot coding)后,通过无监督训练,当上下文概率最大,损失函数最 小,即可把异常事件向量化,则训练得到双向聚合模型,以后后续应用需要。
本申请实施例中在进行概率计算时，先加载训练好的双向聚合模型，对异常事件进行独热编码，得到初始化后的初始向量数据，将初始向量数据输入至预设的双向聚合模型中进行概率计算，分别得到各个异常事件与对应的聚合点的噪音概率值，输入在进入双向聚合模型前已经通过独热编码完成初始化向量表达，因此可以直接使用通过训练发布得到的双向模型对事件进行概率计算，对异常事件进行向量概率计算，得到各个异常事件与对应的聚合点的噪音概率值。
可以理解的是,本申请实施可使用但不限于Word2vec的Skip-gram模型进行训练,包括神经网络的搭建,得到所需要的双向聚合模型,首先获取样本中的历史数据,或者根据故障处理手册等当作语料库进行独热编码后,用Skip-gram模型,其损失函数为所有概率最小化,这个时候,不同的告警、操作日志等异常事件,它们的相关性是经过训练得到中间隐藏层,这个就是最终需要的模型,训练得到双向聚合模型的步骤在本申请实施例中不做具体描述。
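As a loose analogue of the Skip-gram training described above, the sketch below trains a gensim `Word2Vec` model (skip-gram, `sg=1`; gensim 4.x argument names assumed) on historical event sequences treated as sentences, then uses cosine similarity between an aggregation point and a surrounding event as a rough stand-in for the context probability produced by the bidirectional aggregation model. The real model, its loss function and the toy corpus below are not taken from the patent.

```python
from gensim.models import Word2Vec

# Historical event sequences (per location and window), used here only as a toy corpus.
corpus = [
    ["user_login_log", "configure_route_log", "restart_route_log",
     "communication_abnormal", "network_unreachable_alarm", "kpi_anomaly"],
    ["configure_route_log", "communication_abnormal", "network_unreachable_alarm",
     "service_abnormal", "service_restart_alarm"],
]

# Skip-gram training: predict the context events from the centre event.
model = Word2Vec(sentences=corpus, vector_size=32, window=5, min_count=1, sg=1, epochs=50)

# Rough relatedness of an aggregation point to a surrounding event (a proxy, not a probability).
score = model.wv.similarity("network_unreachable_alarm", "configure_route_log")
```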
参照图15所示,在一实施例中,上述步骤S901中还可以包括但不限于步骤S1201和步骤S1202。
步骤S1201,将多个聚合点按照时间排序并存放在聚合点列表中。
步骤S1202,将初始数据输入至预设的双向聚合模型中,并按照聚合点列表中的各个聚合点分别对异常事件进行概率计算,得到各个异常事件与对应的聚合点的噪音概率值。
在一实施例中,本申请实施例中通过建立聚合点列表来存放聚合点,在目标缓存区停止缓存异常事件后,目标缓存区关闭不再接收其它异常事件,把时间桶封装成初始包准备聚合,然后清除缓存区,以待后面的事件缓存,需要强调的是,从聚合点角度来看,和常规做法单向往后聚合的不同在于,本申请实施例在关闭目标缓存区时,已经包含了前后双向的事件,即聚合点为中心的前后事件均已进入缓存区,本申请实施例中的双向聚合是以聚合点为准来双向聚合,因此可以收集目标缓存区的聚合点,如果没有聚合点,则目标缓存区直接回收用于下一次缓存,如果一个目标缓存区甚至缓存区中某一个桶内可能有多个聚合点,先把聚合点收集起来,按时间排序,存放在聚合点列表中,然后给出缓存区中最早发生时间的异常事件和最迟发生时间的异常事件,此外,还可以给出目标缓存区的位置,如网元、机房,或链路。
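Preparing a closed buffer for aggregation, as described above, amounts to collecting its aggregation points into a time-sorted list and recording the earliest and latest events plus the buffer's location. A small sketch, with the event attributes (`ts`) and field names assumed for illustration:

```python
def prepare_for_aggregation(location, buckets, is_point):
    """Collect aggregation points from a closed buffer area; return None if there are none."""
    events = [e for bucket in buckets for e in bucket]
    points = sorted((e for e in events if is_point(e)), key=lambda e: e.ts)
    if not points:
        return None          # no aggregation point: the buffer is recycled for the next round
    return {
        "location": location,                        # network element, computer room or link
        "points": points,                            # time-ordered aggregation point list
        "earliest": min(events, key=lambda e: e.ts),
        "latest": max(events, key=lambda e: e.ts),
        "events": events,
    }
```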
需要说明的是，在进行概率计算时，本申请实施例先得到聚合点列表，然后对列表中的聚合点通过双向模型得到数据中其它异常事件的噪音概率值，将低于低概率阈值的异常事件进行标记，将高于高概率阈值的异常事件放在一个高概率列表中，其键值为异常点，值为列表，列表中保存这些高于高概率阈值的异常事件，最终在去噪后，可以把聚合点附加上高概率列表，组合成聚合包，发给根因识别。
此外,本申请实施例中的异常事件处理方法可以应用在异常事件处理装置中,简称处理装置,处理装置可以包括:
缓存器:接收外部异常事件进入缓存区,组装成初始包发送给打包器;
打包器:接收初始包,打包成编码包发送给聚合器;
聚合器:接收编码包,以聚合点为中心进行上下文概率训练和预测,去噪,得到聚合包,发送给根因分析;
训练器、缓存器、打包器、聚合器和训练器之间通信连接,通过处理装置执行上述实施 例中的异常事件处理方法时,可以包括以下四步:
第一步:训练器训练双向聚合模型完成事件向量化。
在聚合中,可能会有大量的异常事件,而其中对根因分析并非所有事件都有用,有一些事件对聚合点来说,是噪音事件,如一些日常操作日志,闪断告警正好跟异常点在某一时间窗口,它们的存在干扰了聚合分析,因此通过AI训练,通过一定概率来过滤,能够让聚合分析更加准确。
如同在NLP中,Word2vec的Skip-gram模型,使用中心词预测上下文词语的概率的这个原理,本申请实施例使用同样的原理,把异常事件向量化后,通过双向聚合模型得到聚合点对前后时间段的异常事件的概率大小,通过概率阈值进行异常事件去噪。
训练器加载历史告警、日志和关键性能指标异常以及故障处理手册等当作语料库进行独热编码后,通过无监督训练,当上下文概率最大,损失函数最小,即可把异常事件向量化,则得到双向聚合模型,然后发布模型。
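The one-hot encoding step performed by the trainer can be as simple as assigning each distinct event name an index in a fixed-length vector; the vocabulary below is illustrative only.

```python
def build_one_hot_encoder(corpus):
    """Build a one-hot encoder over every distinct event name seen in the training corpus."""
    vocab = sorted({name for sequence in corpus for name in sequence})
    index = {name: i for i, name in enumerate(vocab)}
    def encode(name):
        vector = [0] * len(vocab)
        vector[index[name]] = 1
        return vector
    return encode

encode = build_one_hot_encoder([["user_login_log", "communication_abnormal", "kpi_anomaly"]])
encode("kpi_anomaly")   # one-hot vector whose single 1 marks this event's position in the vocabulary
```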
第二步:缓存器接收并缓存流式异常事件。
异常事件是流式输入,所以需要缓存一定时间段的异常事件。
缓存器根据不同的空间维度,即不同位置设置不同的缓存区,一个缓存区只能缓存同一个空间维度的异常事件,每个缓存区按照时间戳和一定时间区间当作时间桶的方式来缓存异常事件,如果该事件是聚合点,进行标记。
有了聚合点后,并不要立即聚合,还需要等待一定时间,等到时间桶缓存完毕,才开始准备聚合。一个时间桶为一个时间区间,如五分钟,里面缓存这5分钟的异常事件,下一个时间桶则缓存下一个时间区间,如五分钟的异常事件。
如参考图5,流式异常事件进入后,同一位置一个缓存区,图中每5分钟一个时间桶缓存一批异常事件,不同时间桶可能大小不一样。
一个缓存区由一个或多个时间桶组成,关键是什么时候截止,即本次缓存完毕,可以聚合了,本申请实施例采用三个维度作为收敛条件完成最后一个时间桶的缓存:
缓存时间截止,即本次缓存最后一个聚合点的时间到截止时间的时间区间,这个时间区间就是时间区间最大值,它限制了过长时间等待,这个做法是强制结束;
连续三个时间桶的异常事件数量以一定速率递减,低于事件次数递减比率,即后桶数量低于前桶数量一定百分比,如25%,这个数字可设置,这个做法是边际效应递减后结束,以参考图5示意,时间桶4的事件数量是时间桶3的三分之一,所以,还不能结束,继续接收异常事件;
时间桶的异常事件数量小于桶内事件数量最小阈值,即桶事件数最小值,这个值可设置,这个做法是一定时间内事件收敛后结束,同样以参考图5示意,时间桶4接收的事件只有1个,小于桶内事件数量最小阈值(假设是2),即不用再接收,即时间桶5不必再接收异常数据。
当上述三个条件任何一个条件满足后,本缓存区缓存完毕,即本缓存区关闭不再接收其它异常事件,把时间桶封装成初始包发给打包器准备聚合,然后清除缓存区,以待后面的事件缓存。
需要强调的是,从聚合点角度来看,和常规做法单向往后聚合的不同在于,本申请实施例关闭缓存区时,已经包含了前后双向的事件,即聚合点为中心的前后事件均已进入缓存区。
第三步:打包器进行初始包进行打包。
缓存完毕后,打包器对初始包进行打包成聚合包,本申请的双向聚合是以聚合点为准来双向聚合,因此打包器首先收集本缓存区的聚合点,如果没有聚合点,则本缓存区直接回收用于下一次缓存,如果一个缓存区甚至缓存区中某一个桶内可能有多个聚合点,先把聚合点收集起来,按时间排序,然后给出缓存区中最早发生时间的异常事件和最迟发生时间的异常事件,给出本缓存区的位置,如网元、机房,或链路,对本缓存区的异常事件进行独热编码,完成上述操作后,打包器打包完毕,得到编码包,打包器把编码包发送给聚合器进行聚合。
第四步:聚合。
聚合器加载训练好的双向聚合模型对聚合包中的聚合点向前和向后双去噪完成聚合。
在第三步完毕时,聚合器收到打包器发送过来的编码包,由于已经独热编码了,可以直接使用通过训练发布得到的双向模型对事件进行向量化,对包内异常事件进行向量概率计算。
聚合器先得到聚合点列表,然后对列表中的聚合点通过双向模型得到本包中其它异常事件的概率,将低于低概率阈值(可在界面或配置文件设置)的标记,将高于高概率阈值的异常事件(可能也是聚合点)放在一个高概率列表中,其键值为异常点,值为列表,列表中保存这些高于高概率阈值的异常事件。
注意,低于低概率阈值的异常事件在分析每个聚合点时暂时不忙排除,因为有些异常事件对某些聚合点是低概率但对其它聚合点是高概率,当该编码包中所有聚合点分析完毕后,对所有标记低概率的异常事件进行查看,如果其对所有聚合点概率都低于最低概率阈值,则清除。
聚合器在分析完所有聚合点后,还可以再将标记的低于低概率阈值的异常事件进行二度检查,看它们是否对每个聚合点都是低于低概率,如果否,则保留,否则进行去噪清除,去噪后,聚合器把编码包中的聚合点附加上高概率列表,组合成聚合包,发给根因识别,本次聚合完毕。
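Putting the four steps together, a highly simplified pipeline might look like the sketch below. The cacher, packer and aggregator are collapsed into one function, bucket handling and convergence are omitted, the `noise_prob` model call is assumed, and root cause identification is left as a hand-off; none of this reproduces the claimed device.

```python
def run_pipeline(raw_events, is_point, noise_prob, high=0.75):
    """Cache -> pack -> aggregate (two-way, with denoising) -> hand off to root cause analysis."""
    # Cacher: group the streamed events by target location (buckets and convergence omitted).
    by_location = {}
    for e in raw_events:
        by_location.setdefault(e["location"], []).append(e)

    packages = []
    for location, events in by_location.items():
        # Packer: collect the aggregation points of this buffer, sorted by time.
        points = sorted((e for e in events if is_point(e)), key=lambda e: e["ts"])
        for point in points:
            # Aggregator: keep the events whose context probability for this point exceeds
            # the high threshold; these form the high-probability list attached to the point.
            related = [e for e in events if e is not point and noise_prob(point, e) > high]
            packages.append({"location": location, "point": point, "related": related})

    # The aggregation packages are then handed to root cause identification.
    return packages
```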
图16示出了本申请实施例提供的电子设备100。电子设备100包括:处理器110、存储器120及存储在存储器120上并可在处理器110上运行的计算机程序,计算机程序运行时用于执行上述的异常事件处理方法。
处理器110和存储器120可以通过总线或者其他方式连接。
存储器120作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序,如本申请实施例描述的异常事件处理方法。处理器110通过运行存储在存储器120中的非暂态软件程序以及指令,从而实现上述的异常事件处理方法。
存储器120可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储执行上述的异常事件处理方法。此外,存储器120可以包括高速随机存取存储器120,还可以包括非暂态存储器120,例如至少一个储存设备存储器件、闪存器件或其他非暂态固态存储器件。在一些实施方式中,存储器120可包括相对于处理器110远程设置的存储器120,这些远程存储器120可以通过网络连接至该电子设备100。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
实现上述的异常事件处理方法所需的非暂态软件程序以及指令存储在存储器120中,当被一个或者多个处理器110执行时,执行上述的异常事件处理方法,例如,执行图1中的方 法步骤S101至步骤S103、图2中的方法步骤S201至步骤S202、图3中的方法步骤S301至步骤S303、图4中的方法步骤S401至步骤S403、图6中的方法步骤S501至步骤S502、图7中的方法步骤S601至步骤S602、图8中的方法步骤S701至步骤S703、图11中的方法步骤S801至步骤S802、图12中的方法步骤S901至步骤S902、图13中的方法步骤S1001至步骤S1003、图14中的方法步骤S1101至步骤S1103、图15中的方法步骤S1201至步骤S1202。
本申请实施例还提供了计算机可读存储介质,存储有计算机可执行指令,计算机可执行指令用于执行上述的异常事件处理方法。
在一实施例中,该计算机可读存储介质存储有计算机可执行指令,该计算机可执行指令被一个或多个控制处理器执行,例如,执行图1中的方法步骤S101至步骤S103、图2中的方法步骤S201至步骤S202、图3中的方法步骤S301至步骤S303、图4中的方法步骤S401至步骤S403、图6中的方法步骤S501至步骤S502、图7中的方法步骤S601至步骤S602、图8中的方法步骤S701至步骤S703、图11中的方法步骤S801至步骤S802、图12中的方法步骤S901至步骤S902、图13中的方法步骤S1001至步骤S1003、图14中的方法步骤S1101至步骤S1103、图15中的方法步骤S1201至步骤S1202。
本申请实施例至少包括以下有益效果:本申请实施例中的异常事件处理方法、电子设备及存储介质,通过执行异常事件处理方法,可以在预设时段段内不断获取目标位置的多个异常事件,目标位置是空间上一条链路、一个网元或者一个机房,异常事件包括告警、关键性能指标异常和操作日志中的至少一种,实现了多数据源的获取,随后在异常事件中确定聚合点,聚合点可以为其中的任意一个标定的异常事件,在聚合的时候,本申请实施例可以根据聚合点进行聚合,根据聚合点和异常事件进行聚合得到聚合结果,以便进行根因分析,由于是在一段时间内获取的多个异常事件,在聚合的时候,根据聚合点所处的时间节点和位置等可以向该聚合点的时间节点以前聚合所需要的异常事件,使得本申请实施例在聚合的时候,不仅可以向后聚合,还可以向前聚合,将其他可能为故障根因的事件聚合起来,实现双向聚合,能够提高数据源的聚合能力,提高故障运维水平。
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、系统可以被实施为软件、固件、硬件及其适当的组合。某些物理组件或所有物理组件可以被实施为由处理器,如中央处理器、数字信号处理器或微处理器执行的软件,或者被实施为硬件,或者被实施为集成电路,如专用集成电路。这样的软件可以分布在计算机可读介质上,计算机可读介质可以包括计算机存储介质(或非暂时性介质)和通信介质(或暂时性介质)。如本领域普通技术人员公知的,计算机存储介质包括在用于存储信息(诸如计算机可读指令、数据结构、程序模块或其他数据)的任何方法或技术中实施的易失性和非易失性、可移除和不可移除介质。计算机存储介质包括但不限于RAM、ROM、EEPROM、闪存或其他存储器技术、CD-ROM、数字多功能盘(DVD)或其他光盘存储、磁盒、磁带、储存设备存储或其他磁存储装置、或者可以用于存储期望的信息并且可以被计算机访问的任何其他的介质。此外,本领域普通技术人员公知的是,通信介质通常包括计算机可读指令、数据结构、程序模块或者诸如载波或其他传输机制之类的调制数据信号中的其他数据,并且可包括任何信息递送介质。
还应了解,本申请实施例提供的各种实施方式可以任意进行组合,以实现不同的技术效果。
以上是对本申请的若干实施方式进行了说明,但本申请并不局限于上述实施方式,熟悉本领域的技术人员在不违背本申请范围的共享条件下还可作出种种等同的变形或替换,这些等同的变形或替换均包括在本申请权利要求所限定的范围内。

Claims (14)

  1. 一种异常事件处理方法,包括:
    在预设时间段内获取目标位置的多个异常事件,所述异常事件包括告警、关键性能指标异常和操作日志中的至少一种;
    在所述异常事件中确定聚合点;
    根据所述聚合点和所述异常事件进行聚合,得到聚合结果。
  2. 根据权利要求1所述的异常事件处理方法,其中,所述在预设时间段内获取目标位置的多个异常事件,包括:
    根据预设时间段的总时长在目标缓存区中建立多个时间桶,其中,所述时间桶由时间戳区间构成,各个所述时间桶的时长相同且相邻的两个所述时间桶的时间连续;
    连续获取目标位置的多个异常事件,并按照各个所述异常事件的获取时间缓存在对应时间的所述时间桶中。
  3. 根据权利要求2所述的异常事件处理方法,其中,所述连续获取目标位置的多个异常事件,并按照各个所述异常事件的获取时间缓存在对应时间的所述时间桶中,包括:
    获取所述预设时间段内停止缓存所述异常事件的收敛条件;
    从所述预设时间段的起始时间开始连续获取目标位置的多个异常事件,并按照各个所述异常事件的获取时间缓存在对应时间的所述时间桶中;
    当缓存的所述异常事件满足所述收敛条件,停止缓存所述异常事件。
  4. 根据权利要求3所述的异常事件处理方法,其中,所述收敛条件包括以下至少之一:
    获取所述异常事件的时间超过所述预设时间段的结束时间;或
    连续多个所述时间桶之间缓存所述异常事件的数量递减速率小于预设的目标递减速率;或
    所述时间桶中所述异常事件的数量小于预设的桶内事件数量最小阈值。
  5. 根据权利要求3所述的异常事件处理方法,其中,所述目标位置有多个,所述根据预设时间段的总时长在目标缓存区中建立多个时间桶,包括:
    分别获取各个所述目标位置对应的预设时间段;
    分别建立对应各个所述目标位置的目标缓存区,并根据各个所述预设时间段的总时长分别在对应的所述目标缓存区中建立多个时间桶。
  6. 根据权利要求1所述的异常事件处理方法,其中,所述在所述异常事件中确定聚合点,包括:
    获取聚合点的筛选条件;
    在多个所述异常事件中确定满足所述筛选条件的所述异常事件为所述聚合点。
  7. 根据权利要求1所述的异常事件处理方法,其中,所述根据所述聚合点和所述异常事件进行聚合,得到聚合结果,包括:
    在所述异常事件中确定第一目标事件和第二目标事件,其中,所述第一目标事件表征为所述聚合点的噪音事件,所述第二目标事件表征为所述聚合点的关联事件;
    清除所述第一目标事件并保留所述第二目标事件;
    根据所述聚合点和所述第二目标事件进行聚合,得到聚合结果。
  8. 根据权利要求1或7所述的异常事件处理方法,其中,所述根据所述聚合点和所述异常事件进行聚合,得到聚合结果,包括:
    将所述聚合点和所述异常事件进行聚合得到聚合包;
    对所述聚合包进行根因识别,并结合各个所述聚合点对应的所述异常事件得到所述聚合点的根因识别结果。
  9. 根据权利要求7所述的异常事件处理方法,其中,所述在所述异常事件中确定第一目标事件和第二目标事件,包括:
    对所述异常事件进行初始化处理得到初始数据,并将所述初始数据输入至预设的双向聚合模型中进行概率计算,分别得到各个所述异常事件与对应的所述聚合点的噪音概率值;
    根据所述噪音概率值确定所述异常事件中的第一目标事件和第二目标事件。
  10. 根据权利要求9所述的异常事件处理方法,其中,所述根据所述噪音概率值确定所述异常事件中的第一目标事件和第二目标事件,包括:
    获取各个所述聚合点的第一概率阈值和第二概率阈值;
    将低于所有所述第一概率阈值的所述噪音概率值对应的所述异常事件确定为第一目标事件;
    将高于任意一个所述第二概率阈值的所述噪音概率值对应的所述异常事件确定为第二目标事件。
  11. 根据权利要求9所述的异常事件处理方法,其中,所述对所述异常事件进行初始化处理得到初始数据,并将所述初始数据输入至预设的双向聚合模型中进行概率计算,分别得到各个所述异常事件与对应的所述聚合点的噪音概率值,包括:
    对所述异常事件进行独热编码,得到初始化后的初始向量数据;
    获取预设的双向聚合模型,其中,所述双向聚合模型根据获取样本中的样本异常事件、表征为噪音事件的样本目标事件、和样本聚合点,并通过无监督训练后得到;
    将所述初始向量数据输入至预设的所述双向聚合模型中进行概率计算,分别得到各个所述异常事件与对应的所述聚合点的噪音概率值。
  12. 根据权利要求9所述的异常事件处理方法,其中,所述将所述初始数据输入至预设的双向聚合模型中进行概率计算,分别得到各个所述异常事件与对应的所述聚合点的噪音概率值,包括:
    将多个所述聚合点按照时间排序并存放在聚合点列表中;
    将所述初始数据输入至预设的双向聚合模型中,并按照所述聚合点列表中的各个所述聚合点分别对所述异常事件进行概率计算,得到各个所述异常事件与对应的所述聚合点的噪音概率值。
  13. 一种电子设备,包括:存储器、处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时如实现权利要求1至12中任意一项所述的异常事件处理方法。
  14. 一种计算机可读存储介质,所述存储介质存储有程序,所述程序被处理器执行实现如权利要求1至12中任意一项所述的异常事件处理方法。
PCT/CN2023/099448 2022-06-16 2023-06-09 异常事件处理方法、电子设备及存储介质 WO2023241484A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210678899.7 2022-06-16
CN202210678899.7A CN117290133A (zh) 2022-06-16 2022-06-16 异常事件处理方法、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023241484A1 true WO2023241484A1 (zh) 2023-12-21

Family

ID=89192200

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/099448 WO2023241484A1 (zh) 2022-06-16 2023-06-09 异常事件处理方法、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN117290133A (zh)
WO (1) WO2023241484A1 (zh)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6816461B1 (en) * 2000-06-16 2004-11-09 Ciena Corporation Method of controlling a network element to aggregate alarms and faults of a communications network
CN103001811A (zh) * 2012-12-31 2013-03-27 北京启明星辰信息技术股份有限公司 故障定位方法和装置
US20190207962A1 (en) * 2017-12-28 2019-07-04 Microsoft Technology Licensing, Llc Enhanced data aggregation techniques for anomaly detection and analysis
US20190243352A1 (en) * 2018-02-08 2019-08-08 Johnson Controls Technology Company Building management system to detect anomalousness with temporal profile
CN110609759A (zh) * 2018-06-15 2019-12-24 华为技术有限公司 一种故障根因分析的方法及装置
CN110708204A (zh) * 2019-11-18 2020-01-17 上海维谛信息科技有限公司 一种基于运维知识库的异常处理方法、系统、终端及介质
CN110825769A (zh) * 2019-10-11 2020-02-21 苏宁金融科技(南京)有限公司 一种数据指标异常的查询方法和系统
CN114365094A (zh) * 2019-09-23 2022-04-15 谷歌有限责任公司 使用倒排索引的时序异常检测
CN114510364A (zh) * 2022-02-11 2022-05-17 青岛特来电新能源科技有限公司 文本聚类结合链路调用的异常数据根因分析方法及装置
CN114584452A (zh) * 2020-11-16 2022-06-03 华为技术服务有限公司 处理故障的方法、装置及系统

Also Published As

Publication number Publication date
CN117290133A (zh) 2023-12-26

Similar Documents

Publication Publication Date Title
US10649838B2 (en) Automatic correlation of dynamic system events within computing devices
CN109961204B (zh) 一种微服务架构下业务质量分析方法和系统
CN107370806B (zh) Http状态码监控方法、装置、存储介质和电子设备
CN103532940B (zh) 网络安全检测方法及装置
CN104584483B (zh) 用于自动确定服务质量降级的原因的方法和设备
CN103532776A (zh) 业务流量检测方法及系统
CN111277511B (zh) 传输速率控制方法、装置、计算机系统及可读存储介质
CN111966289B (zh) 基于Kafka集群的分区优化方法和系统
CN115038088B (zh) 一种智能网络安全检测预警系统和方法
CN111953568B (zh) 丢包信息管理方法与装置
CN112769605B (zh) 一种异构多云的运维管理方法及混合云平台
WO2022111068A1 (zh) Rru欠压风险预测方法、装置、系统、设备及介质
US20190104028A1 (en) Performance monitoring at edge of communication networks using hybrid multi-granular computation with learning feedback
CN111756560A (zh) 一种数据处理方法、装置及存储介质
CN113904977A (zh) 多链路网关数据传输方法、装置、电子设备和可读介质
WO2024088025A1 (zh) 一种基于多维数据的5gc网元自动化纳管方法及装置
WO2023241484A1 (zh) 异常事件处理方法、电子设备及存储介质
CN117336228A (zh) 一种基于机器学习的igp仿真推荐方法、装置及介质
CN105446707B (zh) 一种数据转换方法
CN116170203A (zh) 一种安全风险事件的预测方法及系统
CN115883392A (zh) 算力网络的数据感知方法、装置、电子设备及存储介质
CN111461451B (zh) 一种配电通信网络的运维方法、装置及存储介质
CN113254313A (zh) 一种监控指标异常检测方法、装置、电子设备及存储介质
CN112560992B (zh) 优化图片分类模型的方法、装置、电子设备及存储介质
Gao et al. The diagnosis of wired network malfunctions based on big data and traffic prediction: An overview

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23823052

Country of ref document: EP

Kind code of ref document: A1