CN114128226A

CN114128226A - Root cause analysis and automation using machine learning

Info

Publication number: CN114128226A
Application number: CN202080040125.0A
Authority: CN
Inventors: 金杉; P.马达迪; V.钱德拉塞卡尔; E.M.约翰逊; 张建中; 朴容奭; R.D.福德
Original assignee: Samsung Electronics Co Ltd
Current assignee: Samsung Electronics Co Ltd
Priority date: 2019-05-30
Filing date: 2020-06-01
Publication date: 2022-03-01
Also published as: EP3921980A4; EP3921980A1; KR20220003513A; US20200382361A1; US11496353B2; WO2020242275A1

Abstract

A method for discovering and diagnosing network anomalies. The method includes receiving Key Performance Indicator (KPI) data and warning data. The method includes extracting features based on samples obtained by discretizing KPI data and warning data. The method includes generating a rule set based on the feature. The method includes identifying the sample as a normal sample or an abnormal sample. In response to identifying the sample as an anomalous sample, the method includes identifying a first rule corresponding to the sample, wherein the first rule indicates a symptom and a root cause of an anomaly included in the sample. The method also includes applying the root cause to derive a root cause interpretation of the anomaly and performing a corrective action based on the first rule to resolve the anomaly.

Description

Root cause analysis and automation using machine learning

Technical Field

The present disclosure relates generally to machine learning-based root cause analysis of anomalies in cellular networks. More particularly, the present disclosure relates to discovering, diagnosing and recovering from network anomalies.

Background

Recently, the number of subscribers to wireless communication services, such as cellular networks, has exceeded 50 billion and continues to grow rapidly. The demand for wireless data traffic is rapidly increasing due to the increasing popularity of smart phones and other mobile data devices, such as tablet computers, "notepad" computers, netbooks, e-book readers, and machine type devices, among consumers and businesses.

The explosive demand for mobile data traffic presents significant operational challenges to mobile network operators in view of bandwidth and infrastructure limitations. Any change in network conditions, such as higher radio frequency interference, can negatively impact user experience, such as increased pauses in streaming media content. Thus, to enhance the user experience, the service provider must quickly discover the anomaly, uncover the underlying root cause of the anomaly, and apply remedial actions. Cellular networks typically take a significant amount of time and labor to detect and repair network anomalies. This inevitably results in a prolonged network outage and a reduced end-user quality of experience.

Disclosure of Invention

Technical problem

The present disclosure provides root cause analysis and automation using machine learning.

Problem solving scheme

In one embodiment, an apparatus for discovering and diagnosing network anomalies is provided. The apparatus includes a communication interface and a processor. The communication interface is configured to receive Key Performance Indicator (KPI) data and warning data. The processor is configured to extract features based on samples obtained by discretizing the KPI data and the warning data. The processor is configured to generate a set of rules based on the feature, wherein a portion of the sample that satisfies the rules corresponds to the anomaly. The processor is configured to identify a sample of the samples as a normal sample or an abnormal sample based on the KPI data and the warning data. In response to identifying the sample as an anomalous sample, the processor is configured to identify a first rule corresponding to the sample, wherein the first rule indicates a symptom and a root cause of an anomaly included in the sample. The processor is configured to identify an application root cause to derive a root cause explanation for the anomaly based on KPIs associated with symptoms and the root cause of the anomaly. The processor is configured to perform a corrective action based on the first rule to resolve the exception.

In another embodiment, a method is provided. The method includes receiving KPI data and warning data. The method includes extracting features based on samples obtained by discretizing KPI data and warning data. The method includes generating a set of rules based on the feature, wherein a portion of the sample that satisfies the rules corresponds to the anomaly. The method includes identifying the sample as a normal sample or an abnormal sample. In response to identifying a sample as an anomalous sample, the method includes identifying a first rule corresponding to the sample. The first rule indicates a sign and a root cause of an anomaly contained in the sample. The method includes applying the root cause to derive a root cause explanation for the anomaly based on KPIs associated with symptoms and root causes of the anomaly. The method includes performing a corrective action based on a first rule to resolve the anomaly.

Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.

Drawings

For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, wherein like reference numbers represent like parts:

FIG. 1 illustrates an example computing system in accordance with various embodiments of the present disclosure;

2a-2b and 3a-3b illustrate example devices in a computing system according to various embodiments of the present disclosure;

FIG. 4a illustrates a root cause analysis framework for using machine learning, in accordance with various embodiments of the present disclosure;

FIG. 4b illustrates a block diagram of flow prediction based on anomaly detection, in accordance with various embodiments of the present disclosure;

FIG. 4c illustrates a block diagram of traffic prediction based on anomaly prediction, in accordance with various embodiments of the present disclosure;

FIG. 4d illustrates an example framework of a quantile regression forest according to various embodiments of the present disclosure;

fig. 5a illustrates a block diagram for selecting a discretization parameter threshold in accordance with various embodiments of the present disclosure;

FIG. 5b illustrates an example KPI tree hierarchy in accordance with various embodiments of the present disclosure;

FIG. 5c illustrates a flow diagram for constructing a KPI tree hierarchy in accordance with various embodiments of the present disclosure;

FIG. 5d illustrates an example output of a KPI tree hierarchy according to various embodiments of the present disclosure;

FIG. 5e illustrates a graph for correlating key quality indicator anomalies with warning data, in accordance with various embodiments of the present disclosure;

FIG. 5f shows a diagram of alerts and time ordering rules (comparisons) for PM data, according to various embodiments of the present disclosure;

FIG. 5g illustrates a flow diagram for rank regularization and union processing of alerts and PM data using historical data according to various embodiments of the present disclosure;

FIG. 5h illustrates a flow diagram for ordering regularization and joint processing of alerts and PM data in real time, according to various embodiments of the present disclosure;

FIG. 6 illustrates a process for generating a root cause explanation in accordance with various embodiments of the present disclosure;

FIG. 7 illustrates an example decision tree for root cause analysis correlating a certain KQI anomaly, in accordance with various embodiments of the present disclosure; and

fig. 8 illustrates an example method for discovering and diagnosing network anomalies according to various embodiments of the present disclosure.

Detailed Description

Before proceeding with the following detailed description, it may be advantageous to set forth definitions of certain words and phrases used in this patent document. The term "couple" and its derivatives refer to any direct or indirect communication between two or more elements, whether or not those elements are in physical contact with one another. The terms "transmit," "receive," and "communicate," as well as derivatives thereof, include both direct and indirect communication. The terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation. The term "or" is inclusive, meaning and/or. The phrase "associated with … …," and derivatives thereof, means including, included in … …, interconnected with … …, contained within … …, connected to … … or connected to … …, coupled to … … or coupled to … …, communicable with … …, cooperative with … …, cross-linked, juxtaposed, proximate, bound or bound to … …, having a property of … …, having a relationship to … …, and the like. The term "controller" refers to any device, system, or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. The phrase "at least one of … …, when used with a list of items means that different combinations of one or more of the listed items can be used and only one item in the list may be required. For example, "at least one of A, B and C" includes any combination of: A. b, C, respectively; a and B; a and C; b and C; and A, B and C.

Further, the various functions described below may be implemented or supported by one or more computer programs, each computer program formed from computer readable program code and embodied in a computer readable medium. The terms "application" and "program" refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in suitable computer readable program code. The phrase "computer readable program code" includes any type of computer code, including source code, object code, and executable code. The phrase "computer readable medium" includes any type of medium capable of being accessed by a computer, such as Read Only Memory (ROM), Random Access Memory (RAM), a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), or any other type of memory. A "non-transitory" computer-readable medium does not include a wired, wireless, optical, or other communication link that transmits transitory electrical or other signals. Non-transitory computer-readable media include media that can permanently store data and media that can store data and later overwrite, such as a rewritable optical disc or an erasable memory device.

Definitions for certain other words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many, if not most instances, such definitions apply to prior, as well as future uses of such defined words and phrases.

Figures 1 through 8, discussed below, and the various embodiments used to describe the principles of the present disclosure in this patent document are by way of illustration only and should not be construed in any way to limit the scope of the disclosure. Those skilled in the art will understand that the principles of the present disclosure may be implemented in any suitably arranged wireless communication system.

Data traffic has increased since the deployment of 4G communication systems. In order to meet the demand for wireless data traffic, efforts have been made to develop improved 5G or pre-5G (pre-5G) communication systems. Accordingly, the 5G or pre-5G communication system is also referred to as a "super 4G network" or a "post-LTE system".

The 5G communication system is considered to be implemented in a higher frequency (millimeter wave (mmWave), e.g., 60GHz band) band to achieve a higher data rate. In order to reduce propagation loss of radio waves and increase transmission distance, beamforming, massive Multiple Input Multiple Output (MIMO), full-dimensional MIMO (FD-MIMO), array antenna, analog beamforming, massive antenna technology may be used for the 5G communication system.

Further, in the 5G communication system, development of improvements to a system network based on advanced small cells, cloud Radio Access Network (RAN), ultra-dense network, device-to-device (D2D) communication, wireless backhaul, mobile network, cooperative communication, coordinated multipoint (CoMP), reception-side interference cancellation, and the like is underway. In 5G systems, hybrid FSK and QAM modulation (FQAM) and Sliding Window Superposition Coding (SWSC) have been developed as Advanced Coding Modulation (ACM), and filter bank multi-carrier (FBMC), non-orthogonal multiple access (NOMA), and Sparse Code Multiple Access (SCMA) as advanced access techniques.

Fig. 1 illustrates an example wireless network 100 in accordance with this disclosure. The embodiment of the wireless network 100 shown in fig. 1 is for illustration only. Other embodiments of wireless network 100 may be used without departing from the scope of this disclosure.

The wireless network 100 includes an enodeb (eNB)101, an eNB 102, and an eNB 103. The eNB 101 communicates with the eNB 102 and the eNB 103. The eNB 101 also communicates with at least one Internet Protocol (IP) network 130, such as the internet, a private IP network, or other data network. In certain embodiments, wireless network 100 includes a server that maintains enbs.

Depending on the network type, the term "base station" or "BS" may refer to any component (or collection of components) configured to provide wireless access to a network, such as a Transmission Point (TP), a Transmission Reception Point (TRP), an enhanced base station (eNodeB or eNB or gNB), a macrocell, a femtocell, a WiFi Access Point (AP), or other wireless-enabled device. The base station may provide wireless access in accordance with one or more wireless communication protocols, e.g., 5G 3GPP new radio interface/access (NR), Long Term Evolution (LTE), LTE-advanced (LTE-a), High Speed Packet Access (HSPA), Wi-Fi 802.11a/b/G/n/ac, etc. For convenience, in this patent document, the terms "BS" and "TRP" are used interchangeably to refer to network infrastructure components that provide wireless access to a remote terminal. Furthermore, the term "user equipment" or "UE" may refer to any component, such as a "mobile station," "subscriber station," "remote terminal," "wireless terminal," "reception point," or "user equipment," depending on the type of network. For convenience, the terms "user equipment" and "UE" are used in this patent document to refer to a remote wireless device that wirelessly accesses a BS, whether the UE is a mobile device (such as a mobile phone or smartphone) or generally considered a stationary device (such as a desktop computer or vending machine).

eNB 102 provides wireless broadband access to network 130 for a first plurality of User Equipments (UEs) within coverage area 120 of eNB 102. The first plurality of UEs includes UE 111, which may be located in a Small Business (SB); a UE 112, which may be located in enterprise (E); UE 113, which may be located in a WiFi Hotspot (HS); UE 114, which may be located in a first residence (R); a UE 115, which may be located in a second residence (R); and a UE116, which may be a mobile device (M) such as a cellular phone, wireless laptop, wireless PDA, server, etc. eNB 103 provides wireless broadband access to network 130 for a second plurality of UEs within coverage area 125 of eNB 103. The second plurality of UEs includes UE 115 and UE 116. In some embodiments, one or more of the eNBs 101-103 may communicate with each other and the UE 111-116 using 5G, Long Term Evolution (LTE), LTE-A, WiMAX, or other advanced wireless communication technologies.

The dashed lines illustrate the general extent of

coverage areas

120 and 125, and

coverage areas

120 and 125 are shown as being generally circular for purposes of illustration and explanation only. It should be clearly understood that the coverage areas associated with the enbs, such as

coverage areas

120 and 125, may have other shapes, including irregular shapes, depending on the configuration of the enbs and the variations in the radio environment associated with natural and man-made obstructions.

As described in more detail below, cellular networks use a significant amount of time and labor to detect and repair various network anomalies. Thus, human intervention may result in network outage times that are extended and degrade the end user experience. Embodiments of the present disclosure provide analytical methods for efficiently and accurately automating the discovery and diagnosis of network anomalies. For example, embodiments of the present disclosure describe a process that uses machine learning to discover network anomalies and reveal the root causes of the anomalies.

Although fig. 1 shows one example of a wireless network 100, various changes may be made to fig. 1. For example, wireless network 100 may include any number of enbs and any number of UEs in any suitable arrangement. Further, the eNB 101 may communicate directly with any number of UEs and provide these UEs with wireless broadband access to the network 130. Similarly, each

eNB

102 and 103 may communicate directly with the network 130 and provide direct wireless broadband access to the network 130 to the UEs. Further, the

enbs

101, 102 and/or 103 may provide access to other or additional external networks, such as an external telephone network or other types of data networks. Further, the eNB 101-103 may transmit data indicating the status of the network, such as warnings and key performance indicators, to the server. The server may detect and diagnose problems in the network and provide instructions as to which corrective actions to perform based on the detected and diagnosed problems.

Fig. 2a and 2b illustrate example wireless transmit and receive paths according to the present disclosure. In the following description, transmit path 200 may be described as being implemented in an eNB (such as eNB 102), while receive path 250 may be described as being implemented in a UE (such as UE 116). However, it should be understood that the receive path 250 may be implemented in an eNB and the transmit path 200 may be implemented in a UE.

The transmit path 200 includes a channel coding and modulation block 205, a serial-to-parallel (S-to-P) block 210, an Inverse Fast Fourier Transform (IFFT) block 215 of size N, a parallel-to-serial (P-to-S) block 220, an add cyclic prefix block 225, and an up-converter (UC) 230. Receive path 250 includes a down-converter (DC)255, a remove cyclic prefix block 260, a serial-to-parallel (S-to-P) block 265, a size N Fast Fourier Transform (FFT) block 270, a parallel-to-serial (P-to-S) block 275, and a channel decoding and demodulation block 280.

In transmit path 200, a channel coding and modulation block 205 receives a set of information bits, applies a coding, such as Low Density Parity Check (LDPC) coding, and modulates the input bits, such as using Quadrature Phase Shift Keying (QPSK) or Quadrature Amplitude Modulation (QAM), to generate a sequence of frequency domain modulation symbols. The serial-to-parallel block 210 converts (e.g., demultiplexes) the serial modulation symbols into parallel data to generate N parallel symbol streams, where N is the IFFT/FFT size used in the eNB 102 and the UE 116. IFFT block 215 of size N performs an IFFT operation on the N parallel symbol streams to generate a time domain output signal. Parallel-to-serial block 220 converts (e.g., multiplexes) the parallel time-domain output symbols from size N IFFT block 215 to generate a serial time-domain signal. Add cyclic prefix block 225 inserts a cyclic prefix to the time domain signal. Upconverter 230 modulates (e.g., upconverts) the output of add cyclic prefix block 225 to an RF frequency for transmission over a wireless channel. The signal may also be filtered at baseband before conversion to RF frequency.

The transmitted RF signals from the eNB 102 arrive at the UE116 after passing through the wireless channel, and the reverse operations are performed at the UE116 from those at the eNB 102. Downconverter 255 downconverts the received signal to baseband frequency and remove cyclic prefix block 260 removes the cyclic prefix to generate a serial time-domain baseband signal. Serial-to-parallel block 265 converts the time-domain baseband signal to parallel time-domain signals. An FFT block 270 of size N performs an FFT algorithm to generate N parallel frequency domain signals. The parallel-to-serial block 275 converts the parallel frequency domain signals to a sequence of modulated data symbols. Channel decode and demodulation block 280 demodulates and decodes the modulated symbols to recover the original input data stream.

Each of the eNBs 101-103 may implement a transmit path 200 similar to the transmission to the UE 111-116 in the downlink and may implement a receive path 250 similar to the reception from the UE 111-116 in the uplink. Similarly, each of the

UEs

111 and 116 can implement a transmit path 200 for transmission to the

eNB

101 and 103 in the uplink and can implement a receive path 250 for reception from the

eNB

101 and 103 in the downlink.

Each of the components in fig. 2a and 2b may be implemented using hardware only or using a combination of hardware and software/firmware. As a specific example, at least some of the components in fig. 2a and 2b may be implemented in software, while other components may be implemented in configurable hardware or a mixture of software and configurable hardware. For example, FFT block 270 and IFFT block 215 may be implemented as configurable software algorithms, wherein the value of size N may be modified according to an embodiment.

Furthermore, although described as using an FFT and IFFT, this is for illustration only and should not be construed as limiting the scope of the disclosure. Other types of transforms may be used, such as Discrete Fourier Transform (DFT) and Inverse Discrete Fourier Transform (IDFT) functions. It should be understood that the value of the variable N may be any integer (such as 1, 2, 3, 4, etc.) for DFT and IDFT functions, and any integer raised to a power of 2 (such as 1, 2, 4, 8, 16, etc.) for FFT and IFFT functions.

Although fig. 2a and 2b show examples of wireless transmission and reception paths, various changes may be made to fig. 2a and 2 b. For example, the various components in fig. 2a and 2b may be combined, further subdivided, or omitted, and additional components may be added according to particular needs. Furthermore, fig. 2a and 2b are intended to illustrate examples of transmission and reception path types that may be used in a wireless network. Any other suitable architecture may be used to support wireless communications in a wireless network.

Fig. 3a illustrates an example UE116 according to the present disclosure. The embodiment of the UE116 shown in fig. 3a is for illustration only, and the

UE

111 and 115 of fig. 1 may have the same or similar configuration. However, the UE has various different configurations, and fig. 3a does not limit the scope of the present disclosure to any particular implementation of the UE.

In some embodiments, the UE116 receives the warning data and the performance critical indicator data from the eNB 101-103 in order to detect network anomalies, diagnose anomalies, and provide indications to correct detected anomalies. In some embodiments, detecting network anomalies, diagnosing anomalies, and providing an indication to correct detected anomalies may be performed fully or partially automatically.

The UE116 includes an antenna 305, a Radio Frequency (RF) transceiver 310, Transmit (TX) processing circuitry 315, a microphone 320, and Receive (RX) processing circuitry 325. The UE116 also includes a speaker 330, a main processor 340, an input/output (I/O) Interface (IF)345, an input 350, a display 355, and a memory 360. Memory 360 includes a basic Operating System (OS) program 361 and one or more applications 362.

The RF transceiver 310 receives from the antenna 305 an incoming RF signal transmitted by an eNB of the network 100. The RF transceiver 310 downconverts the incoming RF signal to generate an Intermediate Frequency (IF) or baseband signal. The IF or baseband signal is sent to RX processing circuitry 325, which generates a processed baseband signal by filtering, decoding, and/or digitizing the baseband or IF signal. RX processing circuitry 325 transmits the processed baseband signal to speaker 330 (such as for voice data) or to main processor 340 for further processing (e.g., for web browsing data).

TX processing circuitry 315 receives analog or digital voice data from microphone 320 or other outgoing baseband data (such as network data, e-mail, or interactive video game data) from main processor 340. TX processing circuitry 315 encodes, multiplexes, and/or digitizes the outgoing baseband data to generate a processed baseband or intermediate frequency signal. RF transceiver 310 receives the outgoing processed baseband or IF signal from TX processing circuitry 315 and upconverts the baseband or IF signal to an RF signal for transmission over antenna 305.

Main processor 340 may include one or more processors or other processing devices and executes basic OS programs 361 stored in memory 360 to control overall operation of UE 116. For example, main processor 340 may control the reception of forward channel signals and the transmission of reverse channel signals by RF transceiver 310, RX processing circuitry 325, and TX processing circuitry 315 in accordance with well-known principles. In some embodiments, main processor 340 includes at least one microprocessor or microcontroller.

Main processor 340 is also capable of executing other processes and programs resident in memory 360, such as the operations described in embodiments of the present disclosure for root cause analysis and automation using machine learning. Main processor 340 may move data into or out of memory 360 as execution proceeds. In some embodiments, main processor 340 is configured to execute applications 362 based on OS programs 361 or in response to signals received from an eNB or operator. Main processor 340 is also coupled to I/O interface 345, which provides UE116 with the ability to connect to other devices, such as laptop computers and handheld computers. I/O interface 345 is the communication path between these accessories and main processor 340.

Main processor 340 is also coupled to input 350 and display 355. An operator of the UE116 may use a keypad, such as the input 350, to enter data into the UE 116. Display 355 may be a liquid crystal display or other display capable of presenting text and/or at least limited graphics (such as from a website).

Memory 360 is coupled to main processor 340. A portion of memory 360 may include Random Access Memory (RAM) and another portion of memory 360 may include flash memory or other Read Only Memory (ROM). Although fig. 3a shows one example of the UE116, various changes may be made to fig. 3 a. For example, the various components in FIG. 3a may be combined, further subdivided, or omitted, and additional components may be added according to particular needs. As a particular example, main processor 340 may be divided into multiple processors, such as one or more Central Processing Units (CPUs) and one or more Graphics Processing Units (GPUs). Further, while fig. 3a shows the UE116 configured as a mobile phone or smartphone, the UE may be configured to operate as other types of mobile or fixed devices.

Fig. 3b illustrates an example eNB 102 in accordance with this disclosure. The embodiment of eNB 102 shown in fig. 3b is for illustration only, and the other enbs of fig. 1 may have the same or similar configurations. However, the eNB has various different configurations, and fig. 3B does not limit the scope of the present disclosure to any particular implementation of the eNB. Note that eNB 101 and eNB 103 may include the same or similar structure as eNB 102.

As shown in FIG. 3b, the eNB 102 includes multiple antennas 370a-370n, multiple RF transceivers 372a-372n, Transmit (TX) processing circuitry 374, and Receive (RX) processing circuitry 376. In some embodiments, one or more of the plurality of antennas 370a-370n comprises a 2D antenna array. The eNB 102 also includes a controller/processor 378, memory 380, and a backhaul or network interface 382.

The RF transceivers 372a-372n receive incoming RF signals, such as signals transmitted by UEs or other enbs, from the antennas 370a-370 n. RF transceivers 372a-372n downconvert incoming RF signals to generate IF or baseband signals. The IF or baseband signal is sent to RX processing circuitry 376, which generates a processed baseband signal by filtering, decoding, and/or digitizing the baseband or IF signal. The RX processing circuitry 376 transmits the processed baseband signals to the controller/processor 378 for further processing.

TX processing circuitry 374 receives analog or digital data (such as voice data, network data, e-mail, or interactive video game data) from controller/processor 378. TX processing circuitry 374 encodes, multiplexes, and/or digitizes the outgoing baseband data to generate a processed baseband or IF signal. RF transceivers 372a-372n receive the outgoing processed baseband or IF signals from TX processing circuitry 374 and upconvert the baseband or IF signals to RF signals for transmission over antennas 370a-370 n.

Controller/processor 378 may include one or more processors or other processing devices that control overall operation of eNB 102. For example, the controller/processor 378 may control the reception of forward channel signals and the transmission of reverse channel signals by the RF transceivers 372a-372n, RX processing circuitry 376, and TX processing circuitry 374 according to well-known principles. The controller/processor 378 may also support additional functions, such as higher-level wireless communication functions. For example, the controller/processor 378 may perform a Blind Interference Sensing (BIS) process, such as that performed by a BIS algorithm, and decode the received signal minus the interfering signal. The controller/processor 378 may support any of a variety of other functions in the eNB 102. In some embodiments, controller/processor 378 includes at least one microprocessor or microcontroller.

Controller/processor 378 is also capable of executing programs and other processes resident in memory 380, such as a base OS. The controller/processor 378 can also support root cause analysis and automation using machine learning as described in embodiments of the present disclosure. In some embodiments, the controller/processor 378 supports communication between entities, such as a web RTC. Controller/processor 378 may move data into and out of memory 380 as execution proceeds. The memory 380 stores various Artificial Intelligence (AI) algorithms for estimating the UE speed and training data sets for training the various AI algorithms.

Controller/processor 378 is also coupled to a backhaul or network interface 382. The backhaul or network interface 382 allows the eNB 102 to communicate with other devices or systems over a backhaul connection or over a network. Backhaul or network interface 382 may support communication via any suitable wired or wireless connection(s). For example, when eNB 102 is implemented as part of a cellular communication system (such as a 5G, LTE or LTE-a enabled system), backhaul or network interface 382 may allow eNB 102 to communicate with other enbs over a wired or wireless backhaul connection. When eNB 102 is implemented as an access point, backhaul or network interface 382 may allow eNB 102 to communicate over a wired or wireless local area network or over a wired or wireless connection to a larger network (e.g., the internet). Backhaul or network interface 382 includes any suitable structure that supports communication over a wired or wireless connection, such as an ethernet or RF transceiver.

In certain embodiments, the eNB 102 transmits the warning data and the key performance indicator data to a server via the backhaul or network interface 382 to detect network anomalies, diagnose the anomalies, and provide an indication to correct the detected anomalies. In some embodiments, detecting network anomalies, diagnosing anomalies, and providing an indication to correct detected anomalies may be performed fully or partially automatically.

Although fig. 3b illustrates one example of an eNB 102, various changes may be made to fig. 3 b. For example, eNB 102 may include any number of each of the components shown in fig. 3 b. As a particular example, the access point may include multiple interfaces 382, and the controller/processor 378 may support routing functions to route data between different network addresses. As another particular example, while shown as including a single instance of TX processing circuitry 374 and a single instance of RX processing circuitry 376, eNB 102 may include multiple instances of each (such as one for each RF transceiver).

In certain embodiments of the present disclosure, the BS provides various data sources designated as Key Performance Indicators (KPIs) to an Operations Support System (OSS). The OSS may be part of a server associated with the backhaul or network interface 382 of fig. 3 b. KPIs may be sent periodically, such as every fifteen minutes or another preset time interval. KPIs may include Performance Management (PM) counters. The PM data reflects the state and behavior of the system. For convenience, the terms "KPI data" and "PM data" are used interchangeably in this patent document to refer to information that is periodically transmitted from an eNB. A subset of these data is called the Key Quality Indicator (KQI). The KQI provides an aggregate index reflecting service accessibility, service retainability, service availability, quality of service, and service mobility levels. In addition to PM data, the BS reports Fault Management (FM) data or alerts that are triggered in response to one or more unexpected events at the BS, such as CPU overload, memory overload, DSP reboot, MME failure, and the like. For convenience, the terms "FM data" and "alert data" are used interchangeably in this patent document to refer to transmitted information indicative of an alert.

Troubleshooting is triggered in response to detecting that one or more KQIs fall outside of a threshold or nominal value. The troubleshooting process involves manually or automatically reasoning steps for inferring root cause explanations for the lowering of KQI. The root cause is obtained by detecting and diagnosing KPI anomalies, thereby providing fine-grained causal information for KQI anomalies. Other sources may be included during this process, such as call subscriber tracking, trouble tickets, and customer complaint information.

For example, when the quality of service KQI indicates an IP throughput degradation anomaly, the likely root cause may be low traffic demand or high radio frequency interference. Once the Root Cause Analysis (RCA) of the anomaly is complete, the recovery steps may range from simply resetting the BS, or altering operation and maintenance (OAM) parameters (e.g., transmission power, electrical tilt) at the BS.

Manual troubleshooting requires human domain experts to participate in each RCA step, including problem detection, diagnosis, and problem recovery. Since each BS reports thousands of KPIs during a single reporting interval (which may occur every 15 minutes or another preset interval), which occurs periodically at different preset time intervals, it is not a trivial matter for a human expert to process large amounts of data. Additional costs are incurred to reproduce the problem (through drive testing), test different solutions, and verify that the final solution repairs the underlying problem.

Fig. 4a illustrates an RCA framework 400 for using machine learning, according to various embodiments of the present disclosure. RCA framework 400 describes the process of applying machine learning to anomaly detection and root cause analysis. RCA framework 400 may be performed by an electronic device, such as a server associated with one or more of the enbs of fig. 1 and 3b or a UE, such as one of the UEs of fig. 1 and 3 a. For example, RCA framework 400 may be included in UE116 of fig. 3 a. Also for example, RCA framework 400 may be included in a server (with similar components UE116 of fig. 3 a) that receives data (such as KPIs and alert data) over backhaul or network interface 382 of fig. 3 b. RCA framework 400 receives data, detects anomalies from the data, and finds the root cause of the detected anomalies, and then performs corrective action. The embodiment shown in fig. 4 is for illustration only. Other embodiments may be used without departing from the scope of this disclosure.

RCA framework 400 is a rule-based framework that derives both rules and utilizes derived rules. The rule is in the form of the following formula (1). The term on the left side of equation (1) is the antecedent, while the term on the right side is the consequence. For example, KPIs are a precursor cause, while KQI are a consequence. That is, equation (1) describes the causal relationship between KPI and KQI.

Formula (1)

RCA framework 400 receives data from various sources, such as source 402. The source 402 may be an eNB, such as

eNB

101, 102 or 103 of FIG. 1, or a UE, such as

UE

111 and 116. The source 402 may also include a core network and a Radio Access Network (RAN). The received data may include KPI data and alerts. KPI data is received periodically at predetermined time intervals (e.g., every 15 minutes), while alerts are events generated at the eNB in response to hardware or software failures.

The OSS 404 receives data from various sources 402. OSS 404 may include one or more information repositories for storing alert data (such as FM data), KPIs including PMs, and the like. KPIs and alerts indicate overall network health. KPIs may take the form of counters (e.g., number of RRC connection attempts) or measurements (metrics), such as average IP throughput per eNB over the last reporting interval. For example, real-time KPIs may come from various enbs and arrive at periodic Reporting Intervals (RIs). An alert is an event indicating a problem or condition caused by a hardware or software process. This information is used to ascertain whether there is an anomaly or failure in the E-UTRA network, determine the root cause, and perform corrective actions.

In certain embodiments, the OSS 404 pre-processes the KPIs and the warning data. In other embodiments, the pre-processing of the KPIs and warning data is performed outside the OSS 404. Preprocessing involves manipulating different information sources from the eNB. Since a large amount of data is received, the data may be in a different or different format to perform the preprocessing. Thus, the preprocessing modifies the data to make it available in the RCA.

Preprocessing the KPIs, PM data, and warning data enables the data to be further processed by the batch processing layer 406 and the speed processing layer 408. Preprocessing involves populating missing values with special entries, such as null values (NaN), to ensure that they are not accidentally included during further downstream processing. Preprocessing can include discretizing KPI data (including PM data), deriving a composite PM metric, and generating a KPI hierarchy. The output of the preprocessed data produces data that can be used to detect anomalies and infer corresponding root causes of the detected anomalies. The pre-treatment is discussed in FIGS. 5a-5h below.

The batch layer 406 and the speed processing layer 408 receive data. The batch layer 406 generates rules for RCA from the historical data based on the detected anomalies. The rules identify anomalies from the data and suggest the cause of the anomalies. The rules may also present one or more remedial actions that will resolve the detected anomaly. The speed processing layer 408 uses the rules generated for RCA to detect anomalies from the real-time data and resolve the detected anomalies.

An anomaly is a symptom indicating that one or more quality of service metrics of the operator network are out of their normal range and needs to be troubleshooting to resolve the underlying cause. Anomaly detection involves detecting anomalies or deviations of one or more of these KQI values from their normal operating range. The following equation (2) describes the conditions for identifying an abnormality. As shown in equation (2), an anomaly is identified when KQI is less than a threshold. After detecting an anomaly, RCA framework 400 then discovers the underlying root cause of the detected anomaly and then performs corrective action based on the derived rules. That is, the abnormality is a symptom, and RCA is performed to identify the cause of the symptom, and thereby determine a remedy for the detected abnormality so that the operator network is restored to normal.

Formula (2)

KQI≤T

The batch layer 406 receives data from the OSS 404 and saves the data in historical data 410. The historical data 410 includes previously processed KPIs and alerts. Previously processed KPIs and alerts are saved in historical data 410, and machine learning uses the historical data to identify anomalies in the data and generate rules (or conditions) to easily detect anomalies in real time (based on causal relationships) and to provide an understanding of the root cause of the problem. The rules may also provide the steps necessary to resolve the identified anomalies. In contrast, the speed processing layer 408 checks newly received data (such as real-time data) by applying rules generated in the batch processing layer 406 to detect anomalies.

Anomaly detectors

412a and 412b identify one or more anomalies from KPIs and warning data that indicate that the operator network is out of its normal range.

Anomaly detectors

412a and 412b detect anomalies in one or more KQI classes associated with accessibility, retainability, availability, integrity (such as quality), and mobility. Note that the anomaly detector 412a detects anomalies from the historical data 410, while the anomaly detector 412b detects anomalies from the real-time data. For example, anomaly detector 412a looks at previous data (via historical data 410) to identify sample patterns corresponding to operator network operations that are outside of its normal range. Anomaly detector 412b looks at the real-time data to identify patterns corresponding to operator network operations that are outside of their normal range.

In certain embodiments, there are multiple KQI classes corresponding to the anomaly. The KQI class of accessibility provides the end user with a probability of being provided with an E-UTRAN radio access bearer (E-RAB) upon request. For example, service accessibility indicates how easy a connection is to be established with a cell.

The maintainable KQI class provides a measure of the frequency with which end users abnormally lose the E-RAB during the time the E-RAB is used. For example, sustainability indicates the ability to maintain a connection.

The KQI class of availability provides a measure of the percentage of time available for a cell. For example, the availability indicates whether the system cannot detect the presence of a cell.

The KQI class of integrity provides a measure of IP throughput in both the download and upload directions. For example, integrity indicates user quality of experience and may be based on download speed.

The KQI class of mobility measures the behavior of the E-UTRAN mobility function. For example, mobility indicates whether the user experiences frequent call drops.

The KQI class of traffic measures the proportion of resource utilization at the cell during this RI. The greater amount of resource utilization means that the cell is over utilized and may degrade the quality of service within the cell.

In some embodiments, the inputs of the

anomaly detectors

412a and 412b are time stamped. The input may be provided in a streaming manner (such as one sample per RI per eNB) or in a batch manner (such as corresponding to historical data for one or more days/one or more months in a year). For each KQI Y e Y ═ accessibility, maintainability, availability, integrity, mobility, if the KQI value of the data is AN abnormal sample, then the

abnormality detectors

412a and 412b output a function AN whose value is non-zero_y. For each y, expressed as KPI_j(y) for a KPI where y is at level j in the KPI hierarchy. Table 1 below describes the different KPIs in the tree hierarchy that can be used for anomaly detection for a particular KQI class.

TABLE 1

Embodiments of the present disclosure provide three different anomaly detection methods for

anomaly detectors

412a and 412 b. The detection methods include (i) extreme outliers, (ii) standard deviations, and (iii) Median Absolute Deviations (MAD).

With respect to the extreme outlier method, the

anomaly detectors

412a and 412b identify an anomaly if at least one KPI within a KPI hierarchy (such as KPI tree hierarchy 520 of fig. 5 b) is below a threshold of some KQI. The threshold may be set as a fixed value or may be calculated based on hourly KPI statistics.

With respect to the standard deviation approach, the

anomaly detectors

412a and 412b identify an anomaly if one or more KPIs within a KPI hierarchy (such as the KPI tree hierarchy 520 of FIG. 5 b) are below k standard deviations of the mean of that KPI for a certain KQI. The standard deviation method is described in the following formula (3).

Formula (3)

With respect to MADs,

anomaly detectors

412a and 412b identify anomalies based on a median value of a set comprising the difference between each sample point and the sample median value. For example, if the KPI value is less than or equal to

An anomaly is detected in which, among other things,

indicating KPI during hourly intervals to which RI belongs_jThe median value of (a). The MAD_jIs a function of the median of the absolute deviation of the KPI from its hourly median KPI. Parameter t_jReferred to as an anomaly detection threshold, for controlling the number of anomalies detected for KPIj. The MAD method for anomaly detection is described in the following equation (4).

Formula (4)

For example, to identify for an input as KPI sequence Y and an anomaly detection threshold t

The first step is to extract t X ═ y: pr [ Y ≤ Y]Less than or equal to 0.2 }. It should be noted that values of Y higher than the 20 th quantile are not qualified as abnormal samples. The second step is to solve the MAD, where MAD ═ c × Median [ | X-Median [ X [ ]]|]. To identify C, the standard distribution of identified X's is based on a calculation

So that c is 1/z, where z is such that

And step 3: for each X located in X, the MAD method defines the definition of M-score ═ (X)_i-Median[X]) and/MAD. The M-score is compared to a threshold value such that if the M-score is less than the threshold value t, an anomaly is identified.

Machine learning training model 414 generates rules for identifying detected anomalies from historical data 410 through anomaly detector 412 a. Machine learning training model 414 also generates parameters for identifying anomalies from historical data 410. The rules and parameters may also be for performing RCA on the detected anomaly and providing one or more remedial actions to address the detected anomaly. Equation (5) below describes an example rule used in RCA. The generation of rules and parameters by machine learning training model 414 is discussed in more detail below.

Formula (5)

RCA framework 400 is also able to evaluate the quality of each derived rule. If S represents a collection of historical data 410, then

Representing a set of sample anomalies that contain a particular KQI. In a similar manner to that described above,

representing a set of relationships whose entries satisfy rule j. Equations (6), (7) and (8) describe various measures of the quality of each derived rule. For example, the support (support) of equation (6) is the relative proportion of the pre-reason for the KPI sample to comply with the rules. For example, a higher confidence score for equation (7) indicates that the resulting rule is subject to more and more samples with KQI anomalies than normal samples. The confidence score of equation (7), also referred to as precision or a posterior value, refers to the score of the abnormal sample that follows rule j. This corresponds to the conditional probability Pr (s. epsilon. A)_KQIS ∈ F). Higher confidence rules better distinguish anomalous samples from normal samples. Similarly, the higher the hit rate of equation (8), the higher the indication that the derived rule can be applied to most KQI outlier samples. The hit rate, also referred to as a recall or likelihood, of equation (8) corresponds to the score of the KQI anomaly associated with rule j. This corresponds to the conditional probability Pr (s ∈ F | s ∈ A)_KQI)。

Formula (6)

Formula (7)

Formula (8)

Once generated by the machine learning training model 414, the rules and parameters that meet the design goals, such as minimum support and confidence (which may be any preset threshold, e.g., 80%), are saved in the information repository 418. Information store 418 may be similar to memory 360 of FIG. 3a or memory 380 of FIG. 3 b. Information store 418 may store one or more rules for identifying anomalies and root causes of the anomalies. The rules stored in the information store 418 may be used as fingerprints to check for matching rule(s), e.g., to check whether KPIs of an exception sample comply with the rules. If a rule match is found, an interpretation may be provided at the user interface to interpret the physical meaning of the rule so that a human operator may infer the underlying root cause of the occurrence of the anomaly.

In certain embodiments, information store 418 also includes a knowledge base derived using human domain expertise. In certain embodiments, the data within information repository 418 is applied to the PM to prepare the data in a format that is modifiable for performing RCA and corrective actions.

Once the rules are generated that suggest a performance degradation with high confidence, RCA framework 400 identifies which of the generated rules correspond to root causes and which of the generated rules correspond to associated symptoms. The following three examples describe identifying which rules correspond to root causes of KPI degradation.

In some embodiments, a chained rule R ∈ R is a set of rules for certain KPIs, considered as

With a high confidence. Linking the rules together is described in more detail below with respect to fig. 5 c. Thereafter, relationships between the base rules in r using the directed graph are established. The set of rules in r is first divided into ground rules and non-ground rules, and all ground rules are treated as nodes in the graph. Next, if the rule

Confidence of (1) P (KPI)_j＝1-|KPI_i1) is above the threshold, all non-ground rules are satisfied and considered in a pair of ground rules (KPI)_i＝1，KPI_j1) with directed edges added between them. Thus, directed edges indicate causal relationships between nodes, i.e. KPIs _j1 is KPI_iSymptom of 1, or KPI _i1 results in KPI_jOccurrence of 1. Nodes without incoming edges indicate that they are not caused by any other underlying rule and are therefore defined as the root cause of the chain rule, while the rest are identified as symptoms.

In another embodiment, when constructing a KPI hierarchical tree (as described in fig. 5c below), the processor may identify leaf nodes of chained rules as root causes and other rules as incidental symptoms.

In yet another embodiment, all generated rules are considered nodes in the directed graph. Edges between the nodes are then added using a method similar to that described above (in the first example). If a bidirectional edge exists between any two nodes, the processor folds it into one node. Nodes without incoming edges are defined as the root cause of the KQI. All possible paths from these nodes to the KQI node are now a new set of chained/compound rules, where the nodes on the path are the symptoms.

With respect to the speed processing layer 408, after the anomaly detector 412b identifies anomalies in real-time, the RCA 420 uses information from the information repository 418 to identify the cause of the detected anomalies. For example, RCA 420 uses machine-learned rules (from machine-learned training models 414 above confidence and hit rate thresholds) to identify the underlying root cause of the detected anomaly.

In some embodiments, after RCA 420 determines the root cause of the detected anomaly, an explanation of the root cause may be displayed on a user interface along with remedial action 422. When remedial action 422 is displayed, it may include recommending action(s) to perform to restore the network to its normal operating state. In other embodiments, remedial action 422 performs the necessary actions to automatically restore the network to its normal operating state. Remedial action 422 may apply corrective action based on the determined root cause that results in a reduction in the KQI of interest.

After determining the root cause and interpretation, the next step of RCA framework 400 is to perform the set of corrective actions (remedial action 422) to correct the underlying cause of the anomaly and restore the network to its normal operational state.

In some embodiments, to determine corrective actions, RCA framework 400 uses a machine learning inference engine (not shown). The machine learning inference engine identifies a correct set of recovery actions based on a set of feature inputs. For example, the machine learning inference engine may be trained using the label data provided from the historical data 410. For example, the machine learning inference engine is trained using job logs, CM data, and the like saved in the historical data 410. The training labels provide the best corrective action c for each feature vector in the training set x.

To generate training for a given job log and CM data, CM data comprising eNB configurations stored across multiple dates is compared between one day and the next, and only changes applicable to configuration parameters are retained. This is done by regularizing the chronological order against the corresponding date when the abnormality is resolved. Thus, marker data is generated that identifies the nature and size of eNB configuration parameter changes, and what the underlying root cause/symptom to apply these changes.

The input to the machine learning inference engine is a feature vector x ═ x₁，x₂，...x_n]. Each entry in the vector is pre-processed PM data, which is either a continuous value or a discrete value. Examples of entries may include KPIs (such as dlschduler mcsratio, ulschduler mcsratio, rsispath 0Avg, rsispath 1Avg, dlresulualblererrans 0, described below). The output of the machine learning inference engine is the corrective action C ∈ { C }₁，C₂，...C_K}. Possible actions include increasing antenna downtilt (to reduce overshoot), decreasing antenna downtilt (to increase RF coverage), adjusting one or more RRC parameters (such as UE power control parameters, RS-to-data energy per resource element), or adjusting one or more scheduling parameters (to increase the priority set for GBR radio bearers).

As described above, preprocessing the PM data and the warning data enables the data to be further processed by the batch processing layer 406 of fig. 4a and the speed processing layer 408 of fig. 4 a. The preprocessing of the PM data is based on (i) discretizing the PM data, (ii) deriving a composite PM metric, and (iii) generating a KPI hierarchy. The output of the processed PM data may be used to detect anomalies and their corresponding inferred root causes. Processing the warning data includes correlating the PM data with the warning data. Fig. 5a-5h depict the pre-processing of PM data and warning data. The embodiment of fig. 5a-5h is for illustration only. Other embodiments may be used without departing from the scope of this disclosure.

The KPI discretization improves the rule mining and processing efficiency of the subsequent RCA. In certain embodiments, continuous value PM data may be used for rule mining and processing of subsequent RCAs, however, discretized PM data is more efficient than using continuous value PM data.

Discretization can be performed based on comparing KPI data to fixed or statistical thresholds. Discretization using statistical thresholds involves first obtaining a statistical distribution of PM data over a particular time interval (e.g., daily, hourly, etc.). The distribution may be obtained at a pre-specified quantile value (e.g., a 10 th quantile, a 70 th quantile, or a 100% quantile). Next, the PM data is binned to determine to which quantile interval the PM data belongs. For example, bin 0 may correspond to PM data located below the 10 th quantile, bin 1 may correspond to PM data located between the 10 th quantile and the 90 th quantile, and bin 2 may correspond to an abnormal value (such as PM data greater than the 90 th quantile). It should be noted that for synthetic PMs, special quantization bins are set aside in the event that the PM is invalid (such as being set to a null value if no measurement is available).

In some embodiments, the threshold for discretizing the PM data of the RCA may be selected according to the threshold employed for classifying the KQI anomaly (e.g., extracting the rule in the form of equation (2) above). For example, if the anomaly detection threshold is selected such that a greater number of samples are declared to experience an anomaly at the KQI of interest. The PM threshold for RCA purposes may then be correspondingly changed so that RCA framework 400 will attempt to find the root cause of a larger number of KQI outlier samples.

Adaptively adjusting the PM threshold for RCA purposes as a function of the threshold for detecting reduced KQI anomalies provides a reasonable compromise between high rule confidence (based on equation (7), which corresponds to the fraction of samples for which the rule applies that are also declared anomalous) and high hit rate (based on equation (8), which corresponds to the fraction of anomalous samples for which the rule applies).

Fewer occurrences of a KQI event are classified as abnormal if the threshold for marking a certain KQI sample as abnormal is chosen more stringent. For example, an anomaly detection algorithm that classifies samples below 1Mbps throughput as anomalous will classify anomalous samples at a lesser throughput than an anomaly detection algorithm that classifies samples below 5 Mbps.

In certain embodiments, RCA framework 400 also performs KPI prediction based on IP throughput using historical data 410. Fig. 4b illustrates a block diagram 430 of traffic prediction based on anomaly detection, in accordance with various embodiments of the present disclosure. Fig. 4c illustrates a block diagram 450 of traffic prediction based on anomaly prediction, in accordance with various embodiments of the present disclosure.

In certain embodiments, RCA framework 400 may automatically predict data at a future time based on a trained predictive model. An exception warning occurs once the difference between the predicted data and the actual incoming data is greater than some particular threshold (rise). For example, a machine learning framework for traffic prediction based on anomaly detection and anomaly prediction, respectively, would improve the efficiency of RCA framework 400.

Machine learning based KPI prediction uses feature engineering (feature engineering) based on historical data. Single-feature methods may be based on IP throughput, while multi-feature methods may be based on various parameters, such as hours of the day, number of active users within a particular geographic area, and so forth. The RCA framework 400 using Long Short Term Memory (LSTM) neural networks or quantile regression forests can detect and predict IP throughput anomalies in the network on a daily, monthly, quarterly, yearly (and various other time intervals). For example, at each time stamp, RCA framework 400 may predict the amount of data that will occur in the future. Upon receiving the actual data, RCA framework 400 compares the amount of predicted data with the actual amount of data. The anomaly detector 412b identifies an anomaly if the difference between the predicted data and the actual data exceeds a threshold.

An example data set for time series traffic prediction is shown in table (2) below. The data set of table 2 describes system record data including different KPIs throughout the network. KPIs take the form of counters (e.g., number of RRC connection attempts) or measurements (metrics), such as average IP throughput per eNB over the last reporting interval. Table 2 gives a description of the features used in anomaly detection and anomaly prediction.

TABLE 2

Block 430 of fig. 4b depicts a machine learning assisted anomaly detection framework based on traffic prediction. The system first collects historical data, a subset of which is used to train a traffic prediction model based on deep learning/machine learning techniques. The historical data used for traffic prediction may be the same as historical data 410 of fig. 4 a. Based on the prediction model, the system predicts data traffic or IP throughput on the uplink or downlink in future timestamps. Note that the term "data" may refer to the amount of uplink or downlink traffic, IP throughput, etc. To evaluate prediction accuracy, the validation data 436 obtains the prediction data 438a and also compares it to actual data that comes at a future time stamp. Since the data set is a time series, an error vector can be generated by comparing the difference between the predicted data vector and the reference true phase data (ground true) vector. Once the error vector is obtained, the RCA framework 400 may calculate statistical parameters of the error, such as empirical mean, variance, standard deviation, etc., and record them in a database.

In a streaming data scenario, at each timestamp, the model predicts the data that will occur in the future timestamp. When the time stamp receives actual data, the predicted data is compared with a reference true phase (actual data). If the difference between them exceeds an absolute or relative threshold, the system declares an exception. The choice of the anomaly detection threshold depends on which anomaly detection method is used.

The input data is a set of data in past timestamps, which may be single or multiple features. The actual data is the data coming in the future time stamp. The training data 432 includes historical data collected by the system for training the predictive model. A predictive model 434 is trained based on a training data set 432. For example, building the predictive model 434 is based on deep learning (such as LSTM) and machine learning methods (quantile regression forest). The validation data set 436 generates an error vector by calculating the difference between the predicted data vector and the actual data vector. The calculation parameters 438b take the error vector from the validation dataset 436 and identify the mean, variance, standard deviation, or other parameters that relate to the statistical distribution of the error vector. The test data set 440 is a data set for real-time testing. The validation dataset 436 includes historical data for validating the predictive model and generating parameters. Detect exception 444 detects whether the actual data is anomalous. This function works differently based on which anomaly detection method the system will select. For the standard deviation method, the detected anomaly 444 detects an anomaly by checking whether the error between the predicted data and the actual data is greater than a given threshold, which is determined by the parameters obtained in the verification. For the Tukey approach, detect anomaly 444 chooses not to use parameters obtained in the validation dataset, but to detect anomalies based only on the predicted data interval and the actual data.

The block diagram 450 of fig. 4c depicts a flow prediction based machine learning assisted anomaly prediction framework. Similar to the anomaly detection based on flow prediction of FIG. 4b, the block diagram 450 of FIG. 4c first obtains a prediction model based on historical data. The block diagram then groups the historical data based on timestamps in the same time period (such as the same hour, or the same hour-minute). For example, the timestamp is at 2:00 to 3 in the afternoon: all past data in the time interval of 00 belong to the same group. Once the grouped data set is constructed, the system calculates parameters, such as mean, standard deviation, for each group. In the anomaly prediction phase, at each timestamp, the system first predicts the data that will come at the future timestamp. The system then predicts whether the predicted flow constitutes an anomaly by comparing the predicted data at that particular time stamp with an expected value or interval determined by statistical information from historical data.

The building packet data set groups historical data (such as historical data 410 of fig. 4 a) based on timestamps of the historical data. For data with the same timestamp in hours and minutes, the system puts them into the same group. The calculation parameters 454 identify the mean, variance, standard deviation, and other statistical parameters for each data set group. The predicted anomaly 456 predicts an anomaly based on data that will come in a future timestamp. The function works differently based on the anomaly prediction method selected by the RCA framework 400. For the standard deviation approach, at each timestamp, the predicted anomaly 456 predicts the anomaly by checking whether the error between the predicted data and the mean of the historical data in the next timestamp group is greater than a given threshold, which is determined by the parameters. For the Tukey method, at each timestamp, the predicted anomaly 456 predicts the anomaly by checking whether the mean of the historical data in the next set of timestamps falls within a range dictated by the prediction interval.

In certain embodiments, KPI prediction based on IP throughput using historical data 410 comprises data pre-processed using moving window averages. Since the raw traffic data arrives periodically at specific intervals (such as 15 minute intervals), including various bursts, the preprocessing step applies a filter to reduce its variance. In certain embodiments, the processor calculates a moving average over the raw data. The moving average is based on an expression

It should be noted that it is preferable that,

is the data averaged over a moving window at timestamp i, x (i) is the raw data at timestamp i, and T_wIs the size of the moving window. Note that applying a moving window does not change output versus input sizeSmall size. If KPI is reported every 15 minutes, T_wSet to 1 hour, which covers four sample points in the window. That is, RCA framework 400 loads the raw data and then takes a moving window average to smooth the network traffic curve.

In certain embodiments, KPI prediction based on IP throughput using historical data 410 comprises data that is pre-processed using data binning. Data binning is mainly used for deep learning and multi-feature input. The features used in LSTM prediction other than downlink/uplink throughput are: UEActiveDLAvg/UEActiveULAvg, TotPrbdDLAvg/TotPrrbULAvg, and a timestamp. The data is then quantized into a plurality of bins.

For example, for the number of active users in dl (ueactivelavg) and ul (ueactivelavg), the average number of UEs in downlink/uplink is binned into four groups. For example, (i) bin 0 corresponds to an average number of UEs between 0 and 5, (ii) bin corresponds to an average number of UEs between 5 and 10, (iii) bin 2 corresponds to an average number of UEs between 10 and 20, and (iv) bin 3 corresponds to an average number of UEs exceeding 20.

As another example, TotPrrbDLAvg/TotPrrbULAvg, the quantized PRB utilization in the downlink/uplink is binned into four bins. For example, (i) the average ratio of PRBs between bin 0 and 0 to 30 corresponds, (ii) the average ratio of PRBs between bin 1 and 30 to 65 corresponds, (iii) the average ratio of PRBs between bin 2 and 65 to 95 corresponds, (iv) the average ratio of PRBs between bin 3 and 95 to 100 corresponds.

As another example, the time stamps may be binned in hours. That is, bin 0 may correspond to hours between 0:00 am and 6:00 am, bin 1 may correspond to hours between 6:00 am and 12:00 am, bin 2 may correspond to hours between 12:00 pm and 6:00 pm, and bin 3 may correspond to hours between 6:00 pm and 12:00 am.

In some embodiments, time series traffic prediction may be based on LSTM and machine learning, where the inputs may be (i) a single feature input, (ii) multiple feature inputs, and (iii) a clustered single feature input. That is, the LSTM predicts throughput data in future timestamps. For example, for a given input x, there may be two LSTM layers, followed by a fully connected (dense) layer to generate a single output o. The dense layer has a single output corresponding to the predicted flow. Since the performance of time series flow prediction depends on the choice of input features, different inputs (such as single feature input, multiple feature inputs, single feature input of a cluster) may change the result.

For a single input feature, only data for the single feature is input to the LSTM to train the predictive model. A single input feature may be expressed as

Wherein x is_p(i) Is the predicted data at time stamp i, x (i-j) is the actual data at time stamp i-j,

is a sequence of size L and L represents how many past timestamps the LSTM uses for prediction. In some embodiments, the input is a throughput sample over a past period.

For multiple feature inputs, the LSTM takes as input data clipped from multiple features. The plurality of feature inputs may be expressed as

Wherein x is_p(i) Is the predicted data at time stamp i, x^(K)(i-j) is the actual data of the kth feature at time stamp i-j, K is the total number of features as input,

is a data sequence in the kth feature of size L, and L represents the number of past timestamps used for prediction. As in the previous section, here LSTM represents an overall deep learning network with a dense tier size of 1.

It should be noted that the single input feature and the multiple feature inputs are from a single cell. The single input feature of the cluster performs traffic prediction for multiple cells simultaneously based on past data from all of these cells. That is to say thatA single input feature of a cluster may use data not only in the time domain, but also across the spatial domain (across multiple enbs and cells). To balance the computational cost and the predicted coverage of the cell, a single input feature of the cluster divides the entire network into multiple clusters, and the prediction model is trained per cluster. Thus, a single input feature of a cluster may be represented as

Wherein the content of the first and second substances,

is the predicted result for all N cells at the ith timestamp, x, in a given cluster_d(j) Is the actual data in cell d at timestamp j,

is the sequence of past data in the d cell, L represents the number of past time stamps, and N is the number of cells in the cluster. In some embodiments, a cluster may be a single cell or a group of cells within a particular geographic area. In other embodiments, clusters of cells may be grouped according to their cell coverage (using downlink transmission power, capabilities such as whether the cell is a macro cell or a pico cell), operating carrier frequency, radio access technology (e.g., 4G LTE, 5G new radio), and so forth. The above-described method may also include additional embodiments, such as multiple feature inputs available from each cell.

In some embodiments, the time series traffic prediction may be based on traffic prediction based on a quantile regression forest. The quantile regression forest is a modified version of the random forest algorithm and is used for high-dimensional regression and classification. Fig. 4d illustrates a framework 460 of an example quantile regression forest according to various embodiments of the present disclosure. Input 462 is an input feature. For example, for a given input X ═ X, the conditional distribution of Y is less than Y, F (Y | X ═ X) ═ P (Y ≦ Y | X ═ X), where Y is the dependent variable corresponding to the predictor variable X. The quantile regression forest approximates the condition distribution to F (Y | X ═ X) ═ P (Y ≦ Y | X ═ X) ═ E (1)_{Y≤y})|X＝x). The estimate of the distribution function is obtained by taking a weighted average of the leaves (such as

leaves

464a, 464b, and 464c) into which x falls. Since the alpha quantile is calculated as Q_α(x) Inf { y: p (Y ≦ Y | X ≧ X) ≧ α }, which may identify the prediction interval as I for a given input X and probability α_α(x)＝[Q_(1-α)/2(x)，Q_α(x)]. It is noted that only past throughput samples are used as input in a single cell, and the output is the predicted interval on throughput for the next timestamp. This can be expressed as

Wherein, I_α(x) Indicating the prediction interval at timestamp i, x (i-j) is the actual data at timestamp i-j,

is a data sequence of size L and L represents the number of past timestamps used for making the prediction. QRF represents the quantile regression forest.

As described above for fig. 4b, the flow prediction for anomaly detection may be based on standard deviation based anomaly detection, or tukey based anomaly detection. For standard deviation based anomaly detection, after the flow prediction model is built, the first step includes using the validation dataset 436 to measure the predictive capabilities of the model and calculate the prediction error. The prediction error is denoted err_val＝(x_p(1)-x(1)，x_p(2)-x(2)，...，x_p(T) -x (T)), wherein, err_valIs the error vector on the validation data set, x (i) is the actual data at time stamp i, x_p(i) Is the prediction data at timestamp i and T is the size of the prediction vector. Based on error vector err_valAnd the second step includes identifying the standard deviation. Block 4389b identifies a parameter such as a standard deviation. The standard deviation is expressed as std_val＝STD(err_val，T_w) Wherein, std_valIs the standard deviation on the error vector and STD represents the function that calculates the standard deviation. Here, T_wThe time window over which the data is used to calculate the standard deviation. For example, if T_wWith 8 and T24, there are 3 standard deviation values corresponding to their time ranges. Block 430 may store these parameters for future application anomaly detection. At run-time, at each timestamp i-1, the traffic prediction method automatically predicts the data that will come at the next timestamp i based on past data samples. For example, by comparing the predicted data with the actual data when | x is satisfied_p(i)-x(i)|≥k*std_valAn exception exists.

For tukey-based anomaly detection, a prediction interval is needed before applying anomaly detection. First, a prediction interval in a test data set is obtained. The interval is then divided into four quarters. Anomalies are detected based on these quarters. For example, when the traffic prediction method predicts data that will appear at timestamp I, the prediction interval I is obtained by running a quantile regression forest_α(x) In that respect For a given prediction interval I_α(x) It is divided into four quarters, where its boundaries are defined by Q1, Q2, and Q3. For example, if I_α(x)＝[1，3]Then Q is₁＝1.5，Q₂＝2，Q₃2.5. When the actual data x (I) arrives at the time stamp I, if x (I) is satisfied, < Q₁-3|Q₃-Q₁I or x (i) > Q₃+3|Q₃-Q₁If, then a serious anomaly may occur. When the actual data x (I) arrives at timestamp I, if Q is satisfied₁-3|Q₃-Q₁|＜x(i)＜Q₁-1.5|Q₃-Q₁I or Q₃+1.5|Q₃-Q₁|＜x(i)＜Q₃+3|Q₃-Q1|, then there may be a possible anomaly.

As described above for fig. 4c, the flow prediction for anomaly prediction may be based on standard deviation based anomaly prediction, or tukey based anomaly prediction. For standard deviation based anomaly prediction, first, historical data is grouped by time stamp, then parameters for each group are identified, and finally predicted anomalies 456 may predict anomalies from test data set 440. For example, after collecting historical data, the first step of block 450 groups the data based on their timestamps. That is to say, forAt the same data in hours and minutes, the system groups them into one group. The group can be expressed as

Wherein the content of the first and second substances,

is a set of timestamps of size T with the same hour-minute value k, and x_kAre the kth data set that contains all of them. After the datasets are partitioned into groups of time stamps, in a second step, the calculation parameters 438a identify the mean and standard deviation across each group. Mean is expressed as mean_k＝MEAN(x_k) And for each group the standard deviation is expressed as std_k＝STD(x_k，T_w). Please note that T_wThe same time window over which the data is used to calculate the STD. The system stores the mean and STD for each group of time stamps. After predicting the data that will appear at the next timestamp i, the third step of block 450 makes an anomaly prediction at the same timestamp based on the predicted data. For example, when | x_p(i)-mean_k|≥k*std_kThe anomaly at timestamp I is high.

For tukey-based anomaly prediction, the historical data is first grouped by time stamp and the parameters associated with each group are identified. Next, the prediction interval of the test data set is obtained and divided into four quarters. Finally, predictive exceptions 456 predict exceptions. For example, block 450 groups data by time stamp and calculates an average over each data group. Next, a prediction interval I is obtained at a time stamp I_α(x) And divides the prediction interval into four quarters. Predicted anomaly 456 at mean_k＜Q₁-3|Q₃-Q₁L or mean_k＞Q₃+3|Q₃-Q₁A high probability of predicting an abnormality.

In certain embodiments, after performing anomaly detection or anomaly prediction based traffic prediction, RCA framework 400 filters the results before reporting the results to a system administrator. Filtering reduces the probability of false positives. For example, the post-processing filter may be a throughput threshold, such that if the sample value (i) is below the threshold and the anomaly detection method declares the sample as an anomalous sample, the system issues an anomaly warning or (ii) exceeds the threshold, the filter declares the sample as not anomalous (whether or not the machine learning model declares the sample as anomalous). Different thresholds may be applied for different days, different times of day, different seasonal throughput averages, etc.

A block diagram 500, shown in fig. 5a, depicts discretized KPI data. Block diagram 500 may be performed by a server associated with RCA framework 400 of fig. 4A. Block diagram 500 depicts a function or database that includes a mapping that associates anomaly detection thresholds with quantization thresholds for discretizing KPI data. Further, the block diagram 500 depicts a function or database that includes a mapping that correlates geographic areas, times of day, operators, and quantization thresholds for discretizing KPI data. For example, for a threshold of 1Mbps, the discretized threshold may be based on the 90 percentile of the hourly KPI values. As another example, for a threshold of 0.1Mbps, the discretized threshold may be based on the 99 percentile of the hourly KPI values. As another example, for a threshold of 0.01Mbps, the discretized threshold may be based on the 99.9 percentile of the hourly KPI values. Note that similar rules may be formulated based on associations of day, geographic location, etc.

For example, threshold parameters for anomaly detection are selected (block 502). The selected threshold parameters (block 502) are received by an anomaly detection block 504. The anomaly detection block 504 is similar to the

anomaly detectors

412a and 412b of FIG. 4 a. Anomaly detection block 504 detects anomalies based on the selected threshold parameters. In block 508, a KQI is obtained from the KPI data 506. Anomaly detection block 504 receives the acquired KQI (KQI of block 508) and a threshold to generate a KQI anomaly sample list 510. Block 512 discretizes the KPI based on the KPI data 506 and a function 514, where the function 514 is based on the selected threshold parameters (block 502). The discrete PM data 516 is used for RCA 518. RCA 518 is similar to RCA 420 of FIG. 4 a.

In other embodiments, percentiles or absolute thresholds for discretized PM data are selected for RCAs in an iterative manner. For example, one or more thresholds for the PM data may be set as a starting value to evaluate the corresponding confidence score based on equation (7) and the hit rate based on equation (8). If none of the confidence score, the hit rate, or both the confidence score and the hit rate satisfy the design constraints, the threshold is iteratively adjusted accordingly until the design constraints are satisfied. In yet another embodiment, the iterative process of determining the PM threshold for RCA is repeatedly performed each time the threshold for detecting KQI anomalies is modified.

Note that for a given network, there are multiple types of geographic regions that will have different underlying patterns and statistics. For example, the flow demand in the morning and evening of a residential area can be high, while the flow demand at noon during the day people work can be reduced. In contrast, the traffic demand in an industrial or commercial area will be higher at approximately weekday business hours (such as monday through friday at 9: 00 am to 6:00 pm), while the demand will be less at times outside of these hours. Thus, patterns associated with geographic location (residential versus industrial/commercial) and time constraints can be considered when setting thresholds for anomaly detection or parameter discretization. For example, if hourly statistics and quantiles are calculated in conjunction with commercial and residential areas, their different usage patterns may change the combined distribution in a way that makes outliers more difficult to detect.

To set the threshold to correspond to geographical location and time constraints, the operator (user) may manually mark the area where each eNB is located. However, the task of manually marking each eNB is very time consuming and requires manual input each time a new site is added to the network. Alternatively, tagging each eNB may be performed automatically. For example, the root cause analysis framework 400 of FIG. 4a may identify patterns based on statistics on a one-day, one-week, or even monthly level. Based on the identified patterns, stations that follow similar temporal statistics may be grouped together. Identifying patterns may be performed by clustering or classification techniques. Once a site is classified based on its statistical pattern, it can be grouped with sites in the same class in order to compute long-term statistics for anomaly detection and parameter discretization.

Based on the speed at which the group's statistics change, the window size used to compute the statistics for anomaly detection and KPI discretization can be adjusted accordingly. For example, if residential areas typically observe the same traffic demand within 6 hours, the KPI discretization algorithm can use a larger time window to calculate quantile levels. The use of a large time window with constant flow requirements provides the ability to achieve the quantile level of accuracy that may be required for anomaly detection. For example, to obtain a 0.999 quantile, 1000 samples are required. If another level of precision is required for the discretization algorithm, another order of magnitude is required for the sample count. If the local site group is small, there may not be enough samples to achieve these required accuracies in only one hour. By adjusting the window length based on the site region classification, the algorithm can ensure that it has enough samples to achieve the required level of accuracy.

The mapping between geographic location, time of day, and KPI discretization threshold can be stored in a database. This mapping may be used during root cause analysis by root cause analysis framework 400.

Table 3 below shows example KPIs that may be used during RCA for uplink integrity associated with KQI anomalies.

TABLE 3

Table 4 below shows example KPIs that may be used during RCA for downlink integrity associated with KQI anomalies.

TABLE 4

Preprocessing the PM data further includes deriving synthetic PM data. The integrated PM data is derived by combining different KPIs. Combining different KPIs provides specific insight into the raw PM data that is not yet available. The following examples describe various comprehensive PMs that can be used to pinpoint a root cause of an IP throughput-related anomaly.

The first example integrated PM is referred to as high resource utilization. The CPU load, memory load, and disk utilization per rack and slot of the processing unit at the eNB may provide insight into the overall health of the network where resource utilization is high. For example, high CPU load may limit the eNB's ability to provide downlink and uplink traffic to users, resulting in a reduction in perceived quality of service due to reduced IP throughput and increased IP latency at the end user.

The second example integrated PM is referred to as an uplink power headroom ratio. The uplink power headroom ratio specifies a ratio of a count of power headroom reports during the latest RI when the eNB receives a Power Headroom Report (PHR) with an index equal to or lower than 18 to the total number of PHR (between 0 and 63) received in the entire reporting range. PHR is a quantitative measure of the available power headroom (PH, measured in dB), defined as the maximum transmit power P of the UE_c，maxWith its instantaneous transmission power (downlink path loss PL estimated therefrom, its nominal PUSCH transmission power P_O，PUSCHPartial path loss parameter alpha and number of allocated resource blocks M_PUSCH，RBTo determine) the difference between. Equation (9) describes the power headroom.

Formula (9)

PH＝P_c，max-[P_0，PUSCH+α·PL+10*log₁₀M_PUSCH，RB]

In equation (9) above, the PHR index j indicates that the UE's power headroom is within the interval j-23dB ≦ PH < j-22 dB. For example, index 0 indicates that the power headroom is between-23 dB to-22 dB, while index 63 indicates that PH exceeds 40 dB.

Equation (10) describes the uplink power headroom ratio. As described in equation (10), a ratio near 1 indicates that a large number of UEs are power limited (i.e., maximum transmission power is used for PUSCH transmission), indicating that the cell of interest has uplink coverage issues.

Formula (10)

A third example integrated PM is referred to as a weighted power headroom. The weighted power headroom provides that reports from different users can be weighted in proportion to their number of occurrences. Equation (11) below describes the drive weighted power margin.

Formula (11)

The fourth example integrated PM is referred to as the uplink scheduler MCS ratio. The uplink scheduler MCS ratio is the ratio of the resource block cumulative count with uplink MCS value between 0 and 5 (inclusive) and the resource block cumulative count with uplink MCS value between 0 and 28 (inclusive) during the most recent RI. The corresponding KPI was named ulshulermcsratio, as shown in formula (12) below. A ratio close to 1 indicates that a large number of users are provided with an uplink MCS value equal to or lower than 5 (corresponding to QPSK modulation symbols with a low coding rate), resulting in fewer payload bits transmitted per scheduling opportunity. This indicates that the uplink throughput is low. It should be noted that equation (12) is only calculated when the denominator is greater than zero, otherwise the ratio is marked with a special value such as-1, null, etc.

Formula (12)

A fifth example integrated PM is referred to as an uplink low SINR ratio. The low SINR ratio is the ratio of the cumulative count of uplink SINR values in the range of [ -10dB, 2dB ] (before outer loop compensation) during the most recent RI to the cumulative count of uplink SINR values in the range of [ -10dB, 30dB ] (before outer loop compensation). The corresponding KPI is named ULlowSinrRatio and is described in formula (13) below. A cell with a ullowsinratio value close to 1 has a large proportion of users experiencing uplink SINR values equal to or below 0 dB. Note that equation (13) is calculated when the denominator is greater than zero, otherwise the ratio is marked with a special value such as-1, null, etc.

Formula (13)

The sixth example composite PM is referred to as a weighted uplink SINR. The weighted uplink SINR is the weighted uplink SINR during the latest RI calculated using uplink SINR measurements (before or after outer loop compensation). Equation (14) below describes the weighted uplink SINR

Formula (14)

The seventh example integrated PM is referred to as an uplink reception MCS ratio. The uplink reception MCS ratio is a ratio of an accumulated count during the latest RI when receiving a PUSCH with an MCS value between 0 and 5 (inclusive) and an accumulated count when receiving a PUSCH with an uplink MCS value between 0 and 28 (inclusive). The corresponding KPI is named ulreceivedsmcsratio and is described in equation (15) below. A ratio close to 1 indicates that a large number of users are provided with an uplink MCS value equal to or lower than 5, resulting in fewer bits to transmit per scheduling opportunity. This indicates that the uplink throughput is low. Note that equation (15) is calculated when the denominator is greater than zero, otherwise the ratio is marked with a special value such as-1, null, etc.

Formula (15)

The eighth example integrated PM is referred to as the downlink scheduler MCS ratio. The downlink scheduler MCS ratio is the ratio of the cumulative count of resource blocks within the most recent RI with downlink MCS values between 0 and 5 (inclusive) to the cumulative count of resource blocks within the downlink MCS values between 0 and 28 (inclusive). The corresponding KPI is named dlschduler mcsratio and is described in formula (16) below. A ratio close to 1 indicates that a large number of users are provided with a downlink MCS value equal to or lower than 5, resulting in fewer bits transmitted per scheduling opportunity. This indicates that the downlink throughput is low. Note that equation (16) is calculated when the denominator is greater than zero, otherwise the ratio is marked with a special value such as-1, null, etc.

Formula (16)

The ninth example integrated PM is referred to as a weighted downlink MCS. The weighted downlink MCS is a weighted average of MCS values obtained by considering the number of times each MCS level is used during a time interval resulting in the most recent RI. The corresponding KPI is named WeightedDLSchedulMcs and is described in formula (17) below. Note that equation (17) is calculated when the denominator is greater than zero, otherwise the ratio is marked with a special value such as-1, null, etc.

Formula (17)

The tenth example integrated PM is referred to as a weighted transmission downlink MCS. The weighted transmission downlink MCS is described in equation (18) below and includes a weighted average of the modulation and coding schemes used for PDSCH transmission. A small value of the weighted downlink MCS indicates poor radio conditions, resulting in a reduced downlink IP throughput. The calculation may be based on a histogram count for the number of occurrences of each MC level, or on the number of resources (time domain, frequency domain) allocated to each MCs level. Other embodiments may calculate this quantity averaged over the codeword and spatial layer domains. Note that equation (18) is calculated when the denominator is greater than zero, otherwise the ratio is marked with a special value such as-1, null, etc.

Formula (18)

The eleventh example integrated PM is referred to as a weighted downlink CQI. The weighted downlink CQI is described in equation (19) and is calculated by weighted average channel quality, where the weighting is proportional to the histogram count for each CQI level between 0 and 15. A small value of the weighted CQI indicates poor radio conditions, resulting in a reduced downlink IP throughput. In other embodiments, the weighting is proportional to the number of radio resources (time, frequency, and spatial) dedicated to the transmission of each CQI level.

Formula (19)

The twelfth example integrated PM is referred to as downlink control channel utilization. The downlink control channel utilization is described in equation (20) and is calculated by a weighted average of Control Channel Element (CCE) sizes used for transmission of downlink control channel information to the end users. CCE sizes vary between 1, 2, 4 and 8, with larger CCE sizes typically used for users with poorer radio quality. Thus, a weighted average equal to or exceeding 4 indicates poor downlink radio quality. Note that other embodiments may include statistics of CCE allocation failure percentages for scheduling downlink allocations, uplink grants, etc.

Formula (20)

The thirteenth example integrated PM is referred to as control format utilization. The control format utilization is used to determine the control channel overhead (1, 2 or 3 OFDM symbols), where a larger overhead means a reduced resource availability for the data channel. In certain embodiments, the CFI ratio may be calculated by the ratio of the histogram count with a CFI value of 3 to the total number of counts, as described in equation (21) below.

Formula (21)

Preprocessing the PM data may also include generating a KPI hierarchy. During each RI, each eNB reports a large number of KPIs. The received KPIs provide detailed information about the health and status of the network between the previous RI and the current RI. The received KPIs may span different protocol layers.

A KPI tree hierarchy refers to the hierarchical relationship between KPIs. FIG. 5b illustrates a flowchart depicting different KPIs (e.g., KPIs)₁、KPI₂、KPI₃、KPI₄And KPI₅) Example KPI tree hierarchy 520 of example hierarchical relationships between. The different KPIs of fig. 5b belong to a certain service class. The root node 521 is the KQI of interest. The service class KQI defines the root node or layer 0 KPI. Each KPI of layer i provides fine grained information about the eNB status. In some embodiments, the tree may or may not be a binary tree. The hierarchy may be obtained by a human domain expert or by a mix of machine learning and EDA tools (e.g., RAPIDMINER).

The paths linking each parent node to the child nodes of the KPI tree hierarchy 520 are mutually exclusive. For example, node 522 is a parent node of node 524, and node 524 is a child node of node 522. Each child node in the tree corresponds to a KPI or KPIs that satisfy some inequality constraint with a confidence that exceeds a minimum confidence threshold. In certain embodiments, the minimum confidence threshold is 85%. KPIs located higher in the tree (such as node 522 compared to node 524) are associated with symptoms of an anomaly. The KPIs at the bottom of the tree provide a root cause explanation for the anomaly.

In some embodiments, the KPI tree hierarchy may include more than the two levels shown in KPI tree hierarchy 520. For example, a deeper tree can provide more detailed explanations, such as listing more signs and root causes associated with a KQI anomaly.

For example, the first level of the tree corresponds to the rules at nodes 522 and 523: the rules at node 522 are based on relationships

And the rules at node 524 are based on the relationship KPIs₁≥T₁，

The second level of the KPI tree hierarchy 520 provides a root cause explanation. For example, node 524 provides the current KPI₁＜T₁，

Root cause explanation of the time. Node 526 provides the current KPI₁＜T₁，KPI₂＜T₂，

Root cause explanation of the time. Node 527 provides the current KPI₁≥T₁，KPI₃≤T₃，KPI₄＞T₄，

Root cause explanation of the time.

Fig. 5c illustrates a flow diagram 530 for constructing a KPI tree hierarchy in accordance with various embodiments of the present disclosure. In certain embodiments, RCA framework 400 of FIG. 4a generates a KPI tree hierarchy, such as KPI tree hierarchy 520.

Once generated, the KPI hierarchy tree explains the underlying root cause and associated symptoms that accompany the anomaly. Typically, the first step comprises deriving a rule comprising a form of

Is constrained by a minimum confidence. First, while the underlying rules may provide high confidence, the accompanying KPIs may or may not be the root cause(s) of the anomaly. For example, the downlink will be lowRules that associate traffic capacity with IP throughput anomalies may provide high confidence, but the underlying root cause-leading to low traffic capacity-is either due to poor radio conditions at the cell user, or due to no cell accessibility or due to insufficient traffic demand at the cell. Finding the underlying root cause requires a deep knowledge of the events that lead to the root cause. One way to do this is by linking (e.g., using a logical term AND (AND)) antecedents belonging to one or more base rules to form more complex rules. By linking the rules together, the overall confidence of the linked rules is improved. The chaining rules also help to generate insight into causal relationships that lead to IP throughput anomalies. Second, under certain specific scenarios, various singleton rules do not perform well. For example, ground rules that provide high confidence

Possibly provided with a second event KPI₂＜t₂Is a low confidence of the condition. For example, if equation (22) below corresponds to a first confidence level and equation (23) below corresponds to a second confidence level, the overall confidence level may be high on average, but a single term such as

The confidence of (c) may be relatively low.

Formula (22)

Formula (23)

Formula (24)

To generate a KPI tree as described in flow diagram 530,

a set of reported KPIs defined to be associated with a KQI anomaly. That is, for each KQI(s), AN_KQI1, where s ∈ A_KQI. The goal is to derive a rule set R ═ R_iI1, 2. For each rule R ∈ R, let

A set of KPI data representing all reports that satisfy a rule antecedent. KPIs can be discretized or continuously valued. In some embodiments, it is assumed that KPIs are integers that range in value from 1 to M. Note that the rule sets within (i) R are mutually exclusive. That is, each PM sample S ∈ S satisfies either a zero rule within R or a rule within R, (ii) each R_iE R has a confidence that exceeds a predetermined threshold, such as 85%, and (iii) maximizing the overall hit rate HR is based on

Identifying a set of KPIKs { KPIs _l1 ≦ L, such that each KPI exceeds the threshold with confidence is the ground rule (i.e.,

) A part of (a). In one method of deriving base rules, a machine learning algorithm (such as machine learning training model 414 of FIG. 4 a) such as association rule mining is applied. In some embodiments, a human expert may determine the exemplary set of KPIs to associate a KPI value or range of values with a KQI anomaly. In other embodiments, a combination of rule mining and human expert domain knowledge may also be applied.

In one embodiment, each combination of KPIs may be cascaded

And evaluating confidence of the chain rule toForming a chain rule. If the rule does not meet the desired confidence threshold, it is discarded. In certain embodiments, if a rule does not meet a desired confidence threshold, then the rule and all branches originating from the rule are discarded if q KPIs-q are algorithm or design parameters-linked to the rule that do not meet the confidence threshold.

In an embodiment, if a rule does not meet the desired confidence threshold, it is retained until the subsequent branch (obtained by linking additional KPIs to the rule) achieves the necessary target confidence. Such embodiments may include configurable parameters to control the number of different KPIs linked to the rule if the rule itself does not meet the desired confidence threshold.

An alternative embodiment includes running a rule mining algorithm on progressively smaller data sets. First, as with previous methods, rule mining is used on large data sets (1 or more days) to create ground rules. Once the ground rules are found and ranked using one of the ranking methods described above, the data set is reduced to include only samples for which the ground rules are true. Rule mining is then performed again on the smaller data set to discover second order rules. This process runs in the order of each primitive rule. Once the second order rules are found, they can be sorted and the process can be run again. This process will continue until no higher order rule is found that matches the minimum confidence requirement.

In yet another alternative embodiment, the composite rule is generated iteratively. Flowchart 530 describes this embodiment. For example, the first step includes ranking the KPI group K in order of importance scores, from high to low. The importance scores may be assigned to the KPIs with the highest hit rate starting from the highest importance and the lowest importance to the KPIs with the lowest hit rate. In embodiments of the invention, human domain experts may rank KPIs in order of their perceived qualitative importance. For example, the highest importance KPI may be set to the availability KPI (or the percentage of time that the cell is available during the time interval that results in the most recent reporting interval), the next highest KPI may be set to the accessibility KPI, and so on.

For example, the processor evaluates the confidence level of each of the following M basic rules:

among them, KPI1 is the KPI with the highest importance score among K. The M base rules and their confidence values are stored in R. For L located within 2 and L and each R located within R, the rule and all chained rules derived from the rule are discarded from R if the following two conditions are met. A first condition is satisfied when the confidence of the rule is below a minimum threshold. The second condition is met when the confidence of the linking rule obtained by linking (all combinations of) the first q KPIs (in order of importance) to the rule is below a minimum threshold. Here, q is a configurable parameter between 1 and L-1 (inclusive). For L within 2 and L and for each M between 1 and M (inclusive), the processor passes the rule KPI_lM links to r to evaluate the confidence of the linking rule. When the chain rule meets the desired confidence threshold, the chain rule and its confidence value are added to R. Notably, pruning is carried out

To generate a KPI tree hierarchy, a set of rules is identified (step 532). In some embodiments, the rules may be identified from the machine learning training model 414 of FIG. 4 a. In other embodiments, the rules are created by a human expert. In certain embodiments, the machine learning training model 414 of FIG. 4a is combined with a human expert to identify rules. The rules associate KPI values or value ranges with KQI anomalies.

In step 534, a set of KPIKs is identified (collected) { KPIs _l1 ≦ L ≦ L, such that each KPI is a ground rule with a confidence that exceeds a threshold (i.e.,

) A part of (a).

In step 536, the set of KPIs is assigned an importance score. The importance scores are based on a ranking, where the highest importance is assigned to the KPI with the highest hit rate, and the lowest importance is assigned to the KPI with the lowest hit rate. For example, the highest importance KPI may be set to the availability KPI (or the percentage of time that the cell is available during the time interval that results in the most recent reporting interval), while the next highest KPI may be set to the accessibility KPI, and so on. In other embodiments, the human domain expert may order the KPIs in order of their perceived qualitative importance. In step 538, the KPIs are organized into a list-based. The order of the KPIs is in descending order of their importance scores. In certain embodiments, the KPI list is stored in a rules database.

In step 540, for each rule, the linked (or daisy-chained) KPIs are iterated starting with the KPI of highest importance to form the changed rule. In some embodiments, after KPIs are iteratively linked, rules with confidence scores below a threshold are removed. Rules are removed after KPI iterative changes, because additional KPIs are subsequently linked to the rules, the confidence of the chained rules may exceed a desired threshold.

In step 542, the rule and all chained rules derived from the rule are removed from the tree if (i) the confidence score of the rule is below a threshold and (ii) the confidence score of the chained rule obtained by linking the previous KPI to the rule is below a threshold. For example, starting with each leaf node j, the processor checks whether the support, confidence and hit rate of the composite rule that terminates at j are the same as those obtained using the composite rule that terminates at j's parent node. All leaf nodes that are TRUE above are discarded. In step 544, the chaining rules are stored in memory. In some embodiments, a root cause explanation is generated for each chaining rule.

Although fig. 5c shows one example of a flow chart 530 for constructing a hierarchy of KPI trees, various changes may be made to fig. 5 c. For example, while shown as a series of steps, various steps in FIG. 5c could overlap, occur in parallel, or occur any number of times. As another example, a similar process may be used to create a hierarchical tree for alert-based rules.

Fig. 5d illustrates an example output 550 of a KPI tree hierarchy according to various embodiments of the present disclosure. For the example of a given tree 552, a rule such as rule 554 may be associated therewith.

Preprocessing may also include jointly processing the warning and PM data. Note that the warning is based on an event triggered by malfunction at the eNB, and the PM data is based on each cell and received at a specific interval. Warnings are critical to troubleshooting and diagnostics that can lead to reduced network KQI hardware and software problems. Thus, the ability to interpret and predict the occurrence of a PM anomaly based on alerting the eNB to occur is provided by jointly processing the alerts and PM data.

In certain embodiments, to extract information about the relationship between PM anomaly occurrences and warning occurrences, a time-domain correlator processes the time-stamped PM anomaly data and warning data to determine time instances at which they occur together or in close proximity to each other.

In some embodiments, correlations are presented on a user interface that depict a distribution of the number of occurrences of different types of alerts associated with each exception category. For example, the user interface includes an analysis that provides a warning and a time distribution of the occurrence of the anomaly for a particular duration of time (e.g., the past hour, day, etc.). The end user can use this information to determine the most important alerts that have a service level impact and further determine the root cause of the alert occurrence.

In certain embodiments, a causal relationship exists between multiple alerts. For example, if warning a is active at time T ═ T, the causal relationship may indicate the probability of whether warning B is also active at T ═ T. For another example, if warning a is activated at time T ═ T, the causal relationship may indicate the probability of whether warning B is activated at a pre-specified time before warning a occurs. Similarly, if warning a is activated at time T-T, the causal relationship may indicate the probability of whether warning B will be activated at a future pre-specified time (after warning a occurs).

The RCA framework 400 of fig. 4a may use the correlation between alerts to derive information such as whether the occurrence of a low severity alert may be used to predict the occurrence of a high severity alert within a certain time interval in the future.

In some embodiments, anomalies whose timestamps lie outside a certain time interval may be filtered out when associating a warning occurrence with an anomaly occurrence. Thus, the remaining anomalies will have at least one warning (e.g., a single RI) within the time interval, as shown in fig. 5 e. Thereafter, the number of alarm occurrences within 1 RI of the KQI abnormality was counted. For example, fig. 5e shows a graph 560 for correlating KQI anomalies and warning data, in accordance with various embodiments of the present disclosure. As shown in graph 560, the first 4 warnings have 2 anomalies within one RI, while the remaining 8 warnings have only 1 anomaly within 1 RI in which they occur. When a temporal correlation between the alert and the KQI anomaly is identified, the alert occurrence provides a root cause explanation for the occurrence of the KQI anomaly. In some embodiments, the user interface may display a warning and related output of a KQI anomaly over a time frame (e.g., a twenty-four hour period).

That is, the processor orders and normalizes the occurrence of an alert with the occurrence of a KQI anomaly within a particular time interval of the alert (such as one reporting interval). KQI anomalies whose timestamps are outside of an alert interval (such as a reporting interval) are not ordered with alerts.

In certain embodiments, RCA framework 400 of fig. 4a may use machine learning to identify rules including alerts and one or more KPIs as antecedents and KQI anomalies as results. For example, PM data associated with an alert may cause a KQI anomaly at a later point in time. By correlating PM data with alert data, RCA framework 400 of fig. 4a is enabled to proactively detect KQI anomalies based on alert occurrences.

After the ordering regularization, rules are identified using machine learning algorithms (such as association rule mining, random forest). The rules can adopt

In the form of (1). That is, each time a warning occurs, in the case where the confidence is greater than the threshold, it indicates that a KQI abnormality has occurred.

The alert data includes (i) a timestamp that triggers the alert and (ii) a timestamp that clears the alert. The alert is cleared by a software routine that detects that the initial problem has corrected itself (possibly due to a system restart), or by a technician working to fix the underlying problem. Regardless of the manner in which the alert is cleared, the time at which the alert is cleared is useful information for determining the root cause of the problem. For example, significant problems often have warnings that persist for long periods of time or that recur soon after clearance. Thus, embodiments of the present disclosure consider PM reports within 1 RI of the start time to the purge time to be relevant to the warning. In certain embodiments, a correlation measurement may be performed between the PM metric and the alert of interest. PM metrics with high correlation coefficients (normalized correlation coefficient size close to 1) are identified as being relevant to the warning.

Fig. 5f illustrates a graph 565 of temporal ordering regularization of alerts and PM data, according to various embodiments of the present disclosure. Graph 565 depicts a timeline 566a that increases to the right in time. Each marker 566b indicates a received PM sample. The PM samples may be received periodically at some predetermined time interval 567 (e.g., every fifteen minutes). For example, the placements of each marker 566b are fifteen minutes apart along the timeline 566 a. Marker 568a identifies the start of the alert and marker 566b indicates the time at which the alert is cleared. The processor correlates PM data on range 569. The range 569 begins a predetermined time before the warning begins (as shown by marker 568 a) and ends when the warning ends (as shown by marker 568 b). In some embodiments, the predetermined time may be 1 RI before the warning begins. Thus, the processor is able to correlate PM data before and during the warning.

Once the ordering regularization of the PM data and each alert type is complete, it is useful to know whether an alert occurred as a good predictor or as a sign of a KQI anomaly. One way to do this is to define rules

That is, the rule declares that a KQI exception exists whenever an alert is active. The confidence of this rule (as described in equation (7)) refers to the score of the warning sample that is also a KQI anomaly. If the score is close to 1, the rule is reliable, so the rank-regularized warning is accompanied by a KQI anomaly. In some embodiments, a machine learning algorithm (such as association rule mining, decision trees, etc.) is applied to the set of sort rulesThe alarm (time stamped) and the corresponding KQI anomaly (taking the value 0 or 1) are validated. The machine learning algorithm identifies a set of alerts whose occurrence causes a KQI anomaly with high confidence and stores it in a database or memory.

Fig. 5g shows a flowchart 570 for processing alerts and PM data using historical data, according to various embodiments of the present disclosure. Flow diagram 570 depicts processing the alert and the PM data using the historical data to generate a rule based on a derived relationship between the PM data and the alert. Flowchart 570 may be performed by RCA framework 400 of fig. 4 a.

In step 572, RCA framework 400 loads PM data and alert data from historical data 410 of fig. 4 a. In step 574, RCA framework 400 sorts the regularized (consolidated) warning data and PM data. For example, RCA framework 400 identifies the start time and end time of a single alert (alert clear time). RCA framework 400 then identifies a subset of PM data that occurred during the alert. Since the warning time is a fixed time, RCA framework 400 identifies a subset of PM data that occurs during the warning. In certain embodiments, RCA framework 400 identifies PM data occurring during the alert and PM data occurring at a predetermined time interval prior to the alert. RCA framework 400 may identify a subset of PM data for the identified warning data.

In step 576, RCA framework 400 generates rules based on the subset of PM data and corresponding warning data. The rules include warnings or KPIs as antecedents, and KQI anomalies as outcomes. The rules may be generated by the machine learning training model 414 of FIG. 4 a. In step 578, the generated rule is stored in a memory (such as memory 360 of FIG. 3a or memory 380 of FIG. 3 b). In some embodiments, only rules with confidence values above a threshold are stored in memory.

Although FIG. 5g shows one example of a flow chart 570 for processing alerts and PM data using historical data, various changes may be made to FIG. 5 g. For example, while shown as a series of steps, various steps in FIG. 5g could overlap, occur in parallel, or occur any number of times.

Rule mining algorithms often have difficulty creating rules for sets of items that appear less frequently. This creates problems when using rule mining to find the relationship between the warning and the anomaly, since anomalies are not as common as normal operations, and warnings are even less common than anomalies. To facilitate the success of the rule miner, embodiments of the present disclosure order-regularize alerts with exceptions rather than with all reported samples. This exaggerates the correlation between rare alarm events, making it easier for the rule mining algorithm to detect anomalies. Further, the rules may be validated by testing all discovered rules against historical batch data from multiple days and multiple locations.

In certain embodiments, correlating the alert and PM data may occur in real time as compared to the historical data depicted in FIG. 5 h. Fig. 5h illustrates a flow diagram 580 for real-time sequencing of regularization and joint processing of alerts and PM data, according to various embodiments of the present disclosure.

To associate the warning and the PM data in real time, the warning of the eNB is first received (step 582). At step 584, the processor marks the most recent PM data as being regularized with the warning time ordering. The processor then maintains a store of which enbs currently have active alarms and which alarms are active. Before receiving the warning clear signal, the processor may mark all PM data received from the eNB as being ordered with the warning signal. For example, the memory may store for each warning a list of PM metrics associated with the warning signal based on correlation measurements derived from historical (batch) data.

Upon receiving the warning signal, the processor loads from memory the stored PM metrics identified as being relevant to the warning and marks the data of these PM metrics as being regularized with the warning signal ordering. In step 586, the processor loads a rule containing an alert as a antecedent and a resulting KQI exception. For example, the loaded rule includes at least one antecedent including a warning time and a result including a KQI

In step 588, the processor confirms that the warning occurrence is accompanied by the occurrence of a KQI anomaly. After confirming that the alert occurrence is accompanied by the occurrence of a KQI anomaly, the processor will declare an alert occurrence in step 590. Then, when a PM metric satisfies at least one rule that contains the resulting KQI anomaly, the corresponding relevant PM metric is considered to be the root cause of the KQI anomaly. In some embodiments, the processor then displays the warning and KPI within the premises of the rule as the root cause of the KQI anomaly.

For example, the processor will load a rule containing an alert as a antecedent and a resulting KQI exception. If the alert occurrence is also accompanied by a KQI anomaly, the processor may declare that the alert occurred and the corresponding PM metric is the root cause of the KQI anomaly occurrence.

Although FIG. 5h illustrates one example of a flow chart 580 for real-time sequencing regularization and joint processing of alerts and PM data, various changes may be made to FIG. 5 h. For example, while shown as a series of steps, various steps in FIG. 5h could overlap, occur in parallel, or occur any number of times.

FIG. 6 illustrates a process 600 for generating a root cause explanation in accordance with various embodiments of the present disclosure. Process 600 may be performed by RCA framework 400 of fig. 4 a. The embodiment of fig. 6 is for illustration only. Other embodiments may be used without departing from the scope of this disclosure.

The process 600 includes a batch processor 602 similar to the batch layer 406 of FIG. 4 a. Batch processor 602 includes knowledge base 604, machine learning engine 606, and filter 610. The batch processor 602 receives historical PM data and alert data to generate rules. The generated rules are used to diagnose the root cause. The process 600 also includes a memory 612 that stores the generated rules. In some embodiments, the memory stores the generated rules above a confidence threshold. Process 600 also includes a rule discovery engine 614 that processes real-time exception data 616. Upon discovering a rule that is satisfied by the real-time exception data 616, a root cause explanation 620 is output. When a rule is not found, post-processing engine 624 attempts to generate a new rule that is stored in memory 612.

The knowledge base 604 maintains various categories for diagnosing uplink and downlink throughput anomalies. The first category is for degraded uplink (coverage hole). Signs or effects of degraded uplink (coverage holes) include a large proportion of users transmitting at maximum power, high MCS < ═ 5 for uplink use (QPSK for low coding rate), and insufficient uplink traffic scheduling. Throughput anomalies belonging to the degraded uplink class are due to coverage holes resulting in poor radio quality on the uplink.

The second category is based on degraded uplink (RF interference). Signs or effects of degraded uplink (RF interference) include high uplink RSSI and interference power, high usage of uplink MCS ≦ 5 (low coding rate QPSK) and uplink traffic scheduling insufficiency. Throughput anomalies that fall into this category are due to high levels of uplink Radio Frequency Interference (RFI). One method for detecting these anomalies is for the composite KPI ullowsinratio to take a value close to 1 for the anomaly samples, since a high RFI on the uplink will cause the number of uplink SINR values to increase, equal to or below 2 dB. In another approach for detecting these anomalies, the ratio metric, ULSchedulMcsRatio, takes a value close to 1 for the anomalous PM samples. This is because high RFI on the uplink results in a large proportion of MCS values in the range between 0 and 5 (corresponding to QPSK modulation at low coding rate).

The third category is based on degraded downlink (RF interference). Signs or effects of degraded downlink (RF interference) include reported downlink channel quality less than 4, high usage of uplink MCS less than or equal to 5 (QPSK for low coding rate) and insufficient uplink traffic scheduling. Throughput anomalies that fall into this category are due to high levels of downlink Radio Frequency Interference (RFI). Degraded downlink can result in the loss of uplink scheduling grants carried over the Physical Downlink Control Channel (PDCCH). Due to the loss of the scheduling grant, the UE will not transmit its uplink data as indicated by its serving cell. Degraded downlink may also result in lost RLC-AM acknowledgements (generated in response to RLC packets on the uplink) and possibly lost TCP acknowledgements. One method for detecting such anomalies is that the weighted downlink channel quality obtained from KPI WeightedDLReceivedCQI can be used to evaluate the downlink radio channel quality. The value of WeightedDLReceivedCQI for an abnormal sample tends to be equal to or lower than 4, while the KPI value for a normal sample is between 0 and 15. Another method for detecting such anomalies is for the ratio metric dlschedulermcsrito to take a value close to 1 for the anomalous PM samples. The reason is that a large proportion of MCS values in the range between 0 and 5 (corresponding to QPSK modulation with a low coding rate) will generally result due to high RFI on the downlink.

The fourth category is based on a degraded H-ARQ block error rate. The symptoms or effects of this category include an uplink H-ARQ block error rate of greater than 50% and a downlink H-ARQ block error rate of greater than 50%. Throughput anomalies belonging to this category are due to high block error rates in the uplink and/or downlink.

The fifth category is based on low uplink traffic demand or cell unavailability. In this category, symptoms or effects include uplink PRB utilization near 0, possible availability KQI anomalies, low traffic capacity and few uplink active users. Anomalies belonging to this RCA category indicate a low demand for uplink traffic. They are accompanied by low PRB utilization, a small number of uplink active users, and possible availability anomalies.

The sixth category is based on low downlink traffic demand or cell unavailability. In this category, symptoms or effects include downlink PRB utilization near 0, possible availability KQI anomalies, and low traffic capacity and a low number of downlink active users. Anomalies belonging to this RCA category indicate a low demand for uplink traffic. They are accompanied by low PRB utilization, low number of downlink active users and possible availability anomalies

The batch processor 602 loads historical (time stamped) data from the historical data 410 a. Historical data 410a may be the same as historical data 410 of fig. 4 a. The historical data 410a includes KPIs and reported alert occurrences for each RI. The output of the batch processor 602 may be a continuous value or a discrete value. The PM data is tagged by the KQI service class based on the output from the anomaly detector 412a, indicating whether an anomaly is observed for that class (i.e., an anomaly is detected for the KQI class m,

otherwise 0).

The machine learning engine 606 (such as association rule mining, decision trees, etc.) generates one or more KPIsOr alert rules associated with the KQI anomaly. The rule refers to a relationship that relates a result (occurrence of a KQI abnormality) to a cause. The machine learning engine 606 may apply machine learning algorithms to the pre-processed PM data to generate a set of rules that relate KPIs (or antecedents) to the KQI (results) in question. An example of a rule associating two KPIs A and B with a KQIC may be

For RCA, the antecedent cause corresponds to one or more KPIs as the root cause; whenever the KQI is abnormal, the result is a boolean variable equal to true. In some embodiments, the rules are generated by rule mining.

The filter 610 uses equations (6), (7), and (8) above to identify the generated rules that satisfy the various thresholds by the machine learning engine 606. For example, equation (7) describes the confidence that a portion of the anomaly sample complies with the rule. The hit rate of equation (8) describes the portion of the KQI anomaly associated with the rule. Once the rules are generated, after design goals such as minimum support and confidence thresholds are met, they are output from the batch processor 602 and stored in the memory 612. For example, the generated rule may be compared to a confidence threshold and the rule saved to memory 612 based on the comparison.

Rule discovery engine 614 receives real-time exception data 616. The real-time anomaly data 616 may include input stream KPI (one sample per eNB per RI) data and warning data that is regularized with the timestamp ordering of PM anomaly samples.

The data is labeled for each RI according to whether a KQI anomaly and an anomaly class exist. For each exception sample, rule discovery engine 614 examines the rules from memory 612 and identifies an exception sample corresponding to one or more of the rules.

Upon discovering the rule, in decision 618, a corresponding root cause explanation 620 is provided on the graphical user interface. Providing a root cause explanation 620 may trigger a set of remedial actions, such as adjusting a remote electrical tilt at the eNodeB.

When the rule discovery engine 614 does not discover, the process 600 may perform post-processing of the PM and warning data to identify new rules at decision 618. Upon identifying the new rule 626, the new rule is stored in the memory 612.

Although FIG. 6 shows one example of a process 600 for generating a root cause explanation, various changes may be made to FIG. 5. For example, while shown as a series of steps, various steps in FIG. 6 could overlap, occur in parallel, or occur any number of times.

Fig. 7 illustrates an example decision tree 700 for associating root cause analysis for a certain KQI anomaly according to various embodiments of the present disclosure. The embodiment of fig. 6 is for illustration only. Other embodiments may be used without departing from the scope of this disclosure.

Decision tree 700 is a flow chart-like tool in which each internal node "tests" attributes under conditions. Each branch represents the result of the test, and the path from the root to the leaf represents a rule. Decision tree 700 identifies rules related to RCA for a particular KQI anomaly. To generate a decision tree, such as decision tree 700, the processor identifies various parameters, such as maximum tree depth, minimum samples of splits, and minimum samples of leaves. The processor first generates N trees. Next, the processor initializes rule set R { }. : for each tree N ∈ N, the processor defines a set P: set of paths to the leaf with the KQI anomaly. For each path P ∈ P, the processor converts P to rule R, such that R ← R £ R { R }. For each R ∈ R, the processor calculates a confidence, support, and hit rate, and discards R if a minimum threshold is not met (based on the confidence, support, and hit rate).

FIG. 8 illustrates an example method 800 for discovering and diagnosing network anomalies. The method 800 may be performed by any one of the eNB of fig. 1, the eNB 102 of fig. 3b, the UE of fig. 1, the UE116 of fig. 3a, or the RCA framework 400 of fig. 4 a. For ease of explanation, method 800 is described as being performed by an electronic device (such as UE116 or a server of fig. 3 a) that includes RCA framework 400 of fig. 4 a. The embodiment of fig. 8 is for illustration only. Other embodiments may be used without departing from the scope of this disclosure.

In step 802, the electronic device extracts features based on samples obtained by discretizing the KPI data and the warning data. In step 804, the electronic device generates a rule set based on the feature. The rules may be generated instantaneously or based on historical data. The rules indicate the interpretation of the exception. That is, the rules may be applied to derive a root cause explanation for anomalies in the wireless cellular network. The rules may include one of a left-hand side (LHS) and a right-hand side (RHS). A set of KPIs and thresholds may be applied to LHS and RHS. It is noted that if the KPI in the RHS satisfies a threshold on the RHS, it corresponds to an anomaly sample, whereas if the KPI in the LHS satisfies a threshold on the LHS and if the KPI in the RHS corresponds to an anomaly sample, the value of the KPI in the LHS provides a possible symptom or root cause explanation.

In step 806, the electronic device identifies a first sample of the samples as a normal sample or an abnormal sample. The samples may be real-time KPIs and warning data. When the first sample is identified as an anomalous sample, the electronic device identifies a first rule of the set of rules corresponding to the anomaly of the first sample in step 808. The first rule indicates the signs and root cause of the anomaly. In step 810, the electronic device applies the root cause to derive a root cause explanation for the anomaly. Root cause explanation may be based on KPIs associated with the root cause and symptoms of anomalies included in the sample. In step 812, the electronic device performs a corrective action based on the first rule to resolve the anomaly.

Although FIG. 8 illustrates one example of a method 800 for discovering and diagnosing network anomalies, various changes may be made to FIG. 8. For example, while shown as a series of steps, various steps in FIG. 8 could overlap, occur in parallel, or occur any number of times.

Although the figures show different examples of user equipment, various changes may be made to the figures. For example, the user device may include any number of each component in any suitable arrangement. In general, the drawings do not limit the scope of the present disclosure to any particular configuration. In addition, although the figures illustrate an operating environment in which the various user device features disclosed in this patent document may be used, these features may be used in any other suitable system.

Although the present disclosure has been described with exemplary embodiments, various changes and modifications may be suggested to one skilled in the art. The present disclosure is intended to embrace such alterations and modifications as fall within the scope of the appended claims.

Claims

1. An apparatus for discovering and diagnosing network anomalies, the apparatus comprising:

a communication interface configured to receive Key Performance Indicator (KPI) data and warning data; and

a processor operatively connected to the communication interface, the processor configured to:

extracting features based on samples obtained by discretizing the KPI data and the warning data,

generating a set of rules based on the features, wherein a portion of the samples that satisfy a rule correspond to an anomaly,

the sample is identified as either a normal sample or an abnormal sample,

in response to identifying the sample as an anomalous sample, identifying one or more rules satisfied by the sample,

for one or more rules satisfied by the samples identified as anomalous samples, deriving KPIs associated with symptoms and root causes,

deriving a root cause explanation for an anomaly included in the sample identified as an anomaly sample based on KPIs associated with symptoms and root causes of the anomaly; and

performing a corrective action based on the one or more rules to resolve the anomaly.

2. The apparatus of claim 1, wherein the processor is further configured to:

receive an indication of an alert from an eNB via the communication interface;

in response to receiving the indication, identifying a set of KPI data from an eNB received between the start of the alert until the alert is cleared;

determining whether a Key Quality Indicator (KQI) anomaly is associated with the alert; and

identifying the set of KPI data and the alert as a root cause of a KQI anomaly when the KQI is associated with the alert.

3. The apparatus of claim 1, wherein to generate the set of rules, the processor is configured to:

identifying a set of KPI data occurring between a predetermined time before the warning begins and the warning clearing; and

generating a new rule identifying the anomaly based on the warning data associated with the warning and the set of KPI data.

4. The apparatus of claim 1, wherein:

the apparatus also includes a memory configured to store historical KPI data and historical warning data; and is

To generate the set of rules, the processor is further configured to:

new rules are generated from a set of historical KPI data and alerts of historical alert data,

deriving a confidence and a hit rate for the new rule, wherein the confidence is based on scores of samples satisfying the new rule that are identified as anomalous samples and the hit rate is based on scores of anomalous samples satisfying the new rule,

comparing the confidence and hit rate of the new rule with a confidence threshold and hit rate threshold, respectively, and

storing the new rule when a confidence associated with the new rule exceeds a confidence threshold and a hit rate associated with the new rule exceeds a hit rate threshold.

5. The apparatus of claim 1, wherein:

to discretize the KPI data, the processor is configured to:

selecting a set of features based on the KPI data,

deriving additional features based on a composite KPI feature derived from said set of features according to said KPI data, an

Discretizing the set of features comprising KPI data and the composite KPI feature; and is

The processor is further configured to:

combining discretized KPI data with the warning data, and

generating a KPI hierarchy tree based on a portion of the set of rules, wherein the portion of the set of rules in the KPI hierarchy tree provides a root cause explanation for a first anomaly.

6. The apparatus of claim 1, further comprising a display,

wherein the processor is further configured to display on the display at least one of:

the one or more rules corresponding to the exception,

one or more KPI data associated with a symptom and root cause of the anomaly,

the root cause explanation of the abnormality, and

a corrective action to resolve the anomaly.

7. The apparatus of claim 1, wherein to identify the one or more rules, the processor is configured to:

identifying a portion of the rule set associated with the anomaly for which a confidence score is above a threshold;

dividing the portion of the set of rules into a ground rule group and a non-ground rule group;

when a confidence score of a rule pair within the non-base set of rules is above a threshold, including a directed edge between the rule pair, wherein the directed edge indicates a relationship between the rule pair and a rule within the base set of rules; and

identifying a first set of KPI data associated with a first rule of one or more rules within the base set of rules as not including an incoming edge as a root cause, and identifying a second set of KPI data associated with any remaining rules of the one or more rules as a symptom.

8. The apparatus of claim 1, wherein the processor is further configured to:

sorting the historical data into a plurality of groups based on the time stamps;

identifying parameters associated with the plurality of groups; and

predicting future anomalies based on the parameters.

9. A method for discovering and diagnosing network anomalies, the method comprising:

receiving Key Performance Indicator (KPI) data and warning data;

extracting features based on samples obtained by discretizing the KPI data and the warning data;

generating a set of rules based on the features, wherein a portion of the sample that satisfies a rule corresponds to an anomaly;

identifying the sample as a normal sample or an abnormal sample;

in response to identifying the sample as an anomalous sample, identifying one or more rules satisfied by the sample;

deriving KPIs associated with the symptoms and root causes for one or more rules satisfied by the samples identified as anomalous samples;

10. The method of claim 9, further comprising:

receiving an indication of an alert from an eNB;

11. The method of claim 9, further comprising:

12. The method of claim 9, further comprising:

storing historical KPI data and historical warning data;

generating new rules from a set of historical KPI data and alerts of historical alert data;

deriving the new rule confidence and hit rate, wherein the confidence is based on the scores of the samples that satisfy the new rule that are identified as anomalous samples and the hit rate is based on the scores of the anomalous samples that satisfy the new rule;

comparing the confidence and hit rate of the new rule with a confidence threshold and a hit rate threshold, respectively; and

13. The method of claim 9, wherein:

discretizing the KPI data comprises:

selecting a set of features based on the KPI data,

deriving additional features based on a composite KPI feature derived from said set of features according to said KPI data, and

The method further comprises the following steps:

combining discretized KPI data with the warning data; and

14. The method of claim 9, further comprising displaying on the display at least one of:

the one or more rules corresponding to the exception,

one or more KPI data associated with a symptom and root cause of the anomaly,

the root cause explanation of the abnormality, and

a corrective action to resolve the anomaly.

15. The method of claim 9, further comprising: