CN111970157A - Network fault root cause detection method and device, computer equipment and storage medium

Info

Publication number: CN111970157A
Application number: CN202010881226.2A
Authority: CN
Inventors: 温子将
Original assignee: Guangzhou Huaduo Network Technology Co Ltd
Current assignee: Guangzhou Huaduo Network Technology Co Ltd
Priority date: 2020-08-27
Filing date: 2020-08-27
Publication date: 2020-11-20
Anticipated expiration: 2040-08-27
Also published as: CN111970157B

Abstract

The application discloses a network fault root cause detection method, a device, computer equipment and a storage medium. The method comprises the following steps: acquiring state information of a target link of audio and video transmission, wherein the state information is used for representing operation data of at least one operation state of the target link; dividing the operation data into a positive sample number set and a negative sample number set according to a preset marking rule; calculating a dimension index of each operation dimension of the target link based on the positive sample number set and the negative sample number set, wherein the dimension index is used for representing the influence factor of each operation dimension on the operation state of the target link; and determining the aggregative dimension influencing the operation state of the target link according to the dimension index. In this way, the causes of errors that reduce the transmission efficiency of the network transmission link can be located quickly, which improves locating efficiency and saves labor cost.

Description

Network fault root cause detection method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of network transmission, and in particular, to a method and an apparatus for detecting a network fault root cause, a computer device, and a storage medium.

Background

Live webcasting inherits and extends the advantages of the internet. By broadcasting online in video form, content such as product demonstrations, conferences, background introductions, scheme evaluations, online surveys, interviews, and online training can be published to the internet on site, and the promotional effect of an event is enhanced by the internet's intuitiveness, speed, expressive form, rich content, strong interactivity, freedom from geographic limits, and ability to segment audiences.

Live webcasting involves large data flows and strict real-time transmission requirements, and therefore demands high network stability from the service architecture. However, constrained by the current state of global communication network infrastructure, live events on a live broadcast platform, such as broadcast, teaching, and conference activities, often cannot proceed normally because of exceptions. Once an anomaly occurs, the parties responsible for network support typically troubleshoot the problem using data associated with the anomaly, and a wide variety of related solutions has emerged.

Specifically, in a live webcast scenario, when a large-scale abnormality occurs, for example a sudden increase in the stall rate, an alarm can currently be generated from core indicators collected at the audio and video viewer terminals. However, the inventor of the present application has found that the current alarm mechanism only serves as a prompt: the cause cannot be located immediately, manual investigation is generally the only option, and the process is very time-consuming.

Disclosure of Invention

The application provides a network fault root cause detection method, and correspondingly also provides a network fault root cause detection device, computer equipment and a storage medium.

In order to solve the technical problem, the following technical scheme is adopted in the application:

one of the objectives of the present application is to provide a method for detecting a network failure root cause, which includes:

acquiring state information of a target link of audio and video transmission, wherein the state information is used for representing operation data of at least one operation state of the target link;

dividing the operation data into a positive sample number set and a negative sample number set according to a preset marking rule;

calculating a dimension index of each operation dimension of the target link based on the positive sample number set and the negative sample number set, wherein the dimension index is used for representing influence factors of each operation dimension on the operation state of the target link;

and determining the aggregation dimension influencing the running state of the target link according to the dimension index.

In a further embodiment, the acquiring the state information of the target link of the audio/video transmission includes:

acquiring abnormal alarm information of a target link;

and acquiring the state information of the target link according to the abnormal alarm information.

In a preferred embodiment, the marking rule includes enumerated-value marking and threshold marking;

when the operation data is a discontinuous variable, the dividing the operation data into a positive sample number set and a negative sample number set comprises:

dividing the operation data into a positive sample number set and a negative sample number set according to a preset enumeration value;

when the operation data is a continuous variable, the dividing the operation data into a positive sample number set and a negative sample number set comprises:

and dividing the operation data into a positive sample number set and a negative sample number set according to a preset first threshold value.

In an optional embodiment, the calculating the dimension index of each operation dimension of the target link includes:

counting positive sample numbers and negative sample numbers corresponding to the positive sample number set and the negative sample number set in each operation dimension;

calculating an evidence weight value of each operation dimension according to the positive sample number and the negative sample number;

calculating the information value of each operation dimension according to the evidence weight value;

and accumulating at least one information value of the same operation dimension in a preset time period by taking the operation dimension as a limiting condition to generate a dimension index corresponding to each operation dimension.

In a preferred embodiment, the operation data includes continuous variables, and the counting the positive and negative samples in the positive and negative sample sets corresponding to the operation dimensions includes:

according to a preset box separation rule, carrying out discretization processing on continuous variables in the operating data to convert the continuous variables into a plurality of boxes;

and respectively counting the positive sample number and the negative sample number in the positive sample number set and the negative sample number set of the plurality of bins.
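As a rough illustration of the binning and counting steps above, the following Python sketch uses equal-frequency cut points. The function names (`equal_frequency_edges`, `assign_bin`, `count_by_bin`) and the label convention (1 for positive, -1 for negative) are assumptions made for this example and do not come from the application.

```python
from typing import Dict, List, Sequence


def equal_frequency_edges(values: Sequence[float], n_bins: int) -> List[float]:
    """Return n_bins - 1 interior cut points so bins hold roughly equal counts."""
    if n_bins < 2 or not values:
        return []
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        # Position of the k-th quantile boundary in the sorted sample.
        idx = min(len(ordered) - 1, (k * len(ordered)) // n_bins)
        edges.append(ordered[idx])
    return edges


def assign_bin(value: float, edges: Sequence[float]) -> int:
    """Map a continuous value to a bin index in the range 0..len(edges)."""
    bin_index = 0
    for edge in edges:
        if value >= edge:
            bin_index += 1
    return bin_index


def count_by_bin(values: Sequence[float], labels: Sequence[int],
                 edges: Sequence[float]) -> List[Dict[str, int]]:
    """Count positive (label 1) and negative (label -1) samples in each bin."""
    counts = [{"bad": 0, "good": 0} for _ in range(len(edges) + 1)]
    for value, label in zip(values, labels):
        key = "bad" if label == 1 else "good"
        counts[assign_bin(value, edges)][key] += 1
    return counts
```

Heavily repeated values can produce duplicate cut points and therefore empty bins; a business-defined set of edges could be passed to `count_by_bin` instead.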

In a further embodiment, the determining the aggregative dimension affecting the operation status of the target link according to the dimension index includes:

determining the operation dimension of which the dimension index is larger than a preset second threshold value in all the operation dimensions as a target operation dimension;

sorting at least one information value corresponding to the target operation dimension in a descending order by taking the numerical value of the information value as a sorting condition;

and determining the target operation dimension with the maximum information value and the corresponding evidence weight value larger than zero as an aggregative dimension according to the arrangement result.

In a preferred embodiment, the determining the aggregative dimension affecting the operation status of the target link according to the dimension index includes:

determining a root cause node corresponding to the aggregative dimension, wherein the root cause node is a physical node for generating the operation data corresponding to the aggregative dimension;

and outputting the label information of the root cause node.

The present application provides a network failure root cause detection apparatus for solving the above technical problem, which includes:

the acquisition module is used for acquiring state information of a target link of audio and video transmission, wherein the state information is used for representing operation data of at least one operation state of the target link;

the marking module is used for dividing the running data into a positive sample number set and a negative sample number set according to a preset marking rule;

a processing module, configured to calculate a dimension index of each operation dimension of the target link based on the positive sample number set and the negative sample number set, where the dimension index is used to characterize an influence factor of each operation dimension on the operation state of the target link;

and the analysis module is used for determining the aggregative dimension influencing the running state of the target link according to the dimension index.

Optionally, the network failure root cause detecting apparatus further includes:

the first acquisition submodule is used for acquiring abnormal alarm information of a target link;

and the first processing submodule is used for acquiring the state information of the target link according to the abnormal alarm information.

Optionally, the marking rule includes enumerated-value marking and threshold marking, and the network fault root cause detection device further comprises:

the first marking sub-module is used for dividing the operation data into a positive sample number set and a negative sample number set according to a preset enumeration value when the operation data is a discontinuous variable;

and the second marking submodule is used for dividing the operation data into a positive sample number set and a negative sample number set according to a preset first threshold when the operation data is a continuous variable.

the first statistic submodule is used for counting the positive sample number and the negative sample number corresponding to the positive sample number set and the negative sample number set in each operation dimension;

a first calculating submodule, configured to calculate an evidence weight value of each operation dimension according to the positive sample number and the negative sample number;

the second calculation submodule is used for calculating the information value of each operation dimension according to the evidence weight value;

and the second processing submodule is used for accumulating at least one information value of the same operation dimension in a preset time period by taking the operation dimension as a limiting condition to generate a dimension index corresponding to each operation dimension.

Optionally, the operation data includes a continuous variable, and the network fault root cause detection apparatus further includes:

the third processing submodule is used for carrying out discretization processing on the continuous variables in the operating data according to a preset box dividing rule and converting the continuous variables into a plurality of boxes;

and the second counting submodule is used for counting the positive sample number and the negative sample number in the positive sample number set and the negative sample number set of the plurality of sub-boxes respectively.

the first determining submodule is used for determining the operation dimension of each operation dimension, of which the dimension index is larger than a preset second threshold value, as a target operation dimension;

the first sequencing submodule is used for performing descending sequencing on at least one information value corresponding to the target operation dimension by taking the numerical value of the information value as a sequencing condition;

and the first judgment submodule is used for determining the target operation dimension with the maximum information value and the corresponding evidence weight value larger than zero as the aggregative dimension according to the arrangement result.

a fourth processing submodule, configured to determine a root cause node corresponding to the aggregative dimension, where the root cause node is a physical node that generates the operation data corresponding to the aggregative dimension;

and the first execution submodule is used for outputting the label information of the root cause node.

The present application provides a computer device, which includes a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the network failure root cause detection method.

The present application provides a non-volatile storage medium for solving the above technical problems, which stores a computer program implementing the network failure root cause detection method; when the computer program is called by a computer, it executes the steps included in the method.

The beneficial effects of the embodiment of the application are that:

operation data representing operation states on a network transmission link is collected and divided into a positive sample number set and a negative sample number set according to a marking rule, where the positive sample number set records abnormal operation data and the negative sample number set records normal operation data; the dimension index of each operation state is calculated from the positive sample number set and the negative sample number set, and the dimension whose index shows the larger influence on the network transmission link is defined as the aggregative dimension, where the channel number of the operation state corresponding to the aggregative dimension is the error factor affecting the network transmission link. In this way, the causes of errors that reduce the transmission efficiency of the network transmission link can be located quickly, which improves locating efficiency and saves labor cost.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic basic flow chart of a network fault root cause detection method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart illustrating the collection of operational data according to early warning information according to an embodiment of the present application;

FIG. 3 is a schematic flow chart illustrating a process of calculating a dimension index for each operation dimension by using a positive sample number set and a negative sample number set according to an embodiment of the present application;

FIG. 4 is a schematic flow chart illustrating data binning according to a binning rule according to an embodiment of the present application;

FIG. 5 is a schematic flow chart illustrating the determination of an aggregative dimension according to an embodiment of the present application;

FIG. 6 is a flowchart illustrating an output root cause node according to an embodiment of the present application;

FIG. 7 is a schematic diagram illustrating a basic structure of a network failure root cause detection apparatus according to an embodiment of the present application;

FIG. 8 is a block diagram of a basic structure of a computer device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As will be understood by those skilled in the art, a "terminal" as used herein includes both devices having only a wireless signal receiver without transmit capability and devices having receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device with a single-line display, a multi-line display, or no multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "terminal" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "terminal" used herein may also be a communication terminal, a web-enabled terminal, or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, and may also be a smart TV, a set-top box, or the like.

Referring to FIG. 1, FIG. 1 is a basic flow chart of the network fault root cause detection method according to the embodiment.

As shown in fig. 1, a method for detecting a network failure root cause includes:

Step S1100, collecting state information of a target link of audio and video transmission, wherein the state information is used for representing operation data of at least one operation state of the target link;

In a specific live network audio and video scenario, a complete audio and video transmission link covers the path from the anchor (broadcaster) device to the viewer device: audio and video streams are captured at the anchor end and uploaded to an avp (Attribute-Value Pair, data identification name) server, picture mixing and transcoding are then performed, and the streams are finally delivered to the viewer end through channels such as a Content Delivery Network (CDN), forming a complete audio and video transmission.

In this scenario, the terminals involved in data transmission are the anchor device, the server side, and the viewer side. When audio and video data is transmitted among these three, a blockage or delay at any device or on any network transmission link along the serial path causes the audio and video transmission to stall or lag.

The target link comprises the anchor device, the server side, the viewer side, and the network links among these devices.

During audio and video data transmission, the state information of the target link is collected in real time or at scheduled intervals.

In normal use, data transmission on the target link can stall or lag for different reasons, so state information of different dimensions needs to be collected. Examples include operation data that affects the transmission efficiency of the target link, such as capture delay, data accumulation delay, packet loss rate, end-to-end delay, or rendering delay. Each type of operation data represents one operation state of the target link.

Generally, to facilitate subsequent analysis, the operational data also includes device information of the physical node that generated the state information.
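One possible shape for a collected sample is sketched below in Python. The `OperationRecord` class, its field names, and the `group_by_dimension` helper are hypothetical and chosen only for illustration; the application does not prescribe a record layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class OperationRecord:
    timestamp: float   # collection time in seconds (assumed unit)
    dimension: str     # type of operation data, e.g. "packet_loss_rate"
    value: float       # measured value for that dimension
    node_label: str    # device information of the physical node that produced it


def group_by_dimension(records: Iterable[OperationRecord]) -> Dict[str, List[OperationRecord]]:
    """Group records so that each operation dimension can be analysed separately."""
    grouped: Dict[str, List[OperationRecord]] = defaultdict(list)
    for record in records:
        grouped[record.dimension].append(record)
    return dict(grouped)
```

Keeping the node label on every record is what later allows an aggregative dimension to be traced back to a device.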

Step S1200, dividing the operation data into a positive sample number set and a negative sample number set according to a preset marking rule;

This step divides the collected operation data for the operation state of each dimension in the target link into a positive sample number set and a negative sample number set.

Taking video stalling as an example, it is first determined whether each audio/video stream record stored in real time is abnormal. If the collected operation data is a discontinuous variable, its enumerated values are generally reduced to a binary variable according to their business meaning; for instance, whether a stall occurred is converted into a positive or negative state mark. If the operation data is a continuous variable, it is generally converted into positive and negative state marks by an alarm threshold; for instance, a video mic-link delay greater than the threshold is marked as excessive delay and otherwise as normal. Abnormal data is generally treated as positive sample data.

As the above example shows, when dividing the positive and negative sample number sets, the marking rule, either enumerated-value marking or threshold marking, is applied to collected operation data of the same type from different time periods. For mathematical convenience, abnormal operation data is marked as positive sample data and normal data as negative sample data; for example, a stall is marked as positive sample data and normal operation as negative sample data. The positive or negative sample data from multiple time periods is then gathered into sets, generating the positive sample number set and the negative sample number set of the operation data.

Step S1300, calculating a dimension index of each operation dimension of the target link based on the positive sample number set and the negative sample number set, wherein the dimension index is used for representing the influence factor of each operation dimension on the operation state of the target link;

Different types of operation data represent different operation dimensions of the target link; that is, each operation state corresponds to one operation dimension. Therefore, the dimension index of each operation dimension can be calculated based on the positive sample number set and the negative sample number set corresponding to the operation data of each operation dimension.

First, the continuously varying data in each positive sample number set and negative sample number set is binned. The purpose of binning is to discretize the continuous data, using an equal-frequency method or a business definition, so that it becomes a discontinuous variable. Let $X_i$ denote a bin obtained by discretizing the variable $X$, where $i \in [1, \ldots, n]$. Positive and negative samples are counted per bin: the number of positive samples in bin $X_i$ is denoted $bad_i$ and the number of negative samples $good_i$, while $bad_{all}$ and $good_{all}$ denote the total numbers of positive and negative samples over all bins. The evidence weight value of each bin is

$$WOE_i = \ln\left(\frac{bad_i / bad_{all}}{good_i / good_{all}}\right)$$

and the information value of each bin is

$$IV_i = WOE_i \times \left(\frac{bad_i}{bad_{all}} - \frac{good_i}{good_{all}}\right).$$

The $IV_i$ values that satisfy the screening condition are summed, and the sum is recorded as the dimension index of the operation dimension. The dimension index of each operation dimension represents the influence factor of that dimension on the normal operation of the target link: the greater the IV value, the greater the impact on the normal operation of the target link. It should be noted that the screening conditions on $WOE_i$ and $IV_i$ differ between embodiments; for example, the screening condition may refer to $WOE_i$ being less than 1, 2, 3, or any other real number, or to the total number of positive and negative samples in the bin being less than 1, 2, 3, or any other real number.
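A minimal Python sketch of the evidence weight and information value calculation for one operation dimension follows. It takes per-bin positive and negative counts and returns the per-bin WOE and IV values plus their sum. The `woe_iv` name and the `smoothing` parameter are assumptions of this sketch: smoothing is added only so that empty bins do not cause a division by zero or a logarithm of zero, and it is not described in the application. No screening condition is applied here.

```python
import math
from typing import List, Sequence, Tuple


def woe_iv(bad: Sequence[int], good: Sequence[int],
           smoothing: float = 0.5) -> Tuple[List[float], List[float], float]:
    """Return (WOE per bin, IV per bin, summed IV) for one operation dimension.

    bad[i] and good[i] are the positive and negative sample counts of bin i.
    smoothing is a hypothetical additive constant; pass 0 to apply the plain
    formulas, which requires every count to be non-zero.
    """
    n = len(bad)
    bad_all = sum(bad) + smoothing * n
    good_all = sum(good) + smoothing * n
    woe: List[float] = []
    iv: List[float] = []
    for bad_i, good_i in zip(bad, good):
        bad_share = (bad_i + smoothing) / bad_all
        good_share = (good_i + smoothing) / good_all
        woe_i = math.log(bad_share / good_share)
        woe.append(woe_i)
        iv.append(woe_i * (bad_share - good_share))
    return woe, iv, sum(iv)
```

Because the two factors of each IV term always share a sign, every per-bin IV is non-negative; the sign of WOE is what indicates whether a bin is over-represented among abnormal samples.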

Step S1400, determining an aggregative dimension influencing the operation state of the target link according to the dimension index.

After the dimension indexes of all operation dimensions are calculated, the dimension indexes are sorted and the operation dimension with the largest dimension index is determined as the aggregative dimension. In some embodiments, the dimensions are sorted in descending order of IV value to find the one with the greatest influence on the target link, and the dimension that ranks first by IV and whose information value is greater than 0 is selected as the aggregative dimension.
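The selection step can be sketched in Python as a filter followed by a descending sort. The function name `pick_aggregative_dimension` and the `min_index` parameter (standing in for a preset lower bound, 0 by default) are assumptions for illustration only.

```python
from typing import Dict, Optional


def pick_aggregative_dimension(dimension_index: Dict[str, float],
                               min_index: float = 0.0) -> Optional[str]:
    """Return the dimension with the largest index above min_index, if any."""
    candidates = [(name, iv) for name, iv in dimension_index.items() if iv > min_index]
    if not candidates:
        return None  # no dimension stands out; nothing to report
    # Sort by index value, largest first, and take the top entry.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[0][0]
```

For example, with indexes of 0.9 for packet loss rate and 0.1 for capture delay, the sketch returns the packet loss rate dimension; if every index is at or below `min_index`, it returns `None`.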

When alarm information appears on the target link, it indicates that some part of the target link has a problem and that the transmitted data no longer meets the usual standard. At this point, data of each operation dimension in the target link is collected, the dimension index of each operation dimension is calculated, and the aggregative dimension is then determined from the dimension indexes. The aggregative dimension corresponds to the root cause node where the problem lies. Because the device information of the corresponding device is collected together with the operation data, the device information can be determined from the aggregative dimension, and the label of that device information is sent to operation and maintenance personnel so that they can quickly identify the root cause node and troubleshoot accurately.
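One way to go from the aggregative dimension back to device labels is sketched below in Python. It simply counts which nodes produced the abnormal samples in that dimension. This counting heuristic, the `root_cause_labels` name, and the `(dimension, node_label, is_abnormal)` tuple layout are assumptions of the sketch, not the procedure stated in the application.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (operation dimension, device label, whether the sample was marked abnormal)
Sample = Tuple[str, str, bool]


def root_cause_labels(samples: Iterable[Sample], aggregative_dimension: str,
                      top_n: int = 1) -> List[str]:
    """Return the device labels with the most abnormal samples in the dimension."""
    counts = Counter(
        node for dimension, node, is_abnormal in samples
        if dimension == aggregative_dimension and is_abnormal
    )
    return [node for node, _ in counts.most_common(top_n)]
```

The returned labels are what would be forwarded to operation and maintenance personnel.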

In summary, operation data representing operation states on a network transmission link is collected and divided into a positive sample number set and a negative sample number set according to a marking rule, where the positive sample number set records abnormal operation data and the negative sample number set records normal operation data. The dimension index of each operation state is calculated from the positive sample number set and the negative sample number set, and the dimension whose index shows the larger influence on the network transmission link is defined as the aggregative dimension, where the channel number of the operation state corresponding to the aggregative dimension is the error factor affecting the network transmission link. Similarly, the method can also be used to determine the aggregative dimension of the CDN, machine room, and so on to which a fault belongs. In this way, the causes of errors that reduce the transmission efficiency of the network transmission link can be located quickly, which improves locating efficiency and saves labor cost.

In some embodiments, to save network resources, the system collects the operation data of each operation dimension in the target link only when a certain index in the target link exceeds an early warning value and the system sends out warning information. Referring to fig. 2, fig. 2 is a schematic flow chart illustrating the operation data collection according to the warning information in the embodiment.

As shown in fig. 2, step S1100 includes:

Step S1110, acquiring abnormal alarm information of the target link;

On an audio and video transmission target link in normal operation, the index parameters of the operation states in the target link are collected and compared with the early warning threshold of each index; when an index parameter exceeds the early warning threshold or falls outside the early warning interval, the monitoring system issues abnormal alarm information.
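The threshold comparison can be sketched in Python as follows. Each indicator is given an allowed interval, and any value outside it raises an alarm. The `check_alarms` name, the indicator names, and the interval representation are hypothetical.

```python
from typing import Dict, List, Tuple


def check_alarms(indicators: Dict[str, float],
                 limits: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of indicators whose value lies outside (low, high)."""
    alarms = []
    for name, value in indicators.items():
        if name not in limits:
            continue  # no early warning rule configured for this indicator
        low, high = limits[name]
        if value < low or value > high:
            alarms.append(name)
    return sorted(alarms)
```

A one-sided threshold can be expressed by setting the other bound to `float("-inf")` or `float("inf")`.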

For example, when the stall rate of the target link suddenly increases, abnormal alarm information is generated. The indicator types that can trigger abnormal alarm information are not limited to this; depending on the application scenario, in some embodiments they include but are not limited to average delay, packet loss rate, and transmission rate.

Step S1120, collecting the state information of the target link according to the abnormal alarm information.

After the abnormal alarm information sent by the monitoring system is received, the state information of the target link of the audio and video transmission is collected as described in step S1100.

Collecting the state information of each operation state of the target link only after the abnormal alarm information is received reduces the network resource occupancy that real-time collection of state information would cause.

In some embodiments, the occupancy of network resources is further reduced and computing efficiency is improved as follows. Through statistical analysis of historical data, the root cause nodes or causes behind different kinds of abnormal alarm information are identified, and a mapping list between abnormal alarm information types and root cause nodes is established. When abnormal alarm information occurs, its type is extracted, and the possible root cause nodes are looked up by that type. Finally, the operation data of those possible root cause nodes is extracted in a targeted manner for network fault root cause detection. Because the root cause nodes are screened once before data collection in this embodiment, the volume of collected operation data is reduced, operating efficiency improves, and the cause of the abnormal alarm information can be determined more quickly, improving troubleshooting efficiency.
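The mapping list and the targeted extraction could look like the Python sketch below. The mapping contents, node labels, and function names are invented for illustration; in practice the mapping would be derived from historical incident data. Falling back to all records for an unknown alarm type is also an assumption of this sketch.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical mapping list from alarm type to candidate root cause nodes.
ALARM_TO_NODES: Dict[str, List[str]] = {
    "stall_rate": ["cdn-edge-01", "transcoder-02"],
    "packet_loss_rate": ["uplink-gw-03"],
}

Record = Tuple[str, str, float]  # (node_label, operation dimension, value)


def candidate_nodes(alarm_type: str,
                    mapping: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return the nodes to inspect, or None when the alarm type is unknown."""
    return mapping.get(alarm_type)


def select_records(records: Iterable[Record],
                   nodes: Optional[List[str]]) -> List[Record]:
    """Keep records from candidate nodes; keep everything when nodes is None."""
    if nodes is None:
        return list(records)
    wanted = set(nodes)
    return [record for record in records if record[0] in wanted]
```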

In some embodiments, different marking rules need to be used for marking for different types of operational data.

When the operation data is a non-continuous variable, the operation data is divided into a positive sample number set and a negative sample number set according to a preset enumeration value. Enumerated values define an ordered set by predefining identifiers that list all values, the order of which is consistent with the order of the identifiers in the enumerated type specification. For example, the operation data indicating whether the target link is stuck is a non-continuous variable; when marking is performed, the enumerated value is defined as [-1, 1], where -1 indicates not stuck and 1 indicates stuck. The operation data on the stuck state over a continuous period, for example 10 minutes, is all converted into binary variables by means of the enumerated values.

When the operation data is a continuous variable, the operation data is divided into a positive sample number set and a negative sample number set according to threshold marks. For example, if the video mic-linking (co-streaming) delay is greater than a preset threshold, the delay is marked as too high; otherwise, the delay is marked as normal. In particular, the abnormal data is generally constructed as the positive sample number set, and the normal data as the negative sample number set.
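The two marking rules can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `label_by_enum` and `label_by_threshold` are hypothetical, the abnormal enumeration value is assumed to be 1, and a continuous value is assumed to be abnormal when it is strictly greater than the threshold.

```python
from typing import Iterable, List, Tuple


def label_by_enum(values: Iterable[int], abnormal_value: int = 1) -> Tuple[List[int], List[int]]:
    """Split non-continuous data into (positive, negative) sets by enumeration value."""
    items = list(values)
    positives = [v for v in items if v == abnormal_value]  # abnormal samples
    negatives = [v for v in items if v != abnormal_value]  # normal samples
    return positives, negatives


def label_by_threshold(values: Iterable[float], threshold: float) -> Tuple[List[float], List[float]]:
    """Split continuous data into (positive, negative) sets by a preset threshold."""
    items = list(values)
    positives = [v for v in items if v > threshold]   # e.g. delay too high
    negatives = [v for v in items if v <= threshold]  # e.g. delay normal
    return positives, negatives
```

For instance, `label_by_threshold([120.0, 450.0, 80.0], 300.0)` places 450.0 in the positive set and the other two values in the negative set.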

In some embodiments, it is desirable to compute a dimension index for each operating dimension from the positive and negative sample number sets. Referring to fig. 3, fig. 3 is a schematic flow chart illustrating a process of calculating a dimension index of each operation dimension according to the positive sample number set and the negative sample number set in the present embodiment.

As shown in fig. 3, step S1300 includes:

Step S1310, counting the positive sample number and the negative sample number corresponding to the positive sample number set and the negative sample number set in each operation dimension;

The positive sample number and the negative sample number in the positive sample number set and the negative sample number set corresponding to each operation dimension are counted. The positive sample number refers to the total number of positive samples in the positive sample number set, and the negative sample number refers to the total number of negative samples in the negative sample number set.

In some embodiments, for the operation dimension in which the operation data is a continuous variable, the data of the continuous variable in the positive sample number set and the negative sample number set needs to be subjected to binning processing when the step is performed. The purpose of the binning processing is to convert continuously changing data into non-continuous variables by discretizing the continuously changing data by adopting an equal frequency method or based on business definition.

Step S1320, calculating an evidence weight value of each operation dimension according to the positive sample number and the negative sample number;

Let X_i denote a bin of variable X after discretization, or a value of non-continuous operation data, in the positive sample number set and the negative sample number set, where i ∈ [1, ..., n]. Positive and negative samples are aggregated by X_i: the positive sample number of each bin X_i is denoted bad_i and the negative sample number is denoted good_i. The evidence weight value of each bin is then calculated as WOE_i = ln((bad_i / bad_all) / (good_i / good_all)), where bad_all and good_all are the total positive sample number and the total negative sample number over all bins.
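The evidence weight calculation can be sketched as below. This is a minimal sketch, not the applicant's implementation: the function name `woe_per_bin` and the input layout (bin name mapped to a `(bad_i, good_i)` pair) are assumptions, and bins with a zero count are skipped here because the logarithm is undefined for them, a choice the application does not specify.

```python
import math
from typing import Dict, Tuple


def woe_per_bin(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Compute WOE_i = ln((bad_i / bad_all) / (good_i / good_all)) for each bin."""
    bad_all = sum(bad for bad, _ in counts.values())
    good_all = sum(good for _, good in counts.values())
    result: Dict[str, float] = {}
    for name, (bad, good) in counts.items():
        # Assumption: skip bins where the logarithm would be undefined.
        if bad == 0 or good == 0:
            continue
        result[name] = math.log((bad / bad_all) / (good / good_all))
    return result
```

A positive value means the bin holds a larger share of the abnormal samples than of the normal samples; a negative value means the opposite.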

Step S1330, calculating the information value of each operation dimension according to the evidence weight value;

After the evidence weight value of each bin is obtained through calculation, the information value of each bin is calculated, denoted IV_i = WOE_i × ((bad_i / bad_all) - (good_i / good_all)).
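The per-bin information value can be sketched as follows, assuming all four counts are non-zero. The function name `information_value` and its parameter names are hypothetical; the formula itself follows the WOE and IV definitions given in the text.

```python
import math


def information_value(bad_i: int, good_i: int, bad_all: int, good_all: int) -> float:
    """Compute IV_i = WOE_i * ((bad_i / bad_all) - (good_i / good_all)) for one bin."""
    bad_share = bad_i / bad_all
    good_share = good_i / good_all
    woe_i = math.log(bad_share / good_share)
    # WOE_i and (bad_share - good_share) always have the same sign,
    # so the information value of a bin is never negative.
    return woe_i * (bad_share - good_share)
```

Because the information value is non-negative whichever way a bin leans, the sign of the evidence weight value is what distinguishes a bin that concentrates abnormal samples from one that concentrates normal samples.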

In some embodiments, the parameters involved in the operation need to be filtered before the information value of each operation dimension is calculated; for example, entries whose evidence weight value is < 0 and whose total number of positive and negative samples is less than the threshold are filtered out before the information value is calculated. Screening the data before calculation reduces the amount of data to be calculated and improves the calculation efficiency.

Step S1340, accumulating at least one information value of the same operation dimension in a preset time period by taking the operation dimension as a limiting condition, and generating a dimension index corresponding to each operation dimension.

After the information values corresponding to the operation dimensions are calculated, at least one information value of the same operation dimension within a preset time period is accumulated, with the operation dimension as the limiting condition. For example, stuck operation data of the target link is collected once every two minutes, each collection generating one group of positive and negative sample number sets; after 10 groups have been collected consecutively, calculation of the information value of each group of data of the stuck operation dimension begins. The information values calculated from all the positive and negative sample number sets within the 20 minutes are accumulated to generate the dimension index of the stuck operation dimension.

In some embodiments, when performing the dimension index calculation, the input parameters of the calculation need to be filtered: information values whose evidence weight value is < 0 and whose total number of positive and negative samples is smaller than the preset threshold are filtered out, and the remaining information values of the operation dimension are summed, denoted IV = sum(IV_i).
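The accumulation in step S1340 can be sketched as a grouped sum over a time window. This is a simplified illustration: the record layout `(timestamp_seconds, dimension, iv)` and the function name `dimension_index` are assumptions, and the optional pre-filtering of input parameters is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def dimension_index(
    records: Iterable[Tuple[float, str, float]],
    window_start: float,
    window_end: float,
) -> Dict[str, float]:
    """Sum the information values of each operation dimension inside a time window."""
    totals: Dict[str, float] = defaultdict(float)
    for timestamp, dimension, iv in records:
        # Only information values collected inside the preset period are accumulated.
        if window_start <= timestamp < window_end:
            totals[dimension] += iv
    return dict(totals)
```

With one group of samples every two minutes, a 20-minute window (`window_end - window_start == 1200`) would accumulate 10 information values per operation dimension.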

In some embodiments, when the operation data representing the operation state is a continuous variable, the excessively complex and redundant data may cause complex operation and reduce the calculation efficiency. Therefore, it is necessary to perform binning processing on data in the positive and negative sample sets before counting the number of samples. Referring to fig. 4, fig. 4 is a schematic flow chart illustrating data binning performed according to the binning rule in this embodiment.

As shown in fig. 4, step S1310 includes:

Step S1311, according to a preset binning rule, performing discretization processing on continuous variables in the operation data to convert the continuous variables into a plurality of bins;

For an operation dimension whose operation data is a continuous variable, the continuous-variable data in the positive sample number set and the negative sample number set needs to be binned when this step is performed. The purpose of binning is to convert continuously changing data into non-continuous variables by discretizing it using an equal frequency method or a business definition.

Step S1312, counting the positive sample number and the negative sample number in the positive sample number set and the negative sample number set of each of the plurality of bins.

Continuously changing data in each positive sample number set and each negative sample number set is binned; the purpose of binning is to convert the continuously changing data into non-continuous variables by discretizing it using an equal frequency method or a business definition. Let X_i denote a bin of variable X after discretization, where i ∈ [1, ..., n]; positive and negative samples are aggregated by X_i.

By performing box separation processing on the running data of the continuous variable, the complexity and the redundancy of the continuous data are reduced, and the data processing efficiency is improved.
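Equal frequency binning, one of the two binning options named above, can be sketched as follows. The function name `equal_frequency_bins` is hypothetical, and this simple rank-based version may place equal values in different bins, which a production implementation would likely want to avoid.

```python
from typing import List, Sequence


def equal_frequency_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign each value a bin index so that bins hold roughly equal numbers of samples."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    # Rank the positions by value, then cut the ranking into n_bins equal slices.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, position in enumerate(order):
        bins[position] = rank * n_bins // len(values)
    return bins
```

After binning, the positive and negative samples are counted per bin index, exactly as for a non-continuous variable.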

In some embodiments, after the dimension indexes of each operation dimension are calculated, an aggregation dimension that affects the normal operating state of the target link needs to be determined from the plurality of dimension indexes, and a device represented by the aggregation dimension is a root cause node. Referring to fig. 5, fig. 5 is a schematic flow chart illustrating the determination of the aggregation dimension according to the present embodiment.

As shown in fig. 5, step S1400 includes:

Step S1410, determining the operation dimensions whose dimension indexes are larger than a preset second threshold among all the operation dimensions as target operation dimensions;

After the dimension indexes of all the operation dimensions are obtained through calculation, the dimension indexes need to be screened once to reduce the data volume of subsequent operations, and the operation dimensions remaining after screening are defined as target operation dimensions.

Specifically, in some embodiments, the second threshold is 0.5, i.e., the operation dimension with a value greater than 0.5 is the target operation dimension, and the operation dimension with a value less than or equal to 0.5 is filtered out. It should be noted that the value of the second threshold is not limited to the exemplary value, and in some embodiments, the value of the second threshold can be any number according to different application scenarios.

Step S1420, sorting at least one of the information values corresponding to the target operation dimension in a descending order by using the value of the information value as a sorting condition;

After screening, the target operation dimensions whose values are greater than the second threshold are retained, and each target operation dimension corresponds to at least one information value. The information values of the retained target operation dimensions therefore need to be sorted by value, specifically in descending order. The sorting method is not limited to this; in some embodiments the sorting can be in ascending order.

Step S1430, determining the target operation dimension with the maximum information value and a corresponding evidence weight value larger than zero as the aggregation dimension according to the sorting result.

After sorting, the target operation dimension that has the highest information value in the sorting result and whose corresponding evidence weight value is greater than zero is selected as the aggregation dimension. The aggregation dimension is not limited to one: in some embodiments, the target operation dimensions ranked TOP2, TOP3, TOP4, or TOP5 that meet the above screening condition are also defined as aggregation dimensions, wherein the target operation dimension at TOP1 is the main aggregation dimension and the other target operation dimensions are candidate aggregation dimensions. The candidate aggregation dimensions serve as alternative root causes, so that maintenance personnel can quickly determine the root cause from the other dimensions when the main aggregation dimension is not the root cause.
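Steps S1410 to S1430 can be sketched as a filter, a sort, and a selection. This is a simplified sketch: `DimensionStats`, `select_aggregation_dimensions`, and `top_k` are hypothetical names, and each operation dimension is assumed to be summarized by a single information value and a single evidence weight value, whereas the text allows several information values per dimension.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DimensionStats:
    dimension_index: float    # accumulated information value (step S1340)
    information_value: float  # value used for sorting (step S1420)
    evidence_weight: float    # must be > 0 to qualify (step S1430)


def select_aggregation_dimensions(
    stats: Dict[str, DimensionStats],
    second_threshold: float = 0.5,
    top_k: int = 1,
) -> List[str]:
    """Return up to top_k aggregation dimensions, main aggregation dimension first."""
    # S1410: keep dimensions whose dimension index exceeds the second threshold.
    targets = [(name, s) for name, s in stats.items() if s.dimension_index > second_threshold]
    # S1420: sort by information value in descending order.
    targets.sort(key=lambda item: item[1].information_value, reverse=True)
    # S1430: keep only dimensions whose evidence weight value is greater than zero.
    qualified = [name for name, s in targets if s.evidence_weight > 0]
    return qualified[:top_k]
```

With `top_k=1` only the main aggregation dimension is returned; a larger `top_k` also returns the candidate aggregation dimensions in ranked order.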

In some embodiments, after the aggregation dimension is determined, the corresponding root cause node needs to be output according to the aggregation dimension, so that maintenance personnel can remove the fault conveniently and quickly. Referring to fig. 6, fig. 6 is a schematic flow chart of outputting the root cause node according to the present embodiment.

As shown in fig. 6, step S1400 is followed by:

Step S1510, determining a root cause node corresponding to the aggregation dimension, wherein the root cause node is a physical node that generates the operation data corresponding to the aggregation dimension;

In this embodiment, when the operation data of the operation state is collected, the device information of the device generating the operation data is collected synchronously and used as the label information of the operation data. As the operation data is converted into positive and negative sample number sets, dimension indexes and aggregation dimensions, each piece of data carries the label information throughout the calculation process. After the aggregation dimension is determined, the corresponding device information can be determined by checking the label information corresponding to the aggregation dimension.

In this embodiment, the root cause node is a physical node and refers to the device that generates the operation data corresponding to the aggregation dimension. Root cause nodes include (without limitation) physical nodes such as a live broadcast channel number, a server room, a CDN deployment site, or a physical link site. Here, the individual devices that make up a channel are collectively referred to as the channel number.

Step S1520, outputting the label information of the root cause node.

After the root cause node corresponding to the aggregation dimension is determined, the root cause node needs to be output to the corresponding maintenance personnel, so that they can troubleshoot the faults existing in the root cause node. To make the information easy for maintenance personnel to obtain, the label information of the root cause node is output. The label information includes the device information of the device designated by the root cause node, the channel number of the failed channel, or the specific location and device information of the failed physical node.
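The label lookup in steps S1510 and S1520 can be sketched as below. The `DeviceTag` fields and the function name `root_cause_tags` are hypothetical; the application only states that device information travels with the operation data as label information.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DeviceTag:
    node_type: str   # e.g. "channel", "server_room", "cdn_site", "physical_link"
    identifier: str  # e.g. a channel number or device identifier
    location: str    # specific location of the physical node


def root_cause_tags(
    aggregation_dimensions: List[str],
    tags_by_dimension: Dict[str, DeviceTag],
) -> List[DeviceTag]:
    """Look up the label information carried by each aggregation dimension."""
    # Assumption: dimensions without recorded label information are skipped.
    return [
        tags_by_dimension[dimension]
        for dimension in aggregation_dimensions
        if dimension in tags_by_dimension
    ]
```

The returned tags keep the ranking of the aggregation dimensions, so the first entry corresponds to the main aggregation dimension.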

The present application may construct a corresponding apparatus by running, in a computer, an application program that implements the embodiments of the foregoing method. Referring to fig. 7, fig. 7 is a schematic diagram of a basic structure of a network fault root cause detection apparatus according to this embodiment.

As shown in fig. 7, a network fault root cause detection apparatus includes: an acquisition module 2100, a marking module 2200, a processing module 2300, and an analysis module 2400. The acquisition module 2100 is configured to acquire state information of a target link for audio and video transmission, where the state information is used to represent operation data of at least one operation state of the target link; the marking module 2200 is configured to divide the operation data into a positive sample number set and a negative sample number set according to a preset marking rule; the processing module 2300 is configured to calculate a dimension index of each operation dimension of the target link based on the positive sample number set and the negative sample number set, where the dimension index is used to characterize an influence factor of each operation dimension on the operation state of the target link; the analysis module 2400 is configured to determine, according to the dimension index, an aggregation dimension that affects the operation state of the target link.

The network fault root cause detection device collects operation data characterizing operation states in a network transmission link and divides the collected operation data into a positive sample number set and a negative sample number set according to a marking rule, wherein the positive sample number set records abnormal operation data and the negative sample number set records normal operation data. The dimension index of each operation state is calculated according to the positive sample number set and the negative sample number set, and the operation dimension whose dimension index indicates a larger influence on the network transmission link is defined as the aggregation dimension, wherein the channel number of the operation state corresponding to the aggregation dimension is an error factor influencing the network transmission link. In this way, the error causes affecting the transmission efficiency of the network transmission link can be located quickly, the locating efficiency is improved, and the labor cost is saved.

In some embodiments, the network fault root cause detection apparatus further comprises: a first acquisition submodule and a first processing submodule. The first acquisition submodule is used for acquiring the abnormal alarm information of the target link; and the first processing submodule is used for collecting the state information of the target link according to the abnormal alarm information.

In some embodiments, the marking rule comprises enumeration value marks and threshold marks, and the network fault root cause detection device further comprises: a first marking submodule and a second marking submodule. The first marking submodule is used for dividing the operation data into a positive sample number set and a negative sample number set according to a preset enumeration value when the operation data is a non-continuous variable; the second marking submodule is used for dividing the operation data into a positive sample number set and a negative sample number set according to a preset first threshold when the operation data is a continuous variable.

In some embodiments, the network fault root cause detection apparatus further comprises: a first statistics submodule, a first calculation submodule, a second calculation submodule and a second processing submodule. The first statistics submodule is used for counting the positive sample number and the negative sample number corresponding to the positive sample number set and the negative sample number set in each operation dimension; the first calculation submodule is used for calculating the evidence weight value of each operation dimension according to the positive sample number and the negative sample number; the second calculation submodule is used for calculating the information value of each operation dimension according to the evidence weight value; and the second processing submodule is used for accumulating at least one information value of the same operation dimension in a preset time period by taking the operation dimension as a limiting condition to generate a dimension index corresponding to each operation dimension.

In some embodiments, the operation data includes continuous variables, and the network fault root cause detection apparatus further includes: a third processing sub-module and a second statistics sub-module. The third processing submodule is used for carrying out discretization processing on continuous variables in the operating data according to a preset box dividing rule and converting the continuous variables into a plurality of boxes; the second counting submodule is used for counting the positive sample number and the negative sample number in the positive sample number set and the negative sample number set of the plurality of sub-boxes respectively.

In some embodiments, the network fault root cause detection apparatus further comprises: a first determination submodule, a first sorting submodule and a first judgment submodule. The first determination submodule is used for determining the operation dimensions whose dimension indexes are larger than a preset second threshold among the operation dimensions as target operation dimensions; the first sorting submodule is used for sorting at least one information value corresponding to the target operation dimension in descending order by taking the numerical value of the information value as the sorting condition; and the first judgment submodule is used for determining the target operation dimension with the maximum information value and a corresponding evidence weight value larger than zero as the aggregation dimension according to the sorting result.

In some embodiments, the network fault root cause detection apparatus further comprises: a fourth processing submodule and a first execution submodule. The fourth processing submodule is used for determining a root cause node corresponding to the aggregation dimension, wherein the root cause node is a physical node that generates the operation data corresponding to the aggregation dimension; the first execution submodule is used for outputting the label information of the root cause node.

In order to solve the foregoing technical problem, an embodiment of the present application further provides a computer device, configured to run a computer program implemented according to the network failure root cause detection method. Referring to fig. 8, fig. 8 is a block diagram of a basic structure of a computer device according to the present embodiment.

As shown in fig. 8, the internal structure of the computer device is schematically illustrated. The computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions, when executed by the processor, can enable the processor to implement a network fault root cause detection method. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform a method of network fault root cause detection. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In this embodiment, the processor is configured to execute specific functions of the acquisition module 2100, the marking module 2200, the processing module 2300 and the analysis module 2400 in fig. 7, and the memory stores program codes and various data required for executing the above modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data necessary for executing all the submodules in the network failure cause detection device, and the server can call the program codes and data of the server to execute the functions of all the submodules.

The computer device collects operation data characterizing operation states in a network transmission link and divides the collected operation data into a positive sample number set and a negative sample number set according to a marking rule, wherein the positive sample number set records abnormal operation data and the negative sample number set records normal operation data. The dimension index of each operation state is calculated according to the positive sample number set and the negative sample number set, and the operation dimension whose dimension index indicates a larger influence on the network transmission link is defined as the aggregation dimension, wherein the channel number of the operation state corresponding to the aggregation dimension is an error factor influencing the network transmission link. In this way, the error causes affecting the transmission efficiency of the network transmission link can be located quickly, the locating efficiency is improved, and the labor cost is saved.

The present application also provides a non-volatile storage medium. The network fault root cause detection method is written as a computer program and stored in the storage medium in the form of computer readable instructions; when the computer readable instructions are executed by one or more processors, the program runs in a computer, thereby causing the one or more processors to execute the steps of the network fault root cause detection method according to any one of the above embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and which, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.

The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims

1. A method for detecting a network fault root cause is characterized by comprising the following steps:

2. The method of claim 1, wherein determining the aggregation dimension that affects the operational status of the target link according to the dimension indicator comprises:

and outputting the label information of the root cause node.

3. The method according to claim 1, wherein the collecting status information of a target link of audio/video transmission comprises:

acquiring abnormal alarm information of a target link;

4. The method of claim 1, wherein the marking rule comprises: enumerating value flags and threshold flags;

5. The method according to claim 1, wherein the calculating the dimension index of each operation dimension of the target link comprises:

6. The method according to claim 5, wherein the operation data includes continuous variables, and the counting the positive and negative samples in the positive and negative sample sets for each operation dimension comprises:

7. The method according to claim 5 or 6, wherein the determining the aggregation dimension affecting the operation status of the target link according to the dimension index comprises:

8. A network fault root cause detection apparatus, comprising:

9. A computer device comprising a memory and a processor, wherein computer readable instructions are stored in the memory, which computer readable instructions, when executed by the processor, cause the processor to perform the steps of the network fault root cause detection method according to any one of claims 1 to 7.

10. A non-volatile storage medium, characterized in that it stores a computer program implemented by the method for detecting a root cause of a network failure according to any one of claims 1 to 7, which computer program, when invoked by a computer, performs the steps comprised by the method.