CN108683530B - Data analysis method and device for multi-dimensional data and storage medium - Google Patents


Info

Publication number
CN108683530B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810400910.7A
Other languages
Chinese (zh)
Other versions
CN108683530A (en
Inventor
陈云
陈宇
李聪
王博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810400910.7A
Publication of CN108683530A
Application granted
Publication of CN108683530B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0636 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis based on a decision tree analysis
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Debugging And Monitoring (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Embodiments of the invention provide a data analysis method and device for multi-dimensional data and a computer-readable storage medium. The data analysis method comprises the following steps: acquiring a normal flow value and an abnormal flow value of each dimension in a dimension combination of multi-dimensional data; inputting the dimension combination and its normal and abnormal flow values into a decision tree, and using the decision tree to screen suspected root cause dimensions from the dimension combination; calculating, for each suspected root cause dimension, its contribution degree and the consistency degree of the loss degrees of its sub-dimensions; and identifying, according to the calculated contribution degree and consistency degree, whether the suspected root cause dimension is a root cause dimension, wherein a root cause dimension is the data dimension corresponding to the root cause of the flow loss. According to the embodiments, when a fault occurs the root cause dimension can be quickly determined from the multi-dimensional data of the fault indicator, which saves the time operation and maintenance personnel spend locating the fault and reduces the loss caused by the fault.

Description

Data analysis method and device for multi-dimensional data and storage medium
Technical Field
The present invention relates to the field of information technologies, and in particular, to a method and an apparatus for analyzing multidimensional data, and a computer-readable storage medium.
Background
In order to better understand and analyze the running state of a service in real time, internet companies usually attach as many attribute tags as possible when collecting monitoring data, such as UA (User Agent), network system, geographical location, and so on. These tags describe the data from different angles, or dimensions, and the descriptive information of the different dimensions gives the collected data strong expressive power. The collected data thus becomes multi-dimensional data.
Currently, fault localization using multi-dimensional data is done mainly by manually inspecting and comparing data across dimensions to find the dimensions with an obviously abnormal degree. When a fault occurs, judging it manually from the multi-dimensional data requires experienced staff, and the process is slow because it involves reviewing trend graphs for many data series before reaching an overall judgment. When the number of data dimensions is large, the localization time increases sharply, and the failure to localize the fault and stop the loss quickly results in large losses.
Disclosure of Invention
Embodiments of the present invention provide a data analysis method and apparatus for multidimensional data, and a computer-readable storage medium, so as to at least solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a data analysis method for multidimensional data, including: acquiring a normal flow value and an abnormal flow value of each dimension in a dimension combination of multi-dimension data; inputting a dimension combination of multi-dimensional data and a normal flow value and an abnormal flow value of the dimension combination into a decision tree, and screening suspected root cause dimensions from the dimension combination of the multi-dimensional data by using the decision tree; calculating the contribution degree and the sub-dimension loss degree consistency degree of the suspected root cause dimension; and identifying whether the suspected root cause dimension is the root cause dimension according to the calculated contribution degree and the calculated consistency degree of the loss degree of the sub-dimension of the suspected root cause dimension, wherein the root cause dimension is the data dimension corresponding to the root cause causing the flow loss.
With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining a normal flow value and an abnormal flow value of each dimension of the multidimensional data includes: monitoring a total flow of the multidimensional data; and if the total flow of the multi-dimensional data in a preset time period is monitored to have flow loss, acquiring a normal flow value and an abnormal flow value of each dimension of the multi-dimensional data in the preset time period.
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the obtaining a normal flow value and an abnormal flow value of each dimension of the multidimensional data in the preset time period includes: and determining the difference value between the acquired flow data value of each dimension in the preset time period and the acquired flow data value of each dimension in the specified time period as the abnormal flow value of each dimension.
With reference to the first implementation manner of the first aspect, in a third implementation manner of the first aspect, the acquiring a normal flow value and an abnormal flow value of each dimension of the multidimensional data in the preset time period includes: counting the times of failed accesses of all dimensions in the preset time period, wherein the accesses which do not receive reply information in the preset time period are taken as the failed accesses; and determining the number of times of access failure of each dimension as the abnormal flow value of each dimension.
With reference to the first implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the acquiring a normal flow value and an abnormal flow value of each dimension of the multidimensional data in the preset time period includes: predicting the flow data value of each dimension in the preset time period; and determining the difference value between the acquired flow data value of each dimension in the preset time period and the predicted flow data value of each dimension in the preset time period as the abnormal flow value of each dimension.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, and the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the screening out suspected root dimensions using the decision tree includes: taking the abnormal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a negative example set; balancing the weights of the positive and negative samples to make the weights of the positive and negative samples equal in the initial state; calculating the information gain rate of each dimension according to the balanced positive and negative sample weights, selecting the dimension with the largest information gain rate for division, and constructing the decision tree; and determining the path of the constructed decision tree as a suspected root dimension.
With reference to the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the balancing positive and negative example sample weights includes: and taking the product of the abnormal flow value of the dimension combination of the multi-dimensional data and a balance coefficient as the weight of the dimension combination in the positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in the negative example set, wherein the balance coefficient is the ratio of the sum of the normal flow values of all dimensions of the multi-dimensional data to the sum of the abnormal flow values of all dimensions.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, and the fourth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the identifying, according to the calculated contribution degree and the calculated consistency degree of the loss degree of the sub-dimensions of the suspected root-cause dimension, whether the suspected root-cause dimension is a root-cause dimension includes: and inputting the calculated contribution degree and the calculated consistency degree of the loss degree of the sub-dimensionality of the suspected root cause dimension into a classifier, and classifying whether the suspected root cause dimension is the root cause dimension.
In a second aspect, an embodiment of the present invention provides a data analysis apparatus for multidimensional data, including: the flow acquiring unit is used for acquiring normal flow values and abnormal flow values of all dimensions in the dimension combination of the multi-dimensional data; the dimensionality screening unit is used for inputting a dimensionality combination of multi-dimensional data and a normal flow value and an abnormal flow value of the dimensionality combination into a decision tree, and screening suspected root dimensionality from the dimensionality combination of the multi-dimensional data by using the decision tree; the characteristic calculation unit is used for calculating the contribution degree of the suspected root cause dimension and the consistency degree of the loss degree of the sub-dimension; and the identification unit is used for identifying whether the suspected root cause dimension is the root cause dimension according to the calculated contribution degree and the calculated consistency degree of the loss degree of the sub-dimension of the suspected root cause dimension, wherein the root cause dimension is the data dimension corresponding to the root cause causing the flow loss.
With reference to the second aspect, in a first implementation manner of the second aspect, the flow acquiring unit includes: a monitoring subunit, configured to monitor a total flow of the multidimensional data; and an obtaining subunit, configured to: if the total flow of the multi-dimensional data in a preset time period is monitored to have flow loss, acquire a normal flow value and an abnormal flow value of each dimension of the multi-dimensional data in the preset time period.
With reference to the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the obtaining subunit is further configured to: and determining the difference value between the acquired flow data value of each dimension in the preset time period and the acquired flow data value of each dimension in the specified time period as the abnormal flow value of each dimension.
With reference to the first implementation manner of the second aspect, in a third implementation manner of the second aspect, the obtaining subunit is further configured to: counting the times of failed accesses of all dimensions in the preset time period, wherein the accesses which do not receive reply information in the preset time period are taken as the failed accesses; and determining the number of times of access failure of each dimension as the abnormal flow value of each dimension.
With reference to the first implementation manner of the second aspect, in a fourth implementation manner of the second aspect, the obtaining subunit is further configured to: predicting the flow data value of each dimension in the preset time period; and determining the difference value between the acquired flow data value of each dimension in the preset time period and the predicted flow data value of each dimension in the preset time period as the abnormal flow value of each dimension.
With reference to the second aspect, the first implementation manner of the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, and the fourth implementation manner of the second aspect, in a fifth implementation manner of the second aspect, the dimension screening unit is further configured to: taking the abnormal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a negative example set; balancing the weights of the positive and negative samples to make the weights of the positive and negative samples equal in the initial state; calculating the information gain rate of each dimension according to the balanced positive and negative sample weights, selecting the dimension with the largest information gain rate for division, and constructing the decision tree; and determining the path of the constructed decision tree as a suspected root dimension.
With reference to the fifth implementation manner of the second aspect, in a sixth implementation manner of the second aspect, the balancing positive and negative example sample weights includes: and taking the product of the abnormal flow value of the dimension combination of the multi-dimensional data and a balance coefficient as the weight of the dimension combination in the positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in the negative example set, wherein the balance coefficient is the ratio of the sum of the normal flow values of all dimensions of the multi-dimensional data to the sum of the abnormal flow values of all dimensions.
With reference to the second aspect, the first implementation manner of the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, and the fourth implementation manner of the second aspect, in a seventh implementation manner of the second aspect, the identification unit is further configured to: and inputting the calculated contribution degree and the calculated consistency degree of the loss degree of the sub-dimensionality of the suspected root cause dimension into a classifier, and classifying whether the suspected root cause dimension is the root cause dimension.
In a third aspect, an embodiment of the present invention provides a data analysis apparatus for multidimensional data, including: one or more processors; storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as in any one of the first aspects above.
In one possible design, the data analysis apparatus for multidimensional data includes a processor and a memory. The memory is used for storing a program that supports the data analysis apparatus in executing the data analysis method for multidimensional data in the first aspect, and the processor is configured to execute the program stored in the memory. The data analysis apparatus may also comprise a communication interface, which the apparatus uses to communicate with other equipment or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to any one of the first aspect.
The technical scheme has the following advantages or beneficial effects: when a fault occurs, the root cause dimension can be quickly analyzed according to the multidimensional data of the fault index, the time for locating the fault by operation and maintenance personnel is saved, and the loss caused by the fault is reduced.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is an overall framework diagram of a data analysis method for multi-dimensional data according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating the steps of a preferred embodiment of a method for analyzing multidimensional data according to the present invention;
FIG. 3 shows a schematic diagram of a decision tree of a method of data analysis of multi-dimensional data according to an embodiment of the invention;
FIGS. 4a and 4b are schematic diagrams illustrating a decision tree structure partitioning process of a data analysis method of multi-dimensional data according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a suspected root dimension combination corpus of a method for data analysis of multi-dimensional data according to an embodiment of the invention;
FIG. 6 is an overall block diagram of a data analysis apparatus for multi-dimensional data according to an embodiment of the present invention;
FIG. 7 is a block diagram showing a configuration of a data analysis apparatus for multi-dimensional data according to another embodiment of the present invention;
FIG. 8 is a block diagram showing a data analysis apparatus for multi-dimensional data according to another embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
The embodiment of the invention provides a data analysis method of multi-dimensional data. Fig. 1 is an overall framework diagram of a data analysis method of multidimensional data according to an embodiment of the present invention. As shown in fig. 1, the data analysis method of multidimensional data according to the embodiment of the present invention includes: step S110, acquiring a normal flow value and an abnormal flow value of each dimension in a dimension combination of multi-dimension data; step S120, inputting a dimension combination of multi-dimensional data and a normal flow value and an abnormal flow value of the dimension combination into a decision tree, and screening suspected root cause dimensions from the dimension combination of the multi-dimensional data by using the decision tree; step S130, calculating the contribution degree of the suspected root cause dimension and the consistency degree of the loss degree of the sub-dimension; and step S140, identifying whether the suspected root cause dimension is the root cause dimension according to the calculated contribution degree and the consistent degree of the loss degree of the sub-dimension of the suspected root cause dimension, wherein the root cause dimension is the data dimension corresponding to the root cause causing the flow loss.
The data analysis method of the multi-dimensional data can be used to find the root cause dimension among all dimensions when a fault occurs, the root cause dimension being the dimension with an obviously abnormal degree. The following are two examples of locating root cause dimensions in multi-dimensional data.
Example one: the dimension combination comprises province and operator, where the operators include China Unicom, China Mobile, China Telecom and the like. When the service flow suffers a loss, the flow data of each dimension during the fault is read in, and the root cause dimension is quickly located from it. For example, if the flow loss is concentrated on China Telecom, the positioning result is that the root cause dimension with an obviously abnormal degree is the operator dimension.
Example two: the dimension combination comprises operating system, browser and mobile communication technology, where the operating systems include Apple, Android and the like; the browsers include Google browser, 360 browser, UC browser and the like; and the mobile communication technologies include 3G, 4G and the like. After an application is released, the total flow of data is monitored, and a fault is determined to have occurred when the total flow suffers a loss. The flow data of each dimension during the fault is then read in and the root cause dimension is quickly located from it. For example, if the application shows an obvious flow loss anomaly when used with Google browser, the root cause dimension is the browser dimension.
In a particular application, network data traffic may be monitored using traffic monitoring software. When the service flow suffers a loss, the data analysis method of multi-dimensional data can be used to quickly locate the root cause dimension, which shortens the time to stop the loss and reduces the loss caused by the fault.
According to an embodiment of the data analysis method of the multidimensional data, acquiring the normal flow value and the abnormal flow value of each dimension of the multidimensional data comprises the following steps: monitoring a total flow of the multidimensional data; and if the total flow of the multi-dimensional data in a preset time period is monitored to have flow loss, acquiring a normal flow value and an abnormal flow value of each dimension of the multi-dimensional data in the preset time period.
In this embodiment, the total flow of data is monitored, a failure is determined when the total flow is lost, and the flow data values for each dimension at the time of the failure are read. The flow data value comprises a normal flow value and an abnormal flow value, and the flow data value is the sum of the normal flow value and the abnormal flow value. The abnormal flow value of each dimension, i.e. the loss flow data value, needs to be obtained in some way, for example, by collection or prediction.
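The trigger described above can be sketched as follows. The text does not say how a loss of total flow is detected, so the baseline comparison, the function name `has_flow_loss` and the `max_drop_ratio` threshold are all hypothetical choices made only for illustration.

```python
def has_flow_loss(total_now: float, total_baseline: float,
                  max_drop_ratio: float = 0.1) -> bool:
    """Return True when the total flow dropped by more than max_drop_ratio.

    Assumption: "flow loss" is judged relative to a baseline total;
    the 10% default threshold is illustrative, not taken from the source.
    """
    if total_baseline <= 0:
        # No usable baseline, so no loss can be established.
        return False
    drop_ratio = (total_baseline - total_now) / total_baseline
    return drop_ratio > max_drop_ratio
```

When such a check fires for the preset time period, the per-dimension normal and abnormal flow values for that period would be read in for analysis.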
According to an embodiment of the data analysis method of multidimensional data of the present invention, the obtaining the normal flow value and the abnormal flow value of each dimension of the multidimensional data in the preset time period includes: and determining the difference value between the acquired flow data value of each dimension in the preset time period and the acquired flow data value of each dimension in the specified time period as the abnormal flow value of each dimension.
In this embodiment, the abnormal flow value of each dimension is obtained by collection, that is, from the flow that actually occurred. The amount by which the flow has dropped is computed as the difference between the actually collected flow and the flow data value of each dimension in a specified time period. For example, the difference between the flow data value of each dimension in the current time period and the flow data value of each dimension in the previous time period may be calculated. Alternatively, the difference between the flow data value of each dimension in the current time period and the flow data value of each dimension in the same time period of the previous day may be calculated. In an alternative embodiment, the difference from the flow data value of each dimension in the same time period a specified number of days earlier may also be calculated, where the number of days can be set, for example to a week or a month.
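A minimal sketch of this difference-based approach is given below. It assumes that the reference period (previous period, same period yesterday, and so on) has already been chosen, that the observed flow in the current period serves as the normal flow value, and that a negative drop is clipped to zero. None of these details are fixed by the text, and the names are hypothetical.

```python
from typing import Dict, Tuple

DimKey = Tuple[str, ...]  # e.g. ("Beijing", "China Telecom")


def abnormal_by_reference(
    current: Dict[DimKey, float],
    reference: Dict[DimKey, float],
) -> Dict[DimKey, Tuple[float, float]]:
    """Map each dimension combination to (normal, abnormal) flow values.

    abnormal = reference flow minus current flow, clipped at zero (assumption);
    normal   = flow actually observed in the current period (assumption).
    """
    result = {}
    for key, ref_value in reference.items():
        cur_value = current.get(key, 0.0)
        abnormal = max(ref_value - cur_value, 0.0)
        result[key] = (cur_value, abnormal)
    return result
```

With these conventions, the sum of the normal and abnormal values for a combination whose flow dropped equals its flow in the reference period.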
According to an embodiment of the data analysis method of multidimensional data of the present invention, the obtaining the normal flow value and the abnormal flow value of each dimension of the multidimensional data in the preset time period includes: counting the times of failed accesses of all dimensions in the preset time period, wherein the accesses which do not receive reply information in the preset time period are taken as the failed accesses; and determining the number of times of access failure of each dimension as the abnormal flow value of each dimension.
Specifically, another way of obtaining the abnormal flow value of each dimension by collection is to count how many requests were not processed; the number of unprocessed requests is the number of failed accesses. If an access does not receive reply information, that is, the access request was not processed, the access is considered failed. The number of failed accesses of each dimension may be determined as the abnormal flow value of that dimension. Similarly, an access that receives reply information is considered successful, and the number of successful accesses of each dimension is determined as the normal flow value of that dimension.
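The counting described above might look like the following sketch. The record layout, a pair of dimension combination and a flag saying whether reply information was received within the preset time period, is an assumption made for illustration, as is the function name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

DimKey = Tuple[str, ...]  # e.g. ("Beijing", "China Telecom")


def count_accesses(
    records: Iterable[Tuple[DimKey, bool]],
) -> Dict[DimKey, Tuple[int, int]]:
    """Map each dimension combination to (normal, abnormal) flow values.

    normal   = number of accesses that received reply information,
    abnormal = number of accesses that did not (failed accesses).
    """
    success: Counter = Counter()
    failed: Counter = Counter()
    for key, got_reply in records:
        if got_reply:
            success[key] += 1
        else:
            failed[key] += 1
    # Counter returns 0 for keys it has not seen.
    return {key: (success[key], failed[key]) for key in set(success) | set(failed)}
```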
According to an embodiment of the data analysis method of multidimensional data of the present invention, the obtaining the normal flow value and the abnormal flow value of each dimension of the multidimensional data in the preset time period includes: predicting the flow data value of each dimension in the preset time period; and determining the difference value between the acquired flow data value of each dimension in the preset time period and the predicted flow data value of each dimension in the preset time period as the abnormal flow value of each dimension.
Obtaining the abnormal flow value of each dimension by prediction works as follows: the flow that would have occurred had there been no fault is predicted, and the difference between this prediction and the actually collected flow is the abnormal flow value, i.e., the lost flow. Specifically, the periodic variation pattern of the network traffic can be gathered statistically, and the flow data value of each dimension in the current time period can be predicted from information such as the time period and/or the user browsing behavior pattern. The difference between the predicted flow data value and the actually collected flow data value is taken as the abnormal flow value.
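As a rough illustration of the prediction-based approach, the sketch below predicts the current period's flow as the mean of the same time slot on earlier days. This averaging predictor is an assumption standing in for whatever model is actually used; the text only says that the periodic pattern of the traffic is exploited. All names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

DimKey = Tuple[str, ...]  # e.g. ("Android", "UC browser", "4G")


def predict_from_history(history: Dict[DimKey, List[float]]) -> Dict[DimKey, float]:
    """Predict the flow of each combination as the mean of the same time
    slot on previous days (a simple stand-in for a periodic model)."""
    return {key: mean(values) for key, values in history.items() if values}


def abnormal_by_prediction(
    observed: Dict[DimKey, float],
    predicted: Dict[DimKey, float],
) -> Dict[DimKey, float]:
    """Abnormal flow = predicted flow minus observed flow, clipped at zero
    (the clipping is an assumption)."""
    return {
        key: max(pred - observed.get(key, 0.0), 0.0)
        for key, pred in predicted.items()
    }
```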
FIG. 2 is a flowchart illustrating the steps of a data analysis method for multidimensional data according to a preferred embodiment of the present invention. As shown in FIG. 2, according to an embodiment of the method for analyzing multidimensional data of the present invention, screening suspected root cause dimensions using the decision tree in step S120 of FIG. 1 includes: step S210, taking the abnormal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a negative example set; step S220, balancing the weights of the positive and negative example samples so that the two are equal in the initial state; step S230, calculating the information gain rate of each dimension according to the balanced positive and negative sample weights, selecting the dimension with the largest information gain rate for division, and constructing the decision tree; and step S240, determining the paths of the constructed decision tree as suspected root cause dimensions.
A decision tree is a flow-graph-like tree structure in which each internal node (non-leaf node) represents a test on an attribute, each branch represents a test output, and each leaf node holds a class label. Once the decision tree is built, for a tuple without a given class label, a path is traced from the root node to the leaf node, which stores the prediction of the tuple.
In the embodiment of the invention, suspected root cause dimensions are screened out by constructing a decision tree. The input features of the decision tree are the dimension combinations of the accesses, such as province and operator, together with their normal flow values and abnormal flow values; the output is whether a dimension combination is a positive example, i.e., whether it is a suspected root cause dimension. Model training yields a decision tree with good discrimination, and thus the complete set of suspected root cause dimension combinations, i.e., the decision tree paths. The decision tree can be constructed based on the C4.5 algorithm, and screening the suspected root cause dimensions reduces the subsequent computation for dimension feature calculation and root cause identification.
In step S210, a dimension combination d in the multi-dimensional data is regarded as a sample point. The number of failed accesses pvlost_d of dimension combination d, i.e., the abnormal flow value, is taken as the weight weight_positive_d of d in the positive example set, and the number of successful accesses pv_d of dimension combination d, i.e., the normal flow value, is taken as the weight weight_negative_d of d in the negative example set.
According to an embodiment of the data analysis method for multidimensional data, balancing the positive and negative example sample weights in step S220 includes: taking the product of the abnormal flow value of the dimension combination of the multi-dimensional data and a balance coefficient as the weight of the dimension combination in the positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in the negative example set, wherein the balance coefficient is the ratio of the sum of the normal flow values of all dimensions of the multi-dimensional data to the sum of the abnormal flow values of all dimensions.
In order to satisfy the assumption behind screening suspected root cause dimensions with the information gain rate, and to maximize the information entropy of the initial state, the positive and negative example sample weights need to be balanced so that the total positive and negative weights are equal in the initial state. In this embodiment, the final positive example weight is weight_positive_d' = pvlost_d * (pv_total / pvlost_total), and the final negative example weight is weight_negative_d' = pv_d.
For example, suppose there are only two dimension combinations, with pvlost_total = 1, pv_total = 100, pvlost_d1 = 1, pv_d1 = 10, pvlost_d2 = 0 and pv_d2 = 90. Then:
the positive example weight of sample point d1 is weight_positive_d1 = pvlost_d1 * (pv_total / pvlost_total) = 100, and its negative example weight is weight_negative_d1 = pv_d1 = 10;
similarly, d2 has a positive example weight of 0 and a negative example weight of 90. Overall, the total positive example weight is 100 and the total negative example weight is 100 in the initial state, so the information entropy of the initial state is at its maximum.
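The balancing step can be checked with a short Python sketch that reproduces the two-sample example above. The function name and the dictionary layout are assumptions made for illustration only.

```python
def balance_weights(samples):
    """Map {name: (pv, pvlost)} to {name: (positive_weight, negative_weight)}.

    The positive weight is pvlost scaled by the balance coefficient
    pv_total / pvlost_total; the negative weight is pv itself.
    """
    pv_total = sum(pv for pv, _ in samples.values())
    pvlost_total = sum(lost for _, lost in samples.values())
    if pvlost_total == 0:
        raise ValueError("no abnormal flow: the balance coefficient is undefined")
    coeff = pv_total / pvlost_total
    return {name: (lost * coeff, float(pv)) for name, (pv, lost) in samples.items()}

# d1: 10 successful and 1 failed access; d2: 90 successful and 0 failed accesses.
weights = balance_weights({"d1": (10, 1), "d2": (90, 0)})
total_positive = sum(p for p, _ in weights.values())
total_negative = sum(n for _, n in weights.values())
```

After balancing, the total positive and total negative weights are both 100, which is the condition under which the initial entropy of the two-class system reaches its maximum.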
In step S230, the training phase constructs a decision tree from a given training data set. The decision tree may be built by training based on the C4.5 algorithm. Each division uses only one dimension: at each division, the information gain rate brought by each dimension is calculated, and the feature (i.e., the dimension) whose information gain rate is the largest and greater than 0 is selected for division. Generation of a subtree stops when the entropy gain is negative, which saves the computation for that subtree. The node paths in the finally generated decision tree that are not negative examples are the suspected root cause dimensions; such node paths include those ending at non-leaf nodes.
For example, suppose there are only two dimensions: province, with the values Beijing and Shanghai, and operator, with the values Telecom and Unicom. Consider the case where Telecom is abnormal. The Telecom anomaly makes the positive example weight of Telecom (which is positively correlated with pvlost) very high, so Telecom deviates from the balanced position and its information entropy is lower than that of other, relatively balanced dimension values. The negative example weight of Unicom is very high, so Unicom also deviates from the balanced position and its information entropy is low. The information gain rate of the operator dimension is therefore higher than that of the province dimension, so the operator dimension is selected for this division, and the combination of the province and operator dimensions is not considered at this point. Here the information gain rate reflects the degree to which the mean information entropy is reduced. By analogy, a greedy method yields a group of dimension combinations that distinguish normal from abnormal well, and the pruning effect is significant.
As another example with only two dimensions, the province takes the values Beijing and Hebei, and the operator takes the values Unicom and Telecom. Table 1 shows the flow data values and weight values of the multidimensional data in this example. Table 1 contains 4 sample points: sample point d11, Beijing Unicom; sample point d12, Beijing Telecom; sample point d21, Hebei Unicom; and sample point d22, Hebei Telecom. According to the data in Table 1, the sum of the abnormal flow values pvlost_total is 100, the sum of the normal flow values pv_total is 1000, pvlost_d11 is 90, and pv_d11 is 100. It follows that the positive example weight of sample point d11 is weight_positive_d11 = pvlost_d11 * (pv_total / pvlost_total) = 900, and its negative example weight is weight_negative_d11 = pv_d11 = 100. Similarly, d12 has a positive example weight of 100 and a negative example weight of 80; d21 has a positive example weight of 0 and a negative example weight of 200; and d22 has a positive example weight of 0 and a negative example weight of 620.
TABLE 1 Flow data values and weight values of the multidimensional data

| Province | Operator | Normal flow value | Abnormal flow value | Positive example weight | Negative example weight |
|---|---|---|---|---|---|
| Beijing | Unicom | 100 | 90 | 900 | 100 |
| Beijing | Telecom | 80 | 10 | 100 | 80 |
| Hebei | Unicom | 200 | 0 | 0 | 200 |
| Hebei | Telecom | 620 | 0 | 0 | 620 |
| Total | | 1000 | 100 | 1000 | 1000 |
FIG. 3 shows a schematic diagram of a decision tree of a method of data analysis of multi-dimensional data according to an embodiment of the invention; fig. 4a and 4b are schematic diagrams illustrating a decision tree structure partitioning process of a data analysis method of multi-dimensional data according to an embodiment of the present invention. FIG. 3 is a schematic diagram of a decision tree constructed from the sample set data shown in Table 1. The specific partitioning process of the decision tree shown in fig. 3 is illustrated by fig. 4a and 4 b.
Fig. 4a is a schematic diagram of the first division of the decision tree. As shown in fig. 4a, in the first division node (1), i.e., the root node, is divided by province into node (2), Beijing, and node (3), Hebei. Specifically, based on the sample set data, i.e., the normal flow values and abnormal flow values of sample points d11, d12, d21 and d22 shown in Table 1: if the division dimension is province, the positive/negative example ratio of Beijing is 1000/180 and that of Hebei is 0/820; if the division dimension is operator, the positive/negative example ratio of Telecom is 100/700 and that of Unicom is 900/300.
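The positive/negative example totals quoted for the first division follow directly from the weight columns of Table 1. The Python sketch below recomputes them; the row layout and the helper name are illustrative assumptions.

```python
from collections import defaultdict

# Rows of Table 1: (province, operator, positive example weight, negative example weight).
ROWS = [
    ("Beijing", "Unicom", 900, 100),
    ("Beijing", "Telecom", 100, 80),
    ("Hebei", "Unicom", 0, 200),
    ("Hebei", "Telecom", 0, 620),
]

def aggregate(rows, column):
    """Sum positive and negative weights per value of the chosen column (0 or 1)."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row[column]][0] += row[2]
        totals[row[column]][1] += row[3]
    return {value: tuple(pair) for value, pair in totals.items()}

by_province = aggregate(ROWS, 0)
by_operator = aggregate(ROWS, 1)
```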
In this embodiment, the decision tree is built by training based on the C4.5 algorithm, which uses the information gain rate to select attributes. Attribute selection metrics are also called splitting rules because they determine how the tuples at a given node are split. An attribute selection metric provides a ranking for each attribute describing the given training tuples, and the attribute with the best metric score is selected as the splitting attribute for the given tuples. When a decision tree is created, many branches reflect anomalies in the training data, and pruning addresses this problem of over-fitting the data. Pruning is performed during construction of the decision tree because nodes containing very few elements may cause the constructed tree to over-fit; the tree may perform better if such nodes are not considered.
In machine learning and feature engineering, the uncertainty of information can be represented by entropy. For a random variable X taking a finite number of values, if its probability distribution is:
$$P(X = x_i) = p_i, \quad i = 1, 2, \ldots, n$$
then the entropy of the random variable X can be described by the following formula:
$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
For example, if in a classification system the class variable is c, with values c_1, c_2, …, c_n, where n is the total number of classes, then the entropy of this classification system is:
$$H(c) = -\sum_{i=1}^{n} P(c_i) \log_2 P(c_i)$$
The information gain is the reduction in entropy: the difference between the entropy of the sample set before division and the entropy of the data subsets obtained by dividing on a certain feature, that is, the information gain brought to the system once a certain feature X is fixed. When the feature X is fixed, the conditional entropy is H(c|X). The information gain brought to the system by fixing the feature X is therefore: IG(X) = H(c) - H(c|X).
The information gain rate is defined by the aforementioned information gain and the split information metric, i.e., the entropy H(X) of the feature X. The information gain rate is then:
$$\mathrm{GainRatio}(X) = \frac{IG(X)}{H(X)}$$
In the first division shown in fig. 4a, the information gain rates for division by province and for division by operator are calculated respectively. Since the information gain rate for division by province is greater than that for division by operator, division by province is selected, so that node (1) splits into sub-node (2), Beijing, and sub-node (3), Hebei.
In the second division shown in fig. 4b, whether and how node (2) and node (3) are divided is determined by calculating the information gain rate, in the same manner as for the first division. Node (2) is divided by operator into sub-node (4) and sub-node (5), one for Beijing Unicom and the other for Beijing Telecom; node (3) is not divided, since the information gain rate of division by operator is 0 there. The resulting complete set of suspected root cause dimension combinations, i.e., the decision tree paths, is shown in fig. 5.
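The two divisions can be reproduced with a small weighted decision-tree sketch in Python. It is a simplified illustration rather than a full C4.5 implementation: entropy is computed from the balanced weights of Table 1 with base-2 logarithms, the split information is the entropy of the weight shares of the child nodes, and a node is treated as "not a negative example" when its positive weight is at least its negative weight. These choices and all function names are assumptions of the sketch.

```python
import math
from collections import defaultdict

# Table 1 sample points: (attributes, positive weight, negative weight).
SAMPLES = [
    ({"province": "Beijing", "operator": "Unicom"}, 900.0, 100.0),
    ({"province": "Beijing", "operator": "Telecom"}, 100.0, 80.0),
    ({"province": "Hebei", "operator": "Unicom"}, 0.0, 200.0),
    ({"province": "Hebei", "operator": "Telecom"}, 0.0, 620.0),
]

def entropy(weights):
    """Entropy (base 2) of a distribution given as non-negative weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

def class_weights(samples):
    """Total positive and negative weight of a list of sample points."""
    return sum(s[1] for s in samples), sum(s[2] for s in samples)

def group_by(samples, dim):
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[0][dim]].append(sample)
    return groups

def gain_ratio(samples, dim):
    """Information gain of splitting on dim, divided by the split information."""
    total = sum(class_weights(samples))
    if total <= 0:
        return 0.0
    groups = list(group_by(samples, dim).values())
    sizes = [sum(class_weights(g)) for g in groups]
    conditional = sum(n / total * entropy(class_weights(g)) for n, g in zip(sizes, groups))
    split_info = entropy(sizes)
    gain = entropy(class_weights(samples)) - conditional
    return gain / split_info if split_info > 0 else 0.0

def suspected_paths(samples, dims, path=()):
    """Greedily split on the best dimension and collect non-negative node paths."""
    positive, negative = class_weights(samples)
    found = [path] if path and positive >= negative else []
    ratios = {d: gain_ratio(samples, d) for d in dims}
    if not ratios:
        return found
    best = max(ratios, key=ratios.get)
    if ratios[best] <= 1e-12:  # no useful split: stop growing this subtree
        return found
    remaining = [d for d in dims if d != best]
    for value, group in group_by(samples, best).items():
        found += suspected_paths(group, remaining, path + ((best, value),))
    return found

province_ratio = gain_ratio(SAMPLES, "province")
operator_ratio = gain_ratio(SAMPLES, "operator")
paths = suspected_paths(SAMPLES, ["province", "operator"])
```

On the Table 1 data the sketch ranks province above operator at the root, splits the Beijing node by operator, leaves the Hebei node undivided, and returns the paths Beijing, Beijing/Unicom and Beijing/Telecom.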
After the suspected root cause dimensions are screened out using the decision tree in step S120 of fig. 1, step S130 is executed to calculate the dimension feature values. Two features are calculated for every suspected root cause dimension: the contribution degree and the consistency of the sub-dimension loss degrees. The contribution degree can be calculated according to formula 1, and the consistency of the sub-dimension loss degrees can be measured by the coefficient of variation, as shown in formula 2:
$$\mathrm{contribution}_d = \frac{pvlost_d}{pvlost_{total}} \quad \text{(formula 1)}$$
In the above formula, pvlost_d is the loss value of dimension d, and pvlost_total is the loss value of the total dimension. The loss value is the abnormal flow value.
[Formula 2: equation images not reproduced]
In the formula, pv_d and pvlost_d are the number of successes (normal flow value) and the number of failures (abnormal flow value) of dimension d; r_d is the degree of anomaly of dimension d; and the dimensions t_1, t_2, t_3, …, t_n are the sub-dimensions of dimension d. For example, the sub-dimensions of the Beijing dimension are Beijing Unicom, Beijing Mobile, and Beijing Telecom.
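The two features can be sketched in Python as follows. The contribution degree follows the ratio stated in the text. The degree of anomaly is assumed here to be failures divided by all accesses, and the coefficient of variation is assumed to be the population standard deviation of the sub-dimension anomaly degrees divided by their mean; both are assumptions of this sketch, since the exact expressions of formula 2 are not recoverable from the text.

```python
from statistics import mean, pstdev

def contribution(pvlost_d, pvlost_total):
    """Share of the total abnormal flow that falls in dimension d (formula 1)."""
    return pvlost_d / pvlost_total

def anomaly_degree(pv, pvlost):
    """Assumed definition: failed accesses divided by all accesses."""
    total = pv + pvlost
    return pvlost / total if total else 0.0

def loss_consistency(sub_dimensions):
    """Coefficient of variation of the anomaly degrees of the sub-dimensions.

    sub_dimensions is a list of (pv, pvlost) pairs; a lower value means the
    loss is spread more evenly over the sub-dimensions.
    """
    rates = [anomaly_degree(pv, lost) for pv, lost in sub_dimensions]
    average = mean(rates)
    return pstdev(rates) / average if average else 0.0

# Beijing in Table 1: sub-dimensions Beijing Unicom (100, 90) and Beijing Telecom (80, 10).
beijing_contribution = contribution(90 + 10, 100)
beijing_cv = loss_consistency([(100, 90), (80, 10)])
# Hypothetical dimension whose sub-dimensions all lose half of their accesses.
uniform_cv = loss_consistency([(100, 100), (50, 50)])
```

Under these assumptions a root cause dimension would tend to combine a high contribution degree with a low coefficient of variation, because a fault at that level affects all of its sub-dimensions to a similar degree.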
According to an embodiment of the data analysis method for multidimensional data of the present invention, step S140, identifying whether the suspected root cause dimension is the root cause dimension according to the calculated contribution degree and sub-dimension loss degree consistency of the suspected root cause dimension, includes: inputting the calculated contribution degree and sub-dimension loss degree consistency of the suspected root cause dimension into a classifier, which classifies whether the suspected root cause dimension is the root cause dimension.
In step S140, the contribution degree and the sub-dimension loss degree consistency of each suspected root cause dimension are input into a linear binary classifier trained on historical data, which classifies whether the dimension is a root cause dimension. The classifier is trained on historical data as follows: data from historical faults are acquired, and each dimension is labeled as one of two classes according to whether it is a root cause dimension, with 0 denoting a non-root cause and 1 denoting a root cause. The two features of each dimension are calculated according to the steps above, and a machine learning classification algorithm such as a decision tree or logistic regression is used for training to obtain the binary classifier.
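As an illustration of the training step, the Python sketch below fits a logistic regression, one of the algorithms named above, by plain gradient ascent on the log-likelihood. The four labeled feature vectors are hypothetical values chosen only to make the example runnable; they are not historical fault data, and the learning rate and epoch count are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=1.0, epochs=2000):
    """Fit weights and bias by batch gradient ascent on the log-likelihood."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = y - sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w + lr * g / n for w, g in zip(weights, grad_w)]
        bias += lr * grad_b / n
    return weights, bias

def is_root_cause(model, x):
    """Classify one (contribution degree, loss consistency) feature vector."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5

# Hypothetical training set: (contribution degree, coefficient of variation).
FEATURES = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
LABELS = [1, 1, 0, 0]  # 1 = root cause, 0 = non-root cause

model = train_logistic(FEATURES, LABELS)
high_contribution = is_root_cause(model, (0.85, 0.15))
low_contribution = is_root_cause(model, (0.15, 0.85))
```

In this toy setting the learned weight on the contribution degree is positive and the weight on the coefficient of variation is negative, so dimensions with concentrated, evenly distributed loss are classified as root causes.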
The multidimensional data analysis method provided by the embodiment of the invention can be used in fault location scenarios and is applicable to the analysis of any additive multidimensional data. Multidimensional data is additive when the data of the total dimension equals the sum of the data of all its sub-dimensions; for example, the data of the operator dimension equals the sum of the data of Unicom, Mobile, Telecom, and so on.
On the other hand, the embodiment of the invention provides a data analysis device for multi-dimensional data. Fig. 6 is an overall block diagram of the data analysis apparatus for multidimensional data according to the embodiment of the present invention. As shown in fig. 6, the data analysis device for multidimensional data according to the embodiment of the present invention includes: a flow acquiring unit 100, configured to acquire a normal flow value and an abnormal flow value of each dimension in a dimension combination of multi-dimensional data; a dimension screening unit 200, configured to input a dimension combination of multi-dimensional data and a normal flow value and an abnormal flow value of the dimension combination into a decision tree, and screen a suspected root dimension from the dimension combination of the multi-dimensional data by using the decision tree; a feature calculating unit 300, configured to calculate a contribution degree of the suspected root dimension and a consistency degree of a loss degree of the sub-dimension; and an identifying unit 400, configured to identify whether the suspected root cause dimension is a root cause dimension according to the calculated contribution degree and the calculated consistency degree of the loss degree of the sub-dimensions of the suspected root cause dimension, where the root cause dimension is a data dimension corresponding to a root cause causing traffic loss.
Fig. 7 is a block diagram showing a data analysis apparatus for multi-dimensional data according to another embodiment of the present invention. As shown in fig. 7, according to an embodiment of the data analysis apparatus for multidimensional data of the present invention, the flow acquiring unit 100 includes: a monitoring subunit 110, configured to monitor a total flow of the multidimensional data; and an acquisition subunit 120 for: if the total flow of the multi-dimensional data in a preset time period is monitored to have flow loss, acquiring a normal flow value and an abnormal flow value of each dimension of the multi-dimensional data in the preset time period.
According to an embodiment of the apparatus for analyzing multidimensional data, the obtaining subunit 120 is further configured to: determine the difference between the collected flow data value of each dimension in the preset time period and the collected flow data value of each dimension in a specified time period as the abnormal flow value of each dimension.
According to an embodiment of the apparatus for analyzing multidimensional data, the obtaining subunit 120 is further configured to: counting the times of failed accesses of all dimensions in the preset time period, wherein the accesses which do not receive reply information in the preset time period are taken as the failed accesses; and determining the number of times of access failure of each dimension as the abnormal flow value of each dimension.
According to an embodiment of the apparatus for analyzing multidimensional data, the obtaining subunit 120 is further configured to: predicting the flow data value of each dimension in the preset time period; and determining the difference value between the acquired flow data value of each dimension in the preset time period and the predicted flow data value of each dimension in the preset time period as the abnormal flow value of each dimension.
According to an embodiment of the apparatus for analyzing multidimensional data of the present invention, the dimension screening unit 200 is further configured to: taking the abnormal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a negative example set; balancing the weights of the positive and negative samples to make the weights of the positive and negative samples equal in the initial state; calculating the information gain rate of each dimension according to the balanced positive and negative sample weights, selecting the dimension with the largest information gain rate for division, and constructing the decision tree; and determining the path of the constructed decision tree as a suspected root dimension.
According to an embodiment of the apparatus for data analysis of multidimensional data, balancing the positive and negative example sample weights comprises: taking the product of the abnormal flow value of the dimension combination of the multi-dimensional data and a balance coefficient as the weight of the dimension combination in the positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in the negative example set, wherein the balance coefficient is the ratio of the sum of the normal flow values of all dimensions of the multi-dimensional data to the sum of the abnormal flow values of all dimensions.
Referring to fig. 6, according to an embodiment of the apparatus for analyzing multidimensional data of the present invention, the identification unit 400 is further configured to: input the calculated contribution degree and sub-dimension loss degree consistency of the suspected root cause dimension into a classifier, which classifies whether the suspected root cause dimension is the root cause dimension.
The functions of each module in the apparatus according to the embodiment of the present invention may refer to the related description of the above method, and are not described herein again.
In another aspect, an embodiment of the present invention provides a data analysis apparatus for multidimensional data, including: one or more processors; storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the above methods of data analysis of multidimensional data.
In one possible design, the data analysis apparatus for multidimensional data includes a processor and a memory, the memory is used for storing a program for the data analysis apparatus supporting multidimensional data to execute the data analysis method for multidimensional data, and the processor is configured to execute the program stored in the memory. The data analysis device of the multidimensional data can also comprise a communication interface, and the data analysis device of the multidimensional data is used for communicating with other equipment or a communication network.
Fig. 8 is a block diagram showing a data analysis apparatus for multi-dimensional data according to another embodiment of the present invention. As shown in fig. 8, the apparatus includes: a memory 910 and a processor 920, the memory 910 having stored therein computer programs operable on the processor 920. The processor 920 implements the data analysis method of the multi-dimensional data in the above embodiments when executing the computer program. The number of the memory 910 and the processor 920 may be one or more.
The data analysis apparatus for multidimensional data further comprises a communication interface 930 for communicating with external devices to perform interactive data transmission.
Memory 910 may include high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, the memory 910, the processor 920 and the communication interface 930 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method described in any of the above embodiments.
The technical scheme has the following advantages or beneficial effects: when a fault occurs, the root cause dimension can be quickly analyzed according to the multidimensional data of the fault index, the time for locating the fault by operation and maintenance personnel is saved, and the loss caused by the fault is reduced.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (18)

1. A data analysis method of multidimensional data is characterized by comprising the following steps:
acquiring a normal flow value and an abnormal flow value of each dimension in a dimension combination of multi-dimension data, wherein the abnormal flow value comprises access failure times, and the normal flow value comprises access success times;
inputting a dimension combination of multi-dimensional data and a normal flow value and an abnormal flow value of the dimension combination into a decision tree, and screening suspected root cause dimensions from the dimension combination of the multi-dimensional data by using the decision tree;
calculating the contribution degree and the sub-dimension loss degree consistency of the suspected root cause dimension, wherein the contribution degree of the suspected root cause dimension is the ratio of the abnormal flow value of the suspected root cause dimension to the abnormal flow value of the total dimension, and the sub-dimension loss degree consistency is measured using a coefficient of variation; and
identifying whether the suspected root cause dimension is the root cause dimension according to the calculated contribution degree and sub-dimension loss degree consistency of the suspected root cause dimension, wherein the root cause dimension is the data dimension corresponding to the root cause causing the flow loss.
2. The method of claim 1, wherein obtaining normal flow values and abnormal flow values for each dimension of the multi-dimensional data comprises:
monitoring a total flow of the multidimensional data; and
if the total flow of the multi-dimensional data in a preset time period is monitored to have flow loss, acquiring a normal flow value and an abnormal flow value of each dimension of the multi-dimensional data in the preset time period.
3. The method of claim 2, wherein obtaining the normal flow value and the abnormal flow value for each dimension of the multi-dimensional data within the preset time period comprises:
determining the difference between the collected flow data value of each dimension in the preset time period and the collected flow data value of each dimension in a specified time period as the abnormal flow value of each dimension.
4. The method of claim 2, wherein obtaining the normal flow value and the abnormal flow value for each dimension of the multi-dimensional data within the preset time period comprises:
counting the times of failed accesses of all dimensions in the preset time period, wherein the accesses which do not receive reply information in the preset time period are taken as the failed accesses; and
determining the number of failed accesses of each dimension as the abnormal flow value of each dimension.
5. The method of claim 2, wherein obtaining the normal flow value and the abnormal flow value for each dimension of the multi-dimensional data within the preset time period comprises:
predicting the flow data value of each dimension in the preset time period; and
determining the difference between the flow data value acquired for each dimension in the preset time period and the flow data value predicted for that dimension in the preset time period as the abnormal flow value of that dimension.
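Claim 5 leaves the prediction method open. The sketch below uses a simple moving average as a stand-in predictor. The moving average, the `window` parameter, the sign convention (predicted minus observed), and the clamping of negative differences to zero are all assumptions made for this example.

```python
def predict_flow(history, window=3):
    """Predict the next flow value as the mean of the last `window` observations."""
    if not history:
        raise ValueError("history must not be empty")
    recent = history[-window:]
    return sum(recent) / len(recent)


def abnormal_flow(history, observed, window=3):
    """Shortfall of the observed flow against the prediction, never negative."""
    return max(predict_flow(history, window) - observed, 0.0)
```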
6. The method of any one of claims 1-5, wherein screening suspected root cause dimensions using the decision tree comprises:
taking the abnormal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a negative example set;
balancing the weights of the positive and negative samples to make the weights of the positive and negative samples equal in the initial state;
calculating the information gain ratio of each dimension according to the balanced positive and negative sample weights, selecting the dimension with the largest information gain ratio for splitting, and constructing the decision tree; and
determining a path of the constructed decision tree as a suspected root cause dimension.
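The split selection in claim 6 can be sketched as below. The claim's "information gain rate" is read here as the C4.5-style gain ratio (information gain divided by split information), which is an interpretation. The row layout `(combination, positive_weight, negative_weight)` and the function names are hypothetical, the weights are assumed to be balanced already, and only a single split choice is shown rather than a full tree.

```python
import math
from collections import defaultdict


def entropy(pos, neg):
    """Binary entropy of a weighted positive/negative pair."""
    total = pos + neg
    h = 0.0
    for w in (pos, neg):
        if total > 0 and w > 0:
            p = w / total
            h -= p * math.log2(p)
    return h


def gain_ratio(rows, dim):
    """Gain ratio obtained by splitting weighted rows on dimension `dim`."""
    pos = sum(p for _, p, _ in rows)
    neg = sum(n for _, _, n in rows)
    total = pos + neg
    if total == 0:
        return 0.0
    groups = defaultdict(lambda: [0.0, 0.0])
    for combo, p, n in rows:
        groups[combo[dim]][0] += p
        groups[combo[dim]][1] += n
    conditional, split_info = 0.0, 0.0
    for p, n in groups.values():
        frac = (p + n) / total
        if frac > 0:
            conditional += frac * entropy(p, n)
            split_info -= frac * math.log2(frac)
    if split_info == 0:
        return 0.0  # a single group carries no split information
    return (entropy(pos, neg) - conditional) / split_info


def best_dimension(rows, dims):
    """Dimension with the largest gain ratio, i.e. the next split."""
    return max(dims, key=lambda d: gain_ratio(rows, d))
```

Applying `best_dimension` recursively to each resulting group would grow the tree, and each root-to-leaf path would then name a candidate dimension combination.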
7. The method of claim 6, wherein balancing the positive and negative sample weights comprises: taking the product of the abnormal flow value of the dimension combination of the multi-dimensional data and a balance coefficient as the weight of the dimension combination in the positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in the negative example set, wherein the balance coefficient is the ratio of the sum of the normal flow values of all dimensions of the multi-dimensional data to the sum of the abnormal flow values of all dimensions.
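The balancing in claim 7 can be worked through in a few lines. The tuple layout `(name, normal, abnormal)` and the function name are hypothetical; the formula follows the claim: the balance coefficient is the sum of normal flow values divided by the sum of abnormal flow values.

```python
def balance_weights(combos):
    """Return (name, positive_weight, negative_weight) for each dimension combination.

    combos: list of (name, normal_flow, abnormal_flow) tuples.
    """
    total_normal = sum(normal for _, normal, _ in combos)
    total_abnormal = sum(abnormal for _, _, abnormal in combos)
    if total_abnormal == 0:
        raise ValueError("no abnormal flow to balance against")
    coefficient = total_normal / total_abnormal
    return [(name, abnormal * coefficient, normal) for name, normal, abnormal in combos]
```

For example, with normal flows 900 and 1100 and abnormal flows 60 and 40, the coefficient is 2000 / 100 = 20, so the positive weights become 1200 and 800, and their sum equals the negative weight sum of 2000.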
8. The method according to any one of claims 1 to 5, wherein identifying whether the suspected root cause dimension is a root cause dimension according to the calculated contribution degree and the calculated consistency degree of the sub-dimension loss degrees comprises:
inputting the calculated contribution degree and the calculated consistency degree of the sub-dimension loss degrees of the suspected root cause dimension into a classifier, and classifying whether the suspected root cause dimension is a root cause dimension.
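Claim 8 does not name a classifier type. The sketch below substitutes a hand-written threshold rule for a trained classifier, only to show the shape of the input and output. The thresholds 0.5 and 0.3 and the direction of each comparison are illustrative assumptions, not values from the patent.

```python
def is_root_cause(contribution, consistency_cv, min_contribution=0.5, max_cv=0.3):
    """Hypothetical stand-in classifier over the two features.

    Flags a candidate when it accounts for a large share of the abnormal flow
    and its sub-dimensions lose flow uniformly (low coefficient of variation).
    """
    return contribution >= min_contribution and consistency_cv <= max_cv
```

In practice the rule would be replaced by whatever classifier is trained on labeled (contribution, consistency) pairs.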
9. A data analysis apparatus for multidimensional data, comprising:
a flow acquisition unit, configured to acquire a normal flow value and an abnormal flow value of each dimension in a dimension combination of multi-dimensional data, wherein the abnormal flow value comprises the number of failed accesses, and the normal flow value comprises the number of successful accesses;
a dimension screening unit, configured to input the dimension combination of the multi-dimensional data, together with the normal flow value and the abnormal flow value of the dimension combination, into a decision tree, and to screen suspected root cause dimensions from the dimension combination of the multi-dimensional data using the decision tree;
a feature calculation unit, configured to calculate the contribution degree of the suspected root cause dimension and the consistency degree of the loss degrees of its sub-dimensions, wherein the contribution degree of the suspected root cause dimension is the ratio of the abnormal flow value of the suspected root cause dimension to the abnormal flow value of the total dimension, and the consistency degree of the loss degrees of the sub-dimensions is measured using a coefficient of variation; and
an identification unit, configured to identify whether the suspected root cause dimension is a root cause dimension according to the calculated contribution degree and the calculated consistency degree of the sub-dimension loss degrees of the suspected root cause dimension, wherein the root cause dimension is the data dimension corresponding to the root cause of the flow loss.
10. The apparatus according to claim 9, wherein the flow acquisition unit comprises:
a monitoring subunit, configured to monitor a total flow of the multi-dimensional data; and
an acquisition subunit, configured to acquire, if a flow loss is detected in the total flow of the multi-dimensional data within a preset time period, a normal flow value and an abnormal flow value of each dimension of the multi-dimensional data within the preset time period.
11. The apparatus of claim 10, wherein the acquisition subunit is further configured to: determine the difference between the flow data value acquired for each dimension in the preset time period and the flow data value acquired for that dimension in a specified time period as the abnormal flow value of that dimension.
12. The apparatus of claim 10, wherein the acquisition subunit is further configured to:
count the number of failed accesses of each dimension in the preset time period, wherein an access for which no reply information is received within the preset time period is taken as a failed access; and
determine the number of failed accesses of each dimension as the abnormal flow value of that dimension.
13. The apparatus of claim 10, wherein the acquisition subunit is further configured to:
predict the flow data value of each dimension in the preset time period; and
determine the difference between the flow data value acquired for each dimension in the preset time period and the flow data value predicted for that dimension in the preset time period as the abnormal flow value of that dimension.
14. The apparatus of any one of claims 9-13, wherein the dimension screening unit is further configured to:
take the abnormal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a positive example set, and take the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in a negative example set;
balance the weights of the positive and negative samples so that the weights of the positive and negative samples are equal in the initial state;
calculate the information gain ratio of each dimension according to the balanced positive and negative sample weights, select the dimension with the largest information gain ratio for splitting, and construct the decision tree; and
determine a path of the constructed decision tree as a suspected root cause dimension.
15. The apparatus of claim 14, wherein balancing the positive and negative sample weights comprises: taking the product of the abnormal flow value of the dimension combination of the multi-dimensional data and a balance coefficient as the weight of the dimension combination in the positive example set, and taking the normal flow value of the dimension combination of the multi-dimensional data as the weight of the dimension combination in the negative example set, wherein the balance coefficient is the ratio of the sum of the normal flow values of all dimensions of the multi-dimensional data to the sum of the abnormal flow values of all dimensions.
16. The apparatus according to any one of claims 9-13, wherein the identification unit is further configured to: input the calculated contribution degree and the calculated consistency degree of the sub-dimension loss degrees of the suspected root cause dimension into a classifier, and classify whether the suspected root cause dimension is a root cause dimension.
17. A data analysis apparatus for multidimensional data, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
18. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN201810400910.7A 2018-04-28 2018-04-28 Data analysis method and device for multi-dimensional data and storage medium Active CN108683530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810400910.7A CN108683530B (en) 2018-04-28 2018-04-28 Data analysis method and device for multi-dimensional data and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810400910.7A CN108683530B (en) 2018-04-28 2018-04-28 Data analysis method and device for multi-dimensional data and storage medium

Publications (2)

Publication Number Publication Date
CN108683530A CN108683530A (en) 2018-10-19
CN108683530B true CN108683530B (en) 2021-06-01

Family

ID=63802628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810400910.7A Active CN108683530B (en) 2018-04-28 2018-04-28 Data analysis method and device for multi-dimensional data and storage medium

Country Status (1)

Country Link
CN (1) CN108683530B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858821A (en) * 2019-02-14 2019-06-07 金瓜子科技发展(北京)有限公司 A kind of influence feature determines method, apparatus, equipment and medium
CN110009012B (en) * 2019-03-20 2023-06-16 创新先进技术有限公司 Risk sample identification method and device and electronic equipment
CN110995524B (en) * 2019-10-28 2022-06-14 北京三快在线科技有限公司 Flow data monitoring method and device, electronic equipment and computer readable medium
CN111064614B (en) * 2019-12-17 2020-12-08 腾讯科技(深圳)有限公司 Fault root cause positioning method, device, equipment and storage medium
CN111314173B (en) * 2020-01-20 2022-04-08 腾讯科技(深圳)有限公司 Monitoring information abnormity positioning method and device, computer equipment and storage medium
CN111241128A (en) * 2020-01-21 2020-06-05 北京字节跳动网络技术有限公司 Data processing method and device and electronic equipment
CN113220796A (en) * 2020-01-21 2021-08-06 北京达佳互联信息技术有限公司 Abnormal business index analysis method and device
CN113535444B (en) * 2020-04-14 2023-11-03 中国移动通信集团浙江有限公司 Abnormal motion detection method, device, computing equipment and computer storage medium
CN111209179A (en) * 2020-04-23 2020-05-29 成都四方伟业软件股份有限公司 Method, device and system for collecting and analyzing system operation and maintenance data
CN112015995A (en) * 2020-09-29 2020-12-01 北京百度网讯科技有限公司 Data analysis method, device, equipment and storage medium
CN113746798B (en) * 2021-07-14 2022-05-06 清华大学 Cloud network shared resource abnormal root cause positioning method based on multi-dimensional analysis
CN114900835A (en) * 2022-04-20 2022-08-12 广州爱浦路网络技术有限公司 Malicious traffic intelligent detection method and device and storage medium
CN115578078A (en) * 2022-11-15 2023-01-06 云智慧(北京)科技有限公司 Data processing method, device and equipment of operation and maintenance system
CN116227995B (en) * 2023-02-06 2023-09-12 北京三维天地科技股份有限公司 Index analysis method and system based on machine learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107154880A (en) * 2016-03-03 2017-09-12 阿里巴巴集团控股有限公司 system monitoring method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2016203497B2 (en) * 2015-06-22 2017-05-11 Accenture Global Services Limited Wi-fi access point performance management
CN107025154B (en) * 2016-01-29 2020-12-01 阿里巴巴集团控股有限公司 Disk failure prediction method and device
CN106874574B (en) * 2017-01-22 2019-10-29 清华大学 Mobile application performance bottleneck analysis method and device based on decision tree

Also Published As

Publication number Publication date
CN108683530A (en) 2018-10-19

Similar Documents

Publication Publication Date Title
CN108683530B (en) Data analysis method and device for multi-dimensional data and storage medium
US10983856B2 (en) Identifying root causes of performance issues
CN109587008B (en) Method, device and storage medium for detecting abnormal flow data
US20180107528A1 (en) Aggregation based event identification
CN110347561B (en) Monitoring alarm method and terminal equipment
Jiang et al. Efficient fault detection and diagnosis in complex software systems with information-theoretic monitoring
US10467087B2 (en) Plato anomaly detection
CN111314173B (en) Monitoring information abnormity positioning method and device, computer equipment and storage medium
CN110471945B (en) Active data processing method, system, computer equipment and storage medium
CN112491611A (en) Fault location system, method, apparatus, electronic device and computer readable medium
CN107861981A (en) A kind of data processing method and device
CN111367782B (en) Regression testing data automatic generation method and device
CN113438123B (en) Network flow monitoring method and device, computer equipment and storage medium
CN117290719B (en) Inspection management method and device based on data analysis and storage medium
CN110348717A (en) Base station value methods of marking and device based on grid granularity
CN113946983A (en) Method and device for evaluating weak links of product reliability and computer equipment
CN113886373A (en) Data processing method and device and electronic equipment
CN107463486B (en) System performance analysis method and device and server
CN112836124A (en) Image data acquisition method and device, electronic equipment and storage medium
CN117221087A (en) Alarm root cause positioning method, device and medium
CN111783883A (en) Abnormal data detection method and device
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN116319255A (en) Root cause positioning method, device, equipment and storage medium based on KPI
JP5640796B2 (en) Name identification support processing apparatus, method and program
CN115203556A (en) Score prediction model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant