Detailed Description
In order for those skilled in the art to better understand the technical solutions in the embodiments of the present specification, the technical solutions in the embodiments of the present specification will be described in detail below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification shall fall within the scope of protection.
A service sample in this specification refers to a sample that contains a plurality of service features, such as a service instance. An interface of the system may receive processing requests initiated by callers at any moment, and the request parameters and return results differ between calls. Within the ranges of a plurality of service feature dimensions, calls that share the same service feature values within a unit time range are aggregated, forming distinct and unique call forms for that unit time range; these are the service instance samples.
The service features of a service instance include, but are not limited to, "interface, interface request parameter, interface return parameter, request magnitude, directed acyclic structure of internal nodes of the system, invoked upstream and downstream systems, deployment unit", etc., and each service feature typically has multiple values. For example, for a service feature "age", the values may include "teenager", "young", "middle-aged" and "elderly". In short, a service instance is an abstraction of a class of similar service calls. The service system may determine whether the state of an instance is normal or abnormal by some means (e.g., establishing a normal instance library for comparison).
Thus, detection may be based on the service feature values contained in the instances. For example, a combination of several specific service feature values, such as "interface=1, return parameter=0, request magnitude=3", is taken as an alarm dimension, and once a service sample contains this combination of values, the service sample is determined to be abnormal. At present, alarm dimensions are configured by professionals with extensive business experience, so their accuracy is affected by personal experience and by changes in the scenario. Based on the above, the embodiments of this specification provide an alarm dimension mining method that automatically mines more accurate alarm dimensions.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings. As shown in fig. 1, fig. 1 is a schematic flow chart of an alarm dimension mining method provided in an embodiment of the present disclosure, where the flow specifically includes the following steps:
S101, acquiring a service sample set containing a normal service sample and an abnormal service sample, wherein any service sample contains characteristic values of a plurality of service characteristics.
The service samples may be obtained by extracting a portion of the historical data from a recent period close to the current time for use as training samples. Within the historical data, each service sample can be compared against a pre-established normal service sample library to judge whether it is normal or abnormal.
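The labeling step can be illustrated with a minimal sketch. It assumes that "comparison" means an exact match of all feature values against the normal library, which the specification does not state; the names `signature` and `label_samples` and the feature names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Sample = Dict[str, str]                  # feature name -> feature value
Signature = FrozenSet[Tuple[str, str]]   # hashable form of a sample


def signature(sample: Sample) -> Signature:
    """Turn a sample into a hashable set of (feature, value) pairs."""
    return frozenset(sample.items())


def label_samples(samples: List[Sample],
                  normal_library: Set[Signature]) -> List[Tuple[Sample, int]]:
    """Label each sample 0 (normal) if it appears in the library, else 1 (abnormal).

    Assumption: exact matching; a real system may use a looser comparison.
    """
    return [(s, 0 if signature(s) in normal_library else 1) for s in samples]


normal_library = {signature({"interface": "1", "return": "1"})}
history = [
    {"interface": "1", "return": "1"},
    {"interface": "1", "return": "0"},
]
labeled = label_samples(history, normal_library)
```

The resulting list of (sample, label) pairs is the kind of input the later decision-tree step expects.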
S103, constructing a decision tree to classify the service sample set.
The root node of the decision tree corresponds to the service sample set. Each service sample contains multiple service features, and each service feature may take multiple values. Therefore, the service samples can be classified based on the values of a given service feature, and repeating this classification over the whole set yields a decision tree. In terms of the tree, splitting starts from the root node to obtain a plurality of child nodes, and the child nodes can continue splitting until a splitting cut-off condition is met, at which point leaf nodes are obtained. Each child node corresponds to a subset of the service samples.
Because the service features are multiple, one service feature is required to be selected for splitting when the decision tree is split, so that the decision tree obtained by splitting has better classification effect. Namely, the categories of the sub-nodes are made as pure as possible, wherein pure refers to that labels of service samples in the sub-nodes are normal or abnormal as much as possible.
There are many methods of selecting the service feature used for classification; a common method determines which service feature to split on based on an information gain parameter. For a set of service samples to be classified, the information gain parameter, computed from the state before and after splitting, represents the degree of information increase when the decision tree splits on the values of a service feature. The larger the information gain parameter, the faster the purity of the service sample set increases when it is divided according to the current feature. The information gain parameter includes an information gain amount, an information gain ratio, a Gini coefficient, or the like.
For each subset of the service samples to be classified (the service sample set itself can be regarded as a subset of itself), the information amount can be expressed by a specific calculation. The information gain amount represents the difference between the information amounts after and before classification; the information gain ratio may be the ratio of the information gain amount to the information amount before splitting; and the Gini coefficient characterizes the purity of the set: the smaller the Gini coefficient, the smaller the probability that a sample selected from the set is misclassified, that is, the higher the purity of a subset of the service samples after splitting. The specific calculation of the information gain parameter can be user-defined.
S105, determining paths from leaf nodes to root nodes, which meet alarm conditions, in the decision tree, wherein the paths represent a combination of service characteristic values.
The constructed decision tree is a decision tree with the best classification effect on the service sample set. The paths in the decision tree represent a combination of traffic feature values corresponding to an optimal classification strategy/rule. Thus, some alarm conditions may be set according to actual needs, and specifically, the alarm conditions may include, for example, a limitation on the number of nodes in the path, a limitation on the proportion of abnormal traffic samples of each node in the path, a statistical limitation on the abnormal duty ratio of each node in the path, and so on. Therefore, paths with better classification effect on abnormal service samples can be screened out from the decision tree. As shown in fig. 2, fig. 2 is a schematic diagram of path selection in a decision tree according to an embodiment of the present disclosure, where a dashed line portion represents a path meeting an alarm condition.
S107, determining the combination of the service characteristic values corresponding to the paths as an alarm dimension so as to classify unlabeled service data according to the alarm dimension.
Specifically, a combination of service characteristic values corresponding to the paths is determined. For example, in fig. 2, the selected alarm dimension is "e→c→d", and the corresponding feature value combination is "d=d2" & "c=c2". The alarm dimension described above generally does not need to include the node orientation therein.
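As a rough illustration of turning a qualifying path into a combination of feature values, the sketch below walks a small hand-built tree. The nested-dict node layout, the helper names, and the ratios are all hypothetical, and the alarm condition is simplified to a single threshold on the leaf's abnormal ratio.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical node layout:
# {"feature": split feature or None, "abnormal_ratio": float, "children": {value: node}}
Node = dict
Condition = Tuple[str, str]  # (feature, value) taken along one edge


def leaf_paths(node: Node, prefix: Tuple[Condition, ...] = ()) -> Iterator[Tuple[Tuple[Condition, ...], Node]]:
    """Yield (edge conditions, leaf) for every root-to-leaf path."""
    if not node["children"]:
        yield prefix, node
        return
    for value, child in node["children"].items():
        yield from leaf_paths(child, prefix + ((node["feature"], value),))


def alarm_dimensions(root: Node, ratio_threshold: float) -> List[Dict[str, str]]:
    """Keep paths whose leaf abnormal ratio exceeds the threshold (simplified condition)."""
    return [dict(conds) for conds, leaf in leaf_paths(root)
            if leaf["abnormal_ratio"] > ratio_threshold]


def leaf(ratio: float) -> Node:
    return {"feature": None, "abnormal_ratio": ratio, "children": {}}


# Toy tree: root splits on d, the d2 branch splits on c.
example_tree = {
    "feature": "d", "abnormal_ratio": 0.3,
    "children": {
        "d1": leaf(0.05),
        "d2": {"feature": "c", "abnormal_ratio": 0.6,
               "children": {"c1": leaf(0.1), "c2": leaf(0.95)}},
    },
}
dimensions = alarm_dimensions(example_tree, ratio_threshold=0.9)
```

Each returned dictionary is unordered with respect to the path, which reflects the remark that the alarm dimension generally does not need to carry the node orientation.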
By constructing a decision tree with the best classifying effect on sample data and finding out a path with abnormal probability conforming to the actual alarming condition from the decision tree, the selected alarming dimension can locally maximize the abnormal probability for classification. By the mode, the alarm dimension is prevented from being selected by manual experience, the accuracy of configuration of the alarm dimension is improved, and the selection efficiency of the alarm dimension is improved.
In the process of constructing the decision tree, classification can be performed based on the values of all the service features until all the service features are exhausted. However, some modes can be adopted to terminate the splitting of the child nodes in advance and directly generate leaf nodes so as to improve the construction speed and the generalization capability of the decision tree.
For example, if the labels of the traffic samples in a node are the same, the node is determined to be a leaf node. I.e. in case the traffic samples in a child node are normal or abnormal, no further partitioning of the node is required.
For example, if the proportion of abnormal service samples in a node exceeds a threshold, the node is determined to be a leaf node. That is, when abnormal samples are significantly concentrated in a node, a path containing that node is likely to classify abnormal service samples well, so the node is determined to be a leaf node.
In addition, the service features included in different service instance samples may not be exactly the same. For example, suppose a sample set contains instances A and B, where instance A contains service features (a, b, c, d) and instance B contains service features (a, b, c, e). If the special service feature e is taken as the split service feature, B can be split into the next child node according to its value of e. Since sample A has no service feature e, a leaf node may be generated at this point, and the samples that lack service feature e are divided into it.
Maximizing the information gain parameter when constructing the decision tree can be achieved in a variety of ways. One such way is as follows: for a service sample subset corresponding to any non-leaf node, determining the service features contained in the service samples in the subset, wherein each service feature has a plurality of values; calculating a first information amount of the non-leaf node, and calculating a second information amount of the nodes obtained after splitting the service sample subset according to any service feature; determining the information gain parameter of each service feature in the sample subset according to the first information amount and the second information amount; and determining the service feature with the maximum information gain parameter as the split service feature, and splitting the non-leaf node according to the values of the split service feature.
Specifically, the information amount of the node to be split (the first information amount) is determined first. The node is then tentatively split according to the values of a service feature, the information amount after the split (the second information amount) is calculated from the resulting child nodes, and the information gain parameter is determined from the two. The service features contained in the node are traversed in this way, and the feature with the maximum information gain parameter is selected; this feature has the best classification effect on the node to be split and is used as the split service feature. For the child nodes obtained after splitting, the split service feature is removed from the candidate service features; that is, a split service feature that has already been used is no longer considered when a child node is split again.
The information gain parameter may be calculated in various ways. For example, for the service sample set D, the information amount is defined as follows: $Info(D) = -\sum_{i} p_i \log_2 p_i$, where $p_i$ represents the probability of the i-th service feature occurring in the service sample set; the number of service samples containing the service feature divided by the total number of service samples in the set may be used as an estimate. If the service sample set D is divided according to the attribute A, the information amount after the division is: $Info_A(D) = \sum_{j} \frac{|D_j|}{|D|} \, Info(D_j)$, where $D_j$ is the j-th service sample subset obtained according to the values of the service feature A. The information gain is the difference between the two: $Gain(A) = Info(D) - Info_A(D)$. Each time a split is required, the decision tree calculates the information gain of each service feature and then selects the service feature with the maximum information gain for splitting.
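A minimal sketch of the information gain calculation follows. It assumes the common ID3 convention in which the probabilities are taken over the normal/abnormal labels of the samples in a set; the function names and the toy features `interface` and `region` are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def info(labels: List[int]) -> float:
    """Information amount (entropy, in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def info_after_split(samples: List[Dict[str, str]], labels: List[int], feature: str) -> float:
    """Weighted information amount of the subsets obtained by splitting on `feature`."""
    groups: Dict[str, List[int]] = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[feature], []).append(label)
    total = len(labels)
    return sum(len(g) / total * info(g) for g in groups.values())


def gain(samples: List[Dict[str, str]], labels: List[int], feature: str) -> float:
    """Information gain: information before the split minus information after it."""
    return info(labels) - info_after_split(samples, labels, feature)


samples = [
    {"interface": "a", "region": "x"},
    {"interface": "a", "region": "y"},
    {"interface": "b", "region": "x"},
    {"interface": "b", "region": "y"},
]
labels = [1, 1, 0, 0]  # 1 = abnormal, 0 = normal
best_feature = max(["interface", "region"], key=lambda f: gain(samples, labels, f))
```

In this toy data `interface` separates the labels completely while `region` carries no information, so `interface` is chosen as the split feature.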
For another example, when classifying using the information gain ratio, a split information amount is additionally defined as follows: $split\_info(A) = -\sum_{j} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$, and the gain ratio is then defined as $GainRatio(A) = \frac{Gain(A)}{split\_info(A)}$. That is, the information gain amount of the service feature is first determined from the information amount before splitting (the first information amount) and the information amount after splitting (the second information amount), and then the ratio of the information gain amount to the split information amount is determined as the information gain ratio. Some smoothing may also be added to the information amount after splitting; for example, a smoothing term Ave(split_info(A)), representing the average value of the information amounts of the child nodes, is added to the denominator of the gain ratio, so that the denominator becomes split_info(A) + Ave(split_info(A)).
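The gain ratio can be sketched as below, assuming the usual C4.5-style split information (the entropy of the feature's value distribution). How the smoothing term is computed is not fully specified in the text, so it is exposed here as a plain `smoothing` argument supplied by the caller; all names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(values: Sequence) -> float:
    """Entropy (in bits) of the empirical distribution of `values`."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature_values: List[str], labels: List[int], smoothing: float = 0.0) -> float:
    """Information gain divided by (split information + smoothing term)."""
    total = len(labels)
    groups: Dict[str, List[int]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    info_after = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - info_after
    denominator = entropy(feature_values) + smoothing  # split_info(A) + smoothing
    # A feature with a single value has zero split information; treat its ratio as 0.
    return gain / denominator if denominator > 0 else 0.0


ratio = gain_ratio(["a", "a", "b", "b"], [1, 1, 0, 0])
smoothed = gain_ratio(["a", "a", "b", "b"], [1, 1, 0, 0], smoothing=1.0)
```

Adding a positive smoothing term lowers the ratio, which damps the preference for features whose split information is very small.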
For another example, for a given service sample set D, assume there are K service features, the number of samples for the k-th being $C_k$. The Gini coefficient of the sample set D is expressed as: $Gini(D) = 1 - \sum_{k=1}^{K} \left(\frac{C_k}{|D|}\right)^2$. The partition with the smallest Gini coefficient is then found among all possible partitions, and the service feature corresponding to that partition point is the optimal split service feature for dividing the sample set D.
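A short sketch of Gini-based selection follows. It assumes the standard CART-style definition computed over the normal/abnormal labels and scores each candidate feature by the size-weighted Gini coefficient of its subsets; the function names and data are illustrative.

```python
from collections import Counter
from typing import Dict, List


def gini(labels: List[int]) -> float:
    """Gini coefficient of a list of labels: 1 minus the sum of squared class shares."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_after_split(feature_values: List[str], labels: List[int]) -> float:
    """Size-weighted Gini coefficient of the subsets produced by one feature."""
    groups: Dict[str, List[int]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(g) / total * gini(g) for g in groups.values())


labels = [1, 1, 0, 0]
candidates = {
    "interface": ["a", "a", "b", "b"],  # separates the labels completely
    "region": ["x", "y", "x", "y"],     # carries no information
}
best_feature = min(candidates, key=lambda f: gini_after_split(candidates[f], labels))
```

Unlike the information gain, the Gini coefficient is minimized, so the feature with the smallest weighted value is chosen.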
In summary, after the service sample data set is input, the decision tree construction mode may be represented by the following pseudo code:
1. Construct the root node N, where N corresponds to the service sample set;
2. If the labels of all service samples in the current node are the same: count the abnormal probability under the node and generate a leaf node;
3. If the candidate service feature set is empty: count the abnormal probability under the node and generate a leaf node;
4. If the abnormal probability of the current node is greater than the threshold: count the abnormal probability under the node and generate a leaf node;
5. Select the service feature A with the maximum information gain parameter from the candidate service feature set;
6. Divide the samples according to each value j of A, remove the split service feature A from the service feature list, and record the data of the j-th branch as A_j;
7. If A_j is empty: create a new leaf node and count the abnormal probability under the node;
8. Otherwise, recursively call the procedure to obtain the subtree node N_j.
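The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: labels are 0 (normal) or 1 (abnormal), the information gain amount is used as the gain parameter, and `Node`, `MISSING`, `build_tree` and the default threshold of 0.9 are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Sample = Tuple[Dict[str, str], int]  # (feature values, label); label 1 = abnormal
MISSING = "<missing>"                # hypothetical branch for samples lacking a feature


@dataclass
class Node:
    abnormal_ratio: float
    size: int
    split_feature: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def entropy(samples: List[Sample]) -> float:
    total = len(samples)
    counts = Counter(label for _, label in samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def partition(samples: List[Sample], feature: str) -> Dict[str, List[Sample]]:
    """Group samples by their value of `feature` (only observed values appear)."""
    groups: Dict[str, List[Sample]] = {}
    for feats, label in samples:
        groups.setdefault(feats.get(feature, MISSING), []).append((feats, label))
    return groups


def info_gain(samples: List[Sample], feature: str) -> float:
    total = len(samples)
    after = sum(len(g) / total * entropy(g) for g in partition(samples, feature).values())
    return entropy(samples) - after


def make_leaf(samples: List[Sample]) -> Node:
    abnormal = sum(label for _, label in samples)
    return Node(abnormal / len(samples), len(samples))


def build_tree(samples: List[Sample], features: List[str], threshold: float = 0.9) -> Node:
    node = make_leaf(samples)                      # records the abnormal probability
    pure = node.abnormal_ratio in (0.0, 1.0)       # step 2: all labels identical
    if pure or not features or node.abnormal_ratio > threshold:  # steps 2-4
        return node
    best = max(features, key=lambda f: info_gain(samples, f))    # step 5
    node.split_feature = best
    remaining = [f for f in features if f != best]               # step 6
    for value, subset in partition(samples, best).items():
        if value == MISSING:
            node.children[value] = make_leaf(subset)  # samples without the feature
        else:
            node.children[value] = build_tree(subset, remaining, threshold)  # step 8
    return node
```

Because `partition` only creates branches for values that actually occur, the empty-branch case of step 7 does not arise in this sketch; an implementation that enumerates every possible value of a feature would need that check.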
By this method, a decision tree with the best classification effect on the input service sample set can be obtained. Suppose the service sample set contains d samples and n service features. Compared with mining dimensions by enumerating alarm combinations, selecting alarm dimensions in the above manner optimizes the time complexity to O(n×d×log(d)). It can be seen that, when alarm dimensions are selected automatically from large data samples, the scheme provided by the embodiments of this specification can greatly improve efficiency.
In one embodiment, the selection of the alarm condition may be based on the following: the abnormal sample duty cycle in the nodes of the path exceeds a proportional threshold; and in the service sample sequence with a plurality of continuous unit time intervals, the statistical standard score of the abnormal sample duty ratio sequence exceeds the standard score threshold.
A path from a leaf node to a root node comprises a plurality of nodes, each of which contains normal samples and abnormal samples. The proportion of the number of abnormal samples to the total number of samples on the path is counted; this is the abnormal sample ratio in the nodes of the path, and the threshold for this ratio can be preset.
The unit time interval may be set to, for example, 30s or 60s, depending on the actual situation. When data is acquired in real time, the service instance samples containing the path in a plurality of consecutive unit time intervals can be counted to obtain a sample sequence containing the path, from which an abnormal ratio sequence can be obtained. The abnormal ratio is calculated as the ratio of the number of abnormal samples to the total number of samples. Assuming that the abnormal ratio sequence under the alarm dimension is $X = \{X_1, X_2, \ldots, X_n\}$, the mean u and standard deviation σ of the sequence can be computed, and the standard score of the sequence can be calculated as $z = (\max_i X_i - u) / \sigma$. Calculated in this way, the standard score reflects the degree to which the interval with the largest abnormal ratio deviates from the typical abnormal ratio of the sample sequence containing the path; if this degree exceeds the preset threshold, the service samples containing the path are considered to deviate from the normal condition. In other words, the path may be used as an alarm dimension.
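The standard score described above can be computed with a few lines of Python. This sketch assumes the population standard deviation (the text does not say whether the population or sample form is intended) and returns 0 for a constant sequence; the function name and the sample ratios are illustrative.

```python
import statistics
from typing import List


def standard_score(abnormal_ratios: List[float]) -> float:
    """z = (max(X) - mean(X)) / std(X) for a sequence of per-interval abnormal ratios."""
    mean = statistics.fmean(abnormal_ratios)
    sigma = statistics.pstdev(abnormal_ratios)  # population standard deviation (assumption)
    if sigma == 0:
        return 0.0  # a constant sequence has no outlying interval
    return (max(abnormal_ratios) - mean) / sigma


# Abnormal ratios of one path over four consecutive unit time intervals.
ratios = [0.1, 0.1, 0.1, 0.5]
z = standard_score(ratios)
```

A path would then satisfy this part of the alarm condition when `z` exceeds the preset standard score threshold.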
In general, the alarm dimension does not need to be changed and can be used to raise alarms on unlabeled service samples. If the service data changes as the service evolves, the alarm effect of the alarm dimension may deteriorate. In that case, it is only necessary to obtain another batch of current service samples and mine a new batch of alarm dimensions from them as a replacement.
Further, when computing the statistical standard score, if the number of nodes in the path is too small, too much data tends to correspond to the alarm dimension, and abnormal service samples cannot be accurately reflected. Thus, another condition may be added to the alarm conditions used to screen alarm dimensions: the number of nodes in the path exceeds a number threshold. After paths are screened from the decision tree using the alarm conditions, the combination of service feature values corresponding to each path can be used as an alarm dimension. As shown in fig. 3, fig. 3 is a schematic flow chart of acquiring an alarm dimension according to an embodiment of the present disclosure.
On the other hand, after the alarm dimension is obtained, the embodiments of the present disclosure may further perform data classification based on the alarm dimension obtained by the above scheme, which specifically includes: determining the service feature values of the service data to be classified; and if the service feature values of the service data contain the alarm dimension, determining that the service data is abnormal service data.
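The classification step amounts to a containment check, sketched below. The feature names follow the "interface=1, return parameter=0, request magnitude=3" example from earlier in the description; the function names and the extra `deployment_unit` field are hypothetical.

```python
from typing import Dict, List


def contains_dimension(record: Dict[str, str], dimension: Dict[str, str]) -> bool:
    """True if the record has every feature value required by the alarm dimension."""
    return all(record.get(feature) == value for feature, value in dimension.items())


def is_abnormal(record: Dict[str, str], dimensions: List[Dict[str, str]]) -> bool:
    """A record is abnormal if it contains at least one alarm dimension."""
    return any(contains_dimension(record, d) for d in dimensions)


dimensions = [{"interface": "1", "return_parameter": "0", "request_magnitude": "3"}]
suspicious = {"interface": "1", "return_parameter": "0",
              "request_magnitude": "3", "deployment_unit": "u7"}
ordinary = {"interface": "1", "return_parameter": "1", "request_magnitude": "3"}
flags = [is_abnormal(suspicious, dimensions), is_abnormal(ordinary, dimensions)]
```

Features outside the alarm dimension, such as `deployment_unit` here, do not affect the result.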
Correspondingly, the embodiment of the present disclosure further provides an alarm dimension excavating device, as shown in fig. 4, fig. 4 is a schematic structural diagram of the alarm dimension excavating device provided in the embodiment of the present disclosure, including:
the acquiring module 401 acquires a service sample set containing a normal service sample and an abnormal service sample, wherein any service sample contains characteristic values of a plurality of service characteristics;
the construction module 403 is configured to construct a decision tree to classify the service sample set, where each node in the decision tree represents a service sample subset, the root node of the decision tree corresponds to the service sample set, splitting is performed by taking a feature value of a service feature as an edge, and the information gain parameter is maximized when each non-leaf node is split, where the information gain parameter includes an information gain amount, an information gain ratio or a Gini coefficient, and is used to represent the degree of information increase when the decision tree splits on the values of a service feature;
a path determining module 405, configured to determine a path from a leaf node to a root node in the decision tree, where the path meets an alarm condition, and the path represents a combination of service feature values;
the dimension determining module 407 determines a combination of service feature values corresponding to the paths as an alarm dimension, so as to classify unlabeled service data according to the alarm dimension.
Further, the building module 403 determines that a node is a leaf node if the labels of the service samples in the node are the same; or if the abnormal service sample duty ratio in one node exceeds a threshold value, determining the node as a leaf node; or if the feature value of one service feature is taken as an edge to split, the service sample set contains service samples which do not contain the service feature, and a leaf node corresponding to the service sample which does not contain the service feature is generated.
Further, the construction module 403 determines, for a service sample subset corresponding to any non-leaf node, service features included in service samples in the service sample subset, where each service feature includes a plurality of values; calculating a first information amount of the non-leaf nodes, and calculating a second information amount of the nodes after splitting the service sample subset according to any service characteristic; determining information gain parameters of all service features in the sample subset according to the first information quantity and the second information quantity; and determining the service characteristic with the maximum information gain parameter as a split service characteristic, and splitting the non-leaf node according to the value of the split service characteristic.
Further, the construction module 403 determines a difference between the first information amount and the second information amount as an information gain amount of the service feature; or, according to the difference between the first information quantity and the second information quantity, determining the information gain quantity of the service characteristic, determining the split information quantity contained in the split node, and determining the ratio of the information gain quantity and the split information quantity as the information gain ratio.
Further, the alarm conditions in the device include: the abnormal sample duty cycle in the nodes of the path exceeds a proportional threshold; and in the service sample sequence with a plurality of continuous unit time intervals, the statistical standard score of the abnormal sample duty ratio sequence exceeds the standard score threshold.
Further, the alarm condition in the apparatus further comprises the number of nodes in the path exceeding a number threshold.
On the other hand, the embodiments of the present disclosure further provide a service data classification device based on the above alarm dimension, including:
the determining module is used for determining the service characteristic value of the service data;
and the judging module is used for determining that the service data is abnormal service data if the service characteristic value of the service data contains the alarm dimension.
The embodiment of the present disclosure also provides a computer device, which at least includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the method for mining the alarm dimension shown in fig. 1 when executing the program.
FIG. 5 illustrates a more specific hardware architecture diagram of a computing device provided by embodiments of the present description, which may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit ), microprocessor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. Memory 1020 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in memory 1020 and executed by processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The present embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the mining method of the alert dimension shown in fig. 1.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of embodiments, it will be apparent to those skilled in the art that the present embodiments may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be embodied in essence or what contributes to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present specification.
The system, method, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In this specification, each embodiment is described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment mainly describes its differences from the other embodiments. In particular, since the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference is made to the description of the method embodiments for the relevant points. The above-described device embodiments are merely illustrative: the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more pieces of software and/or hardware when implementing the embodiments of the present disclosure. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The foregoing is merely a specific implementation of the embodiments of this disclosure. It should be noted that a person skilled in the art may make several improvements and modifications without departing from the principles of the embodiments of this disclosure, and these improvements and modifications should also be considered as falling within the scope of protection of the embodiments of this disclosure.