CN117150294A - Outlier detection method, outlier detection device, electronic equipment and storage medium - Google Patents

Outlier detection method, outlier detection device, electronic equipment and storage medium

Info

Publication number
CN117150294A
CN117150294A (application number CN202311088730.7A)
Authority
CN
China
Prior art keywords
sample
rule
training
decision tree
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311088730.7A
Other languages
Chinese (zh)
Inventor
林建明
杨懿宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Samoye Digital Technology Co ltd
Original Assignee
Shenzhen Samoye Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Samoye Digital Technology Co ltd filed Critical Shenzhen Samoye Digital Technology Co ltd
Priority to CN202311088730.7A priority Critical patent/CN117150294A/en
Publication of CN117150294A publication Critical patent/CN117150294A/en
Pending legal-status Critical Current


Abstract

The application relates to an outlier detection method, an outlier detection device, an electronic device and a storage medium. The outlier detection method comprises the following steps: obtaining a sample to be predicted input by a user; and inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted, wherein the outlier score characterizes the abnormal condition of the sample to be predicted. The outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels. The application turns the unsupervised modeling process of the isolation forest into a supervised modeling process and converts the traditional isolation forest algorithm into an algorithm better suited to real financial risk-control scenarios, thereby improving application-level performance and overcoming the limitations of the isolation forest algorithm in financial risk-control scenarios.

Description

Outlier detection method, outlier detection device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information extraction, and in particular, to a method and apparatus for detecting an abnormal value, an electronic device, and a storage medium.
Background
The isolation forest is one of the traditional algorithms for outlier detection. Outlier detection is the process of finding outliers in data, that is, data points that differ significantly from the other data points in a given data set.
However, the isolation forest algorithm has limitations when used in financial risk-control scenarios, because it is an unsupervised modeling method.
No effective solution has yet been proposed for the limitations of the isolation forest algorithm in financial risk-control scenarios.
Disclosure of Invention
The application provides an outlier detection method, an outlier detection device, an electronic device and a storage medium, aiming to solve the limitations of the existing isolation forest algorithm in financial risk-control scenarios.
In a first aspect, the present application provides an outlier detection method, the method comprising:
obtaining a sample to be predicted input by a user;
inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted, wherein the outlier score characterizes the abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels.
Optionally, generating the outlier detection model includes:
acquiring a training sample input by a user, wherein the training sample has a corresponding label;
training a single decision tree based on training samples and labels input by a user;
decomposing the trained single decision tree to obtain a decomposed first rule set;
putting a rule into a pure entropy queue when the information entropy of the training samples hit by the rule in the first rule set is less than or equal to a set threshold;
putting a rule into a priority queue when the information entropy of the training samples hit by the rule in the first rule set is greater than the set threshold;
comparing the number of rules in the priority queue with a preset number to obtain a comparison result; and
taking a forest formed by the trained decision tree as the outlier detection model when the comparison result indicates that the number of rules in the priority queue is less than or equal to the preset number.
Optionally, the method further comprises:
re-extracting a rule from the priority queue to obtain an extraction rule when the comparison result indicates that the number of rules in the priority queue is greater than the preset number;
determining a first sub-sample based on the extraction rule, wherein the first sub-sample comprises the samples covered by a leaf node of the trained decision tree;
training a new single decision tree based on the extraction rule and the first sub-sample;
decomposing the trained new single decision tree to obtain a decomposed second rule set;
putting a rule into the pure entropy queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is less than or equal to the set threshold;
putting a rule into the priority queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is greater than the set threshold; and
returning to the step of comparing the number of rules in the priority queue with the preset number to obtain a comparison result.
Optionally, the inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted includes:
acquiring a third rule set consisting of the rules in the pure entropy queue hit by the sample to be predicted;
comparing, for every rule in the third rule set, the average response rate of the training samples hit by the rule with the average response rate of the training samples to obtain a comparison result;
adding the length of the rule to a preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is greater than the average response rate of the training samples;
subtracting the length of the rule from the preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is less than or equal to the average response rate of the training samples; and
determining the preset score as the outlier score of the sample to be predicted when all rules in the third rule set have completed the comparison between the average response rate of the training samples they hit and the average response rate of the training samples.
Optionally, before the training of the single decision tree based on the training samples and labels input by the user, the method further comprises:
detecting whether the training sample and the label input by the user are compliant;
the training of the single decision tree based on the training samples and labels input by the user comprises:
training the single decision tree based on the training samples and labels input by the user when the training samples input by the user and their corresponding labels are detected to be compliant.
Optionally, the decomposing the trained single decision tree to obtain a first rule set after decomposition includes:
decomposing the trained single decision tree starting from its root node to obtain the path from the root node of the single decision tree to each leaf node; and
determining the paths from the root node of the single decision tree to the leaf nodes as the rules decomposed from the single decision tree, wherein these rules form the first rule set.
Optionally, the method further comprises:
acquiring decision tree parameters input by a user;
the training of the single decision tree based on the training sample and the label input by the user comprises the following steps:
training of the single decision tree is performed based on training samples, labels and decision tree parameters entered by the user.
In a second aspect, the present application provides an outlier detection apparatus, the apparatus comprising:
the sample to be predicted obtaining unit is used for obtaining a sample to be predicted input by a user;
the outlier acquisition unit is used for inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted, wherein the outlier score characterizes the abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels.
In a third aspect, the present application provides an electronic device, comprising: at least one communication interface; at least one bus connected to the at least one communication interface; at least one processor coupled to the at least one bus; at least one memory coupled to the at least one bus, wherein the processor performs the method of outlier detection according to any of the above-described embodiments of the application via a computer program.
In a fourth aspect, the present application also provides a computer storage medium storing computer-executable instructions for performing the outlier detection method according to any one of the above-described aspects of the present application.
Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages. The method provided by the embodiments of the application obtains a sample to be predicted input by a user and inputs the sample to be predicted into an outlier detection model to obtain an outlier score of the sample, wherein the outlier score characterizes the abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels. The application turns the unsupervised modeling process of the isolation forest into a supervised modeling process and converts the traditional isolation forest algorithm into an algorithm better suited to real financial risk-control scenarios, thereby improving application-level performance and overcoming the limitations of the isolation forest algorithm in financial risk-control scenarios.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
FIG. 1 is a flowchart of an outlier detection method according to an embodiment of the present application;
FIG. 2 is a flowchart of generating an outlier detection model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an abnormal value detection apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The following disclosure provides many different embodiments, or examples, for implementing different structures of the application. In order to simplify the present disclosure, components and arrangements of specific examples are described below. They are, of course, merely examples and are not intended to limit the application. Furthermore, the present application may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
First, terms appearing in the description of the embodiments of the present application are explained as follows:
information entropy: the expected value of the information describes the uncertainty of the information. The greater the entropy, the higher the degree of confusion that indicates the aggregate information.
Decision tree: a tree comprising a root node, a number of internal nodes and leaf nodes. The leaf nodes correspond to decision results, while the other nodes (the root node and the internal nodes) correspond to attribute judgment rules; a decision tree essentially makes its decision by recursively evaluating conditions layer by layer.
The isolation forest is one of the traditional algorithms for outlier detection. Outlier detection is the process of finding outliers in data, that is, data points that differ significantly from the other data points in a given data set.
However, the isolation forest algorithm has limitations when used in financial risk-control scenarios, because it is an unsupervised modeling method.
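For reference, a minimal sketch of the traditional, unsupervised isolation forest as provided by scikit-learn (an illustration only, not part of the disclosure); note that fit() takes no labels, which is exactly the limitation addressed here:

```python
# Traditional unsupervised isolation forest, shown for contrast:
# fit() uses no labels, so label information cannot guide the model.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # normal points
               rng.normal(6, 1, size=(5, 2))])    # a few outliers

clf = IsolationForest(n_estimators=100, random_state=0)
clf.fit(X)                                        # unsupervised: no labels
scores = clf.decision_function(X)                 # lower = more anomalous
print(scores[:3], scores[-3:])
```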
To solve the limitations of the existing isolation forest algorithm in financial risk-control scenarios, the application provides an outlier detection method that turns the unsupervised modeling process into a supervised modeling process, converting the traditional isolation forest algorithm into an algorithm better suited to real financial risk-control scenarios and thereby overcoming those limitations.
Fig. 1 is a flowchart of an outlier detection method according to an embodiment of the present application. An outlier detection method according to an embodiment of the present application is described below with reference to fig. 1.
As shown in fig. 1, an outlier detection method includes:
s101, acquiring a sample to be predicted input by a user;
Outlier detection is of great importance in data mining. For example, if outliers arise from variation in the data itself, analyzing them can reveal hidden, deeper and potentially valuable information, such as detecting financial and insurance fraud or hacking, or identifying consumer groups with extremely low or extremely high consumption. When the user needs to detect outliers in data, the sample to be predicted input by the user is acquired.
S102, inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted, wherein the outlier score characterizes the abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels.
In this embodiment, the sample to be predicted input by the user is input into the outlier detection model, and the outlier detection model outputs the outlier score of the sample to be predicted.
It should be noted that the outlier score characterizes the abnormal condition and abnormal probability of the sample to be predicted input by the user. The larger the outlier score, the more abnormal the sample to be predicted; the smaller the outlier score, the less abnormal the sample to be predicted.
According to the method, the sample to be predicted input by the user is acquired and input into the outlier detection model to obtain the outlier score of the sample to be predicted, wherein the outlier score characterizes the abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels. The application turns the unsupervised modeling process of the isolation forest into a supervised modeling process and converts the traditional isolation forest algorithm into an algorithm better suited to real financial risk-control scenarios, thereby improving application-level performance and overcoming the limitations of the isolation forest algorithm in financial risk-control scenarios.
As an alternative embodiment, fig. 2 is a flowchart of generating an outlier detection model according to an embodiment of the present application. The following describes a procedure for generating an outlier detection model according to an embodiment of the present application with reference to fig. 2, where generating the outlier detection model includes:
s201, acquiring a training sample input by a user, wherein the training sample has a corresponding label;
In the application, the training samples input by the user are obtained, and each training sample carries a corresponding label for modeling, so that a supervised modeling process is realized.
S202, training a single decision tree based on training samples and labels input by a user;
It can be appreciated that the single decision tree model is trained once the training samples input by the user and their corresponding labels have been obtained.
S203, decomposing the trained single decision tree to obtain a decomposed first rule set;
The single decision tree model is trained, and a trained single decision tree is obtained when the training reaches a preset condition. It should be noted that the tree depth of the decision tree may be fixed in the model.
After the trained single decision tree is obtained, it is decomposed to obtain a decomposed first rule set, where the first rule set comprises one or more rules.
S204, placing a rule into a pure entropy queue when the information entropy of the training samples hit by the rule in the first rule set is less than or equal to a set threshold;
The set threshold is preset and is generally 0.2. When the information entropy of the training samples hit by a rule in the first rule set is less than or equal to 0.2, the partition of the training samples produced by the rule is considered sufficiently certain, and the rule is put into the pure entropy queue.
S205, placing a rule into a priority queue when the information entropy of the training samples hit by the rule in the first rule set is greater than the set threshold;
As described above, the set threshold is generally 0.2. When the information entropy of the training samples hit by a rule in the first rule set is greater than 0.2, the partition of the training samples produced by the rule is not considered sufficiently certain, and the rule is put into the priority queue so that it can be re-extracted later.
S206, comparing the number of rules in the priority queue with a preset number to obtain a comparison result;
After all rules in the first rule set have been assigned to the pure entropy queue or the priority queue, the number of rules in the priority queue is compared with a preset number. It should be noted that the preset number is generally set to 0.
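A minimal sketch of this queue assignment (an illustration under assumed data structures, not the claimed implementation); information_entropy is the helper shown above, and each rule is assumed to be paired with the indices of the training samples it hits:

```python
# Assumed illustration: routing decomposed rules into the pure entropy
# queue or the priority queue by the entropy of the samples they hit.
ENTROPY_THRESHOLD = 0.2   # the "set threshold" from the description
PRESET_NUMBER = 0         # the "preset number" from the description

def assign_rules(rule_set, labels, pure_entropy_queue, priority_queue):
    """rule_set: list of (rule, hit_sample_indices) pairs."""
    for rule, hit_idx in rule_set:
        hit_labels = [labels[i] for i in hit_idx]
        if information_entropy(hit_labels) <= ENTROPY_THRESHOLD:
            pure_entropy_queue.append((rule, hit_idx))
        else:
            priority_queue.append((rule, hit_idx))

# After assignment, model building stops once the priority queue holds no
# more than PRESET_NUMBER rules (see S206/S207 below).
```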
S207, taking the forest formed by the trained decision tree as the outlier detection model when the comparison result indicates that the number of rules in the priority queue is less than or equal to the preset number.
When the number of rules in the priority queue is less than or equal to the preset number, that is, less than or equal to 0, training of the outlier detection model is considered complete, and the forest formed by the trained decision tree is used as the outlier detection model.
In the application, the original tree structure of the decision tree is disassembled into a rule structure, which enhances the interpretability of the outlier detection model.
Optionally, the method further comprises:
Re-extracting a rule from the priority queue to obtain an extraction rule when the comparison result indicates that the number of rules in the priority queue is greater than the preset number;
It can be understood that when the number of rules in the priority queue is greater than the preset number, that is, greater than 0, a rule in the priority queue needs to be re-extracted to obtain an extraction rule.
For example, a rule in the priority queue is "is male and is older than 30 years old"; re-extracting it yields the extraction rule "is male and is older than 30 years old".
Determining a first sub-sample based on the extraction rule, wherein the first sub-sample comprises the samples covered by the corresponding leaf node in the trained decision tree;
For example, if the extraction rule is "is male and is older than 30 years old", the first sub-sample is determined to be the samples covered by "is male and is older than 30 years old". It can be understood that whichever rule in the priority queue is re-extracted, the first sub-sample corresponding to that rule is the set of samples covered by that rule.
Training a new single decision tree based on the extraction rule and the first sub-sample;
A new single decision tree is trained according to the extraction rule, the first sub-sample and the labels corresponding to the first sub-sample, that is, a second decision tree in the forest is trained.
Decomposing the trained new single decision tree to obtain a decomposed second rule set;
The second rule set includes one or more rules. After training of the new single decision tree with the first sub-sample and the corresponding labels is completed, that is, when the training reaches the preset condition, the trained new single decision tree is decomposed to obtain the decomposed second rule set.
Placing a rule into the pure entropy queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is less than or equal to the set threshold;
The set threshold is preset and is generally 0.2. When the information entropy of the samples of the first sub-sample hit by a rule in the second rule set is less than or equal to 0.2, the partition of the first sub-sample produced by the rule is considered sufficiently certain, and the rule is put into the pure entropy queue.
Placing a rule into the priority queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is greater than the set threshold;
As described above, the set threshold is generally 0.2. When the information entropy of the samples of the first sub-sample hit by a rule in the second rule set is greater than 0.2, the partition of the first sub-sample produced by the rule is not considered sufficiently certain, and the rule is put into the priority queue so that it can be re-extracted later.
Returning to the step of comparing the number of rules in the priority queue with the preset number to obtain a comparison result.
It should be noted that, while the comparison result indicates that the number of rules in the priority queue is greater than the preset number, the following steps are repeated: re-extracting a rule from the priority queue to obtain an extraction rule; determining a first sub-sample based on the extraction rule, where the first sub-sample comprises the samples covered by the corresponding leaf node in the trained decision tree; training a new single decision tree based on the extraction rule and the first sub-sample; decomposing the trained new single decision tree to obtain a decomposed second rule set; placing a rule into the pure entropy queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is less than or equal to the set threshold; placing a rule into the priority queue when that information entropy is greater than the set threshold; and returning to the step of comparing the number of rules in the priority queue with the preset number, until the comparison result indicates that the number of rules in the priority queue is less than or equal to the preset number, that is, until the number of rules in the priority queue is 0. When the number of rules in the priority queue is 0, all rules are located in the pure entropy queue, the final outlier detection model is obtained, and the forest formed by the plurality of trained decision trees is used as the outlier detection model.
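A compact sketch of this iterative loop (an illustration only; train_tree, decompose_to_rules and samples_covered are hypothetical helpers standing in for the steps described above, and information_entropy, ENTROPY_THRESHOLD and PRESET_NUMBER are the helpers from the earlier sketches). The max_rounds guard is an added safety limit, not part of the disclosure:

```python
# Assumed illustration of the supervised forest-building loop: train a
# tree, decompose it into root-to-leaf rules, route the rules into the
# two queues, and keep growing new trees on the sub-samples of "impure"
# rules until the priority queue is empty.
def build_outlier_detection_model(X, y, train_tree, decompose_to_rules,
                                  samples_covered, max_rounds=100):
    pure_entropy_queue, priority_queue, forest = [], [], []

    def add_tree(sample_idx):
        # Train one single decision tree on the given (global) sample
        # indices, decompose it and route each rule into a queue.
        sub_X = [X[i] for i in sample_idx]
        sub_y = [y[i] for i in sample_idx]
        tree = train_tree(sub_X, sub_y)
        forest.append(tree)
        for rule in decompose_to_rules(tree):          # root-to-leaf paths
            local = samples_covered(rule, sub_X)       # indices into sub_X
            hit = [sample_idx[j] for j in local]       # map back to global
            hit_labels = [y[i] for i in hit]
            if information_entropy(hit_labels) <= ENTROPY_THRESHOLD:
                pure_entropy_queue.append((rule, hit))
            else:
                priority_queue.append((rule, hit))

    add_tree(list(range(len(X))))                      # the first decision tree
    rounds = 0
    while len(priority_queue) > PRESET_NUMBER and rounds < max_rounds:
        rule, hit = priority_queue.pop(0)              # re-extract a rule
        add_tree(hit)                                  # grow a tree on its first sub-sample
        rounds += 1

    return forest, pure_entropy_queue                  # the trained model
```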
When all rules are located in the pure entropy queue, the final outlier detection model is obtained. The outlier detection model takes the actual difference between positive and negative samples into account, and its scores are split towards the positive and negative ends.
As an alternative embodiment, inputting the sample to be predicted into the outlier detection model to obtain the outlier score of the sample to be predicted includes:
acquiring a third rule set consisting of the rules in the pure entropy queue hit by the sample to be predicted;
After the sample to be predicted input by the user is input into the outlier detection model, the model obtains the third rule set, that is, the rules in the pure entropy queue hit by the sample to be predicted, where the third rule set comprises one or more rules.
Comparing, for every rule in the third rule set, the average response rate of the training samples hit by the rule with the average response rate of the training samples to obtain a comparison result;
It should be noted that, for each rule in the pure entropy queue, the average response rate of the training samples hit by the rule on the modeling data set, that is, the mean of their target y values, can be distinguished from the average response rate of all training samples; for each rule in the third rule set, these two response rates are compared to obtain a comparison result.
Adding the length of the rule to a preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is greater than the average response rate of the training samples.
The preset score is generally 0 initially. The length of a rule may be the number of splits applied to the samples hit by the rule; for example, for the rule "is male, is older than 30 years old, is older than 35 years old", the samples hit by the rule have been split 3 times, so the length of the rule is 3.
When the comparison result indicates that the average response rate of the training samples hit by the rule is greater than the average response rate of the training samples, the length of the rule is added to the preset score; in this example the preset score is 0 and the rule length 3 is added, so the preset score is updated to the value after the addition, that is, to 3.
The average response rate of the training samples hit by a rule is the proportion of the samples covered by the rule whose label value is 1. For example, for the rule "is male, is older than 30 years old", the probability that a sample covered by the rule has a label of 1 is 0.3, where the label indicates an income greater than 20,000 yuan. The average response rate of the training samples is the probability that the label of a training sample is 1, for example 0.2; in this example, the average response rate of the training samples hit by the rule is therefore greater than the average response rate of the training samples.
Subtracting the length of the rule from the preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is less than or equal to the average response rate of the training samples;
The length of the rule is subtracted from the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule is less than or equal to the average response rate of the training samples. It can be understood that if, for the first rule to be compared, the average response rate of the training samples it hits is less than or equal to the average response rate of the training samples, the length of the rule is subtracted from the initial score of 0; if the length of the rule is 3, for example, the updated preset score is -3.
It should be noted that the preset score is updated after this comparison is made for each rule in the third rule set; therefore, when all rules in the third rule set have completed the comparison between the average response rate of the training samples they hit and the average response rate of the training samples, the preset score is determined as the outlier score of the sample to be predicted. It can be understood that the preset score at that point is the accumulated score over all rules in the third rule set.
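A minimal sketch of this scoring procedure (an illustration under assumed data structures; rule_hits_sample and the per-rule attributes length and avg_response_rate are hypothetical names for the statistics described above, stored during model building):

```python
# Assumed illustration: scoring a sample to be predicted from the rules in
# the pure entropy queue that it hits.
def outlier_score(sample, pure_entropy_queue, rule_hits_sample,
                  train_avg_response_rate):
    score = 0  # the "preset score", initially 0
    # Third rule set: the rules in the pure entropy queue hit by the sample.
    third_rule_set = [rule for rule in pure_entropy_queue
                      if rule_hits_sample(rule, sample)]
    for rule in third_rule_set:
        if rule.avg_response_rate > train_avg_response_rate:
            score += rule.length   # add the length of the rule
        else:
            score -= rule.length   # subtract the length of the rule
    return score                   # the outlier score of the sample
```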
As an alternative embodiment, before training the single decision tree based on the training samples and labels entered by the user, the method further comprises:
detecting whether the training samples input by the user and the labels corresponding to the training samples are compliant.
Detecting whether the training samples input by the user are compliant includes:
detecting whether the training samples input by the user are two-class (binary) variables, that is, whether the data type and the value range of the training samples fall within the range of a two-class variable.
Detecting whether the labels corresponding to the training samples are compliant includes:
detecting whether the labels corresponding to the training samples are numerical variables, that is, whether the data type and the value range of the labels fall within the range of a numerical variable.
It can be understood that if either the training samples input by the user or their corresponding labels are detected to be non-compliant, the step of training the single decision tree based on the training samples and labels input by the user and the subsequent steps are not executed; these steps are executed only when both the training samples input by the user and their corresponding labels are detected to be compliant.
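One possible reading of this compliance check, as a sketch (the disclosure describes the checks only at this level of detail, so the concrete tests below are assumptions):

```python
import numbers

# Assumed illustration: each training-sample variable is expected to be a
# two-class (binary) variable, i.e. to take at most two distinct values,
# and each label is expected to be a numerical variable.
def is_compliant(training_samples, labels):
    n_features = len(training_samples[0])
    columns = [{row[j] for row in training_samples} for j in range(n_features)]
    samples_ok = all(len(values) <= 2 for values in columns)
    labels_ok = all(isinstance(label, numbers.Number) for label in labels)
    return samples_ok and labels_ok

X = [[1, 0], [0, 1], [1, 1]]
y = [1, 0, 1]
if is_compliant(X, y):
    print("compliant: proceed to train the single decision tree")
else:
    print("non-compliant: training and subsequent steps are not executed")
```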
As an alternative embodiment, decomposing the trained single decision tree to obtain a first decomposed rule set includes:
decomposing the trained single decision tree starting from its root node to obtain the path from the root node of the single decision tree to each leaf node;
The path from the root node of the single decision tree to each leaf node is determined as one of the rules decomposed from the single decision tree, and these rules form the first rule set.
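A minimal sketch of this root-to-leaf decomposition using a scikit-learn decision tree (an illustration only; the disclosure does not prescribe a specific library, and the feature names are made up for the example):

```python
# Assumed illustration: decomposing a trained scikit-learn decision tree
# into rules, one rule per root-to-leaf path.
from sklearn.tree import DecisionTreeClassifier

def decompose_to_rules(tree_model, feature_names):
    t = tree_model.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:           # leaf node: one complete rule
            rules.append(conditions)
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conditions + [f"{name} <= {thr:.3f}"])
        walk(t.children_right[node], conditions + [f"{name} > {thr:.3f}"])

    walk(0, [])                                   # start from the root node
    return rules

X = [[25, 0], [40, 1], [35, 1], [22, 0]]
y = [0, 1, 1, 0]
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)   # tree depth may be fixed
for rule in decompose_to_rules(clf, ["age", "is_male"]):
    print(" AND ".join(rule))                     # each path is one rule
```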
Optionally, the method further comprises:
acquiring decision tree parameters input by a user;
the training of the single decision tree based on the training samples and labels input by the user comprises:
training the single decision tree based on the training samples, labels and decision tree parameters input by the user.
It should be noted that the decision tree parameters may be parameters such as the tree depth and the node-splitting evaluation criterion of the decision tree, input by the user.
Of course, before the single decision tree is trained based on the training samples, labels and decision tree parameters input by the user, whether the decision tree parameters input by the user are compliant also needs to be detected; the single decision tree is trained based on the training samples, labels and decision tree parameters input by the user only when the data types and value ranges of the training samples, their corresponding labels and the decision tree parameters input by the user are all within the required ranges.
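For example (an illustration only; the scikit-learn parameter names below stand in for whatever parameters the user supplies), user-specified tree depth and splitting criterion can be validated and then passed through when fitting the single decision tree:

```python
from sklearn.tree import DecisionTreeClassifier

# Assumed illustration: user-supplied decision tree parameters, checked
# for compliance (tree depth and node-splitting evaluation criterion)
# before training.
user_params = {"max_depth": 4, "criterion": "entropy"}

assert isinstance(user_params["max_depth"], int) and user_params["max_depth"] > 0
assert user_params["criterion"] in ("gini", "entropy")

X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 0, 1, 0]
tree = DecisionTreeClassifier(**user_params).fit(X, y)
```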
According to another embodiment of the present application, there is provided an abnormal value detection apparatus, as shown in fig. 3, including:
a sample to be predicted obtaining unit 301, configured to obtain a sample to be predicted input by a user;
an outlier obtaining unit 302, configured to input a sample to be predicted into an outlier detection model, to obtain an outlier of the sample to be predicted, where the outlier characterizes an abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by a user, and the training samples have corresponding labels.
It should be noted that, the sample to be predicted acquiring unit 301 in this embodiment may be used to perform step S101 in the embodiment of the present application, and the outlier acquiring unit 302 in this embodiment may be used to perform step S102 in the embodiment of the present application.
It should be noted that the above units implement the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of the above embodiments. The above units may be implemented in software or hardware as part of the apparatus.
Through these units, the unsupervised modeling process of the isolation forest is turned into a supervised modeling process, the traditional isolation forest algorithm is converted into an algorithm better suited to real financial risk-control scenarios, application-level performance is improved, and the limitations of the isolation forest algorithm in financial risk-control scenarios are overcome.
As an alternative embodiment, the apparatus further comprises: an abnormal value detection model generation unit, wherein the abnormal value detection model generation unit is used for generating an abnormal value detection model.
The abnormal value detection model generation unit includes:
the training sample acquisition unit is used for acquiring training samples input by a user, wherein the training samples have corresponding labels;
the single decision tree training unit is used for training a single decision tree based on training samples and labels input by a user;
the single decision tree decomposition unit is used for decomposing the trained single decision tree to obtain a decomposed first rule set;
the pure entropy queue obtaining unit is used for placing a rule into the pure entropy queue when the information entropy of the training samples hit by the rule in the first rule set is less than or equal to a set threshold;
the priority queue obtaining unit is used for placing a rule into the priority queue when the information entropy of the training samples hit by the rule in the first rule set is greater than the set threshold;
the first comparison unit is used for comparing the number of rules in the priority queue with a preset number to obtain a comparison result;
and the abnormal value detection model generation subunit is used for taking the forest formed by the trained decision tree as the abnormal value detection model when the comparison result indicates that the number of rules in the priority queue is less than or equal to the preset number.
As an alternative embodiment, the apparatus further comprises:
the rule re-extraction unit is used for re-extracting a rule from the priority queue to obtain an extraction rule when the comparison result indicates that the number of rules in the priority queue is greater than the preset number;
the first sub-sample determining unit is used for determining a first sub-sample based on the extraction rule, wherein the first sub-sample comprises the samples covered by the corresponding leaf node in the trained decision tree;
the new single decision tree training unit is used for training a new single decision tree based on the extraction rule and the first sub-sample;
the new single decision tree decomposition unit is used for decomposing the trained new single decision tree to obtain a decomposed second rule set;
the pure entropy queue obtaining unit is further used for placing a rule into the pure entropy queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is less than or equal to the set threshold;
the priority queue obtaining unit is further used for placing a rule into the priority queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is greater than the set threshold;
and the return execution unit is used for returning to and executing the step of comparing the number of rules in the priority queue with the preset number to obtain a comparison result.
As an alternative embodiment, the outlier acquiring unit 302 includes an outlier acquisition subunit, which is used for acquiring a third rule set consisting of the rules in the pure entropy queue hit by the sample to be predicted;
comparing, for every rule in the third rule set, the average response rate of the training samples hit by the rule with the average response rate of the training samples to obtain a comparison result;
adding the length of the rule to a preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is greater than the average response rate of the training samples;
subtracting the length of the rule from the preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is less than or equal to the average response rate of the training samples;
and determining the preset score as the outlier score of the sample to be predicted when all rules in the third rule set have completed the comparison between the average response rate of the training samples they hit and the average response rate of the training samples.
As an alternative embodiment, the apparatus further comprises:
the compliance detection unit is used for detecting whether the training sample input by the user and the label corresponding to the training sample are in compliance or not;
the single decision tree training unit comprises a single decision tree training subunit, which is used for training the single decision tree based on the training samples and labels input by the user when the training samples input by the user and their corresponding labels are detected to be compliant.
As an alternative embodiment, the single decision tree decomposition unit comprises a single decision tree decomposition subunit, which is used for decomposing the trained single decision tree starting from its root node to obtain the path from the root node of the single decision tree to each leaf node;
and determining the paths from the root node of the single decision tree to the leaf nodes as the rules decomposed from the single decision tree, wherein these rules form the first rule set.
As an alternative embodiment, the apparatus further comprises:
the decision tree parameter acquisition unit is used for acquiring decision tree parameters input by a user;
and the single decision tree training unit is also used for training the single decision tree based on training samples, labels and decision tree parameters input by the user.
It should be noted here that the above units implement the same examples and application scenarios as the corresponding steps, but are not limited to what is disclosed in the above embodiments. The above units may be implemented in software or hardware as part of the apparatus.
According to another aspect of the embodiment of the present application, there is also provided an electronic device for implementing the above-mentioned outlier detection method.
FIG. 4 is a block diagram of an electronic device according to an embodiment of the present application. As shown in FIG. 4, the electronic device may include one or more processors 401 (only one is shown in FIG. 4), a communication interface 402, a memory 403 and a communication bus 404, where the processor 401, the communication interface 402 and the memory 403 communicate with each other over the communication bus 404;
a memory 403 for storing a computer program;
the processor 401 is configured to implement the steps of the method embodiment described above when executing the program stored in the memory 403.
The bus mentioned for the above server may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface 402 is used for communication between the electronic device and other devices described above.
The Memory 403 may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one magnetic disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor 401 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
It will be appreciated by those of ordinary skill in the art that the structure shown in fig. 4 is merely illustrative and not limiting of the structure of the server described above. For example, the server may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 4, or have a different configuration than shown in FIG. 4.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An outlier detection method, the method comprising:
obtaining a sample to be predicted input by a user;
inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted, wherein the outlier score characterizes the abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels.
2. The method of claim 1, wherein generating the outlier detection model comprises:
acquiring a training sample input by a user, wherein the training sample has a corresponding label;
training a single decision tree based on training samples and labels input by a user;
decomposing the trained single decision tree to obtain a decomposed first rule set;
putting a rule into a pure entropy queue when the information entropy of the training samples hit by the rule in the first rule set is less than or equal to a set threshold;
putting a rule into a priority queue when the information entropy of the training samples hit by the rule in the first rule set is greater than the set threshold;
comparing the number of rules in the priority queue with a preset number to obtain a comparison result; and
taking a forest formed by the trained decision tree as the outlier detection model when the comparison result indicates that the number of rules in the priority queue is less than or equal to the preset number.
3. The method according to claim 2, wherein the method further comprises:
re-extracting a rule from the priority queue to obtain an extraction rule when the comparison result indicates that the number of rules in the priority queue is greater than the preset number;
determining a first sub-sample based on the extraction rule, wherein the first sub-sample comprises the samples covered by a leaf node of the trained decision tree;
training a new single decision tree based on the extraction rule and the first sub-sample;
decomposing the trained new single decision tree to obtain a decomposed second rule set;
putting a rule into the pure entropy queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is less than or equal to the set threshold;
putting a rule into the priority queue when the information entropy of the samples of the first sub-sample hit by the rule in the second rule set is greater than the set threshold; and
returning to the step of comparing the number of rules in the priority queue with the preset number to obtain a comparison result.
4. The method according to claim 2, wherein the inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted comprises:
acquiring a third rule set consisting of the rules in the pure entropy queue hit by the sample to be predicted;
comparing, for every rule in the third rule set, the average response rate of the training samples hit by the rule with the average response rate of the training samples to obtain a comparison result;
adding the length of the rule to a preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is greater than the average response rate of the training samples;
subtracting the length of the rule from the preset score and updating the preset score when the comparison result indicates that the average response rate of the training samples hit by the rule in the third rule set is less than or equal to the average response rate of the training samples; and
determining the preset score as the outlier score of the sample to be predicted when all rules in the third rule set have completed the comparison between the average response rate of the training samples they hit and the average response rate of the training samples.
5. The method of claim 2, wherein prior to training the single decision tree based on the user input training samples and labels, the method further comprises:
detecting whether the training sample input by the user and the label corresponding to the training sample are in compliance or not;
the training of the single decision tree based on the training samples and labels input by the user comprises:
training the single decision tree based on the training samples and labels input by the user when the training samples input by the user and their corresponding labels are detected to be compliant.
6. The method of claim 2, wherein decomposing the trained single decision tree to obtain a decomposed first rule set comprises:
decomposing the trained single decision tree starting from its root node to obtain the path from the root node of the single decision tree to each leaf node; and
determining the paths from the root node of the single decision tree to the leaf nodes as the rules decomposed from the single decision tree, wherein these rules form the first rule set.
7. The method according to claim 2, wherein the method further comprises:
Acquiring decision tree parameters input by a user;
the training of the single decision tree based on the training sample and the label input by the user comprises the following steps:
training of the single decision tree is performed based on training samples, labels and decision tree parameters entered by the user.
8. An abnormal value detection apparatus, characterized in that the apparatus comprises:
the sample to be predicted obtaining unit is used for obtaining a sample to be predicted input by a user;
the outlier acquisition unit is used for inputting the sample to be predicted into an outlier detection model to obtain an outlier score of the sample to be predicted, wherein the outlier score characterizes the abnormal condition of the sample to be predicted; the outlier detection model is obtained by model training based on training samples input by the user, and the training samples have corresponding labels.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor performs the steps of the outlier detection method according to any of the preceding claims 1-7 by means of the computer program.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the outlier detection method according to any one of claims 1 to 7.
CN202311088730.7A (priority date 2023-08-25, filing date 2023-08-25) Outlier detection method, outlier detection device, electronic equipment and storage medium; status: Pending; published as CN117150294A (en)

Priority Applications (1)

Application Number: CN202311088730.7A; Priority Date: 2023-08-25; Filing Date: 2023-08-25; Title: Outlier detection method, outlier detection device, electronic equipment and storage medium


Publications (1)

Publication Number: CN117150294A (en); Publication Date: 2023-12-01

Family

ID=88903835

Family Applications (1)

Application Number: CN202311088730.7A; Status: Pending; Publication: CN117150294A (en); Title: Outlier detection method, outlier detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117150294A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117540328A (en) * 2024-01-09 2024-02-09 山西众诚安信安全科技有限公司 Noise processing method in high-precision measurement process of coal mine noise
CN117540328B (en) * 2024-01-09 2024-04-02 山西众诚安信安全科技有限公司 Noise processing method in high-precision measurement process of coal mine noise


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination