CN111046892A - Anomaly identification method and device - Google Patents
Anomaly identification method and device
- Publication number
- CN111046892A CN111046892A CN201811188129.4A CN201811188129A CN111046892A CN 111046892 A CN111046892 A CN 111046892A CN 201811188129 A CN201811188129 A CN 201811188129A CN 111046892 A CN111046892 A CN 111046892A
- Authority
- CN
- China
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses an anomaly identification method and device, relating to the field of computer technology. The method comprises the following steps: determining the distance between each sample point in a target sample set and the other sample points; constructing a corresponding Huffman tree for each sample point according to the distribution of its distances to the other sample points, so as to obtain a Huffman forest composed of multiple Huffman trees; and scoring each sample point according to the Huffman forest, and identifying abnormal sample points according to the scoring results. Through these steps, abnormal sample points can be identified quickly and accurately.
Description
Technical Field
The invention relates to the technical field of computers, and in particular to an anomaly identification method and device.
Background
Machine learning can be divided into supervised learning, semi-supervised learning, and unsupervised learning. Supervised learning adjusts the parameters of a classifier to achieve the desired performance using a set of samples of known classes; suitable algorithms include SVM (support vector machine), RF (random forest), KNN (K-nearest neighbors), and the like. Semi-supervised learning combines supervised and unsupervised learning, and mainly addresses how to train and classify using a small number of labeled samples together with a large number of unlabeled samples; suitable algorithms include SOM (self-organizing map) and the like. Unsupervised learning solves pattern-recognition problems from training samples whose classes are unknown (unlabeled); suitable algorithms include LOF (local outlier factor), the ORCA algorithm, iForest (isolation forest), the Mass algorithm, and the like. In practical scenarios, supervised learning is of limited use because it is difficult to manually label all sample data, especially big data. Semi-supervised learning is also of limited use because the set of abnormal samples is likely to be incomplete. Therefore, in practice anomaly identification is mainly performed through unsupervised learning.
In the process of implementing the invention, the inventor found that existing unsupervised learning algorithms for anomaly identification each have advantages and disadvantages. For example, compared with algorithms such as ORCA and LOF, the iForest algorithm has the advantage of fast computation, but it has the following problems: 1) it is built on the premise that all feature fields are continuous variables, and is therefore not suitable for anomaly identification on data containing categorical variables; 2) it is not suitable for processing high-dimensional data; 3) it is not good at handling locally sparse points.
Disclosure of Invention
In view of this, the present invention provides a new anomaly identification method and apparatus, which can identify abnormal sample points quickly and accurately.
To achieve the above object, according to one aspect of the present invention, there is provided an abnormality identification method.
The anomaly identification method of the present invention includes: determining the distance between each sample point in a target sample set and the other sample points; constructing a corresponding Huffman tree for each sample point according to the distribution of its distances to the other sample points, so as to obtain a Huffman forest composed of multiple Huffman trees; and scoring each sample point according to the Huffman forest, and identifying abnormal sample points according to the scoring results.
Optionally, the step of identifying abnormal sample points according to the scoring results includes: comparing the scoring results with a first threshold, and taking sample points whose scoring result is greater than the first threshold as abnormal sample points.
Optionally, the step of scoring each sample point according to the huffman forest comprises: determining a first score value of the sample point to be scored according to the coding length of other sample points in a Huffman tree of the sample point to be scored; determining a second score value of the sample point to be scored according to the coding length of the sample point to be scored in the Huffman trees of other sample points; and determining the scoring result of the sample point to be scored according to the first scoring value and the second scoring value.
Optionally, the first score value, the second score value and the scoring result are determined according to the following formulas:

Score_{i_1} = Σ_{j=1}^{N} L_{j,i}

Score_{i_2} = Σ_{j=1}^{N} L_{i,j}

Score_i = log(Score_{i_2} / Score_{i_1})

where Score_{i_1} denotes the first score value of the i-th sample point; i = 1, 2, …, N+1; N denotes the total number of sample points other than the i-th sample point; L_{j,i} denotes the coding length of the j-th other sample point in the Huffman tree of the sample point to be scored; Score_{i_2} denotes the second score value of the i-th sample point; L_{i,j} denotes the coding length of the i-th sample point in the Huffman tree of the j-th other sample point; and Score_i denotes the scoring result of the i-th sample point.
Optionally, the step of constructing a corresponding Huffman tree for each sample point according to the distance distribution between each sample point and the other sample points includes: performing normalization processing and divisor (rounding) processing on the distances between each sample point and the other sample points, then counting the number of sample points corresponding to each distance after the divisor processing, and constructing a corresponding Huffman tree for each sample point according to the counting result.
Optionally, the method further comprises: when the sample capacity of the obtained initial sample set is greater than a second threshold, performing stratified sampling on the initial sample set to obtain a plurality of target sample sets; and when the sample capacity of the initial sample set is less than or equal to the second threshold, taking the initial sample set as the target sample set.
To achieve the above object, according to another aspect of the present invention, there is provided an abnormality recognition apparatus.
The anomaly recognition device of the present invention includes: a determining module for determining the distance between each sample point in a target sample set and the other sample points; a construction module for constructing a corresponding Huffman tree for each sample point according to the distribution of its distances to the other sample points, so as to obtain a Huffman forest composed of multiple Huffman trees; and an identification module for scoring each sample point according to the Huffman forest and identifying abnormal sample points according to the scoring results.
Optionally, the identifying, by the identifying module, the abnormal sample point according to the scoring result includes: the identification module compares the scoring result with a first threshold value, and takes the sample points with the scoring results larger than the first threshold value as abnormal sample points.
Optionally, the scoring of each sample point by the identification module according to the Huffman forest comprises: the identification module determines a first score value of the sample point to be scored according to the coding lengths of the other sample points in the Huffman tree of the sample point to be scored; the identification module determines a second score value of the sample point to be scored according to the coding lengths of the sample point to be scored in the Huffman trees of the other sample points; and the identification module determines the scoring result of the sample point to be scored according to the first score value and the second score value.
Optionally, the identification module determines the first score value, the second score value and the scoring result according to the following formulas:

Score_{i_1} = Σ_{j=1}^{N} L_{j,i}

Score_{i_2} = Σ_{j=1}^{N} L_{i,j}

Score_i = log(Score_{i_2} / Score_{i_1})

where Score_{i_1} denotes the first score value of the i-th sample point; i = 1, 2, …, N+1; N denotes the total number of sample points other than the i-th sample point; L_{j,i} denotes the coding length of the j-th other sample point in the Huffman tree of the sample point to be scored; Score_{i_2} denotes the second score value of the i-th sample point; L_{i,j} denotes the coding length of the i-th sample point in the Huffman tree of the j-th other sample point; and Score_i denotes the scoring result of the i-th sample point.
Optionally, the constructing module constructs, according to a distance distribution between each sample point and another sample point, a corresponding huffman tree for each sample point, including: the construction module performs normalization processing and divisor processing on the distance between each sample point and other sample points, then counts the number of sample points corresponding to each distance after divisor processing, and constructs a corresponding Huffman tree for each sample point according to the statistical result.
Optionally, the apparatus further comprises: a sampling module for performing stratified sampling on the initial sample set to obtain a plurality of target sample sets when the sample capacity of the obtained initial sample set is greater than a second threshold; the sampling module is further configured to take the initial sample set as the target sample set when the sample capacity of the initial sample set is less than or equal to the second threshold.
To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.
The electronic device of the present invention includes: one or more processors; and storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the anomaly identification method of the present invention.
To achieve the above object, according to still another aspect of the present invention, there is provided a computer-readable medium.
The computer-readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the abnormality recognition method of the present invention.
One embodiment of the above invention has the following advantages or benefits: the invention realizes a brand-new anomaly identification method, which determines the distance between each sample point in a target sample set and the other sample points, constructs a corresponding Huffman tree for each sample point according to the distribution of its distances to the other sample points so as to obtain a Huffman forest composed of multiple Huffman trees, and scores each sample point according to the Huffman forest, so that abnormal sample points can be identified quickly and accurately.
Further effects of the above non-conventional alternatives will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of an anomaly identification method according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of the main steps of an abnormality recognition method according to another embodiment of the present invention;
FIG. 3a is a schematic diagram of a sample point distribution according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of a Huffman tree according to an embodiment of the present invention;
FIG. 4a is a second schematic diagram of a sample point distribution according to an embodiment of the present invention;
FIG. 4b is a second schematic diagram of a Huffman tree according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the main blocks of an anomaly identification apparatus according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of the main modules of an abnormality recognition apparatus according to another embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 8 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention provides a novel anomaly identification method and a novel anomaly identification device, which can be applied to various scenes, such as credit card fraud detection, network attack detection, order freight detection and the like.
Fig. 1 is a schematic diagram of main steps of an abnormality identification method according to an embodiment of the present invention. As shown in fig. 1, the method for identifying an abnormality according to the embodiment of the present invention includes:
and step S101, determining the distance between each sample point and other sample points in the target sample set.
In this step, the distance between each sample point and the other sample points may be determined from the feature data of the sample points. The feature data may comprise a plurality of feature fields, each of which may be continuous data, categorical data or vector data. For example, "test score" is continuous data, gender (male or female) is categorical data, and a feature field represented by a vector is vector data. Accordingly, the feature data of a sample point may include continuous data, categorical data and/or vector data.
For example, when the feature data of the sample points is continuous data, the distance between sample points may be calculated according to the Euclidean distance formula. Specifically, the Euclidean distance formula may be expressed as:

d(x, y) = sqrt( Σ_{k=1}^{m} (x_k − y_k)² )

where d(x, y) denotes the distance between sample point x and sample point y, x_k denotes the k-th feature field of sample point x, y_k denotes the k-th feature field of sample point y, and m denotes the total number of feature fields. In addition, when the feature data of the sample points is continuous data, the distance between sample points may also be calculated according to the Mahalanobis distance formula or the Manhattan distance formula.
For example, when the feature data of the sample points is categorical data, the distance between sample points may be user-defined. For example, for the gender field, if the genders of two sample points are the same, the distance may be defined as 0; if the genders of the two sample points are different, the distance may be defined as 1.
For example, when the feature data of the sample points is vector data, the distance between the sample points may be calculated according to a cosine similarity formula.
For example, when the feature data of the sample points is mixed data, the distances may first be calculated separately over the feature field sets of each category (such as the set of continuous feature fields, the set of categorical feature fields, and the set of vector feature fields), and the per-category distances may then be weighted and summed to obtain the distance between the sample points.
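A minimal sketch of the mixed-data case, assuming a two-way split into continuous and categorical feature fields with equal weights of 0.5 (the grouping, weights, and function name are illustrative assumptions, not the patent's):

```python
import math

def mixed_distance(x, y, cont_idx, cat_idx, w_cont=0.5, w_cat=0.5):
    # Euclidean distance over the continuous feature fields
    d_cont = math.sqrt(sum((x[k] - y[k]) ** 2 for k in cont_idx))
    # 0/1 mismatch distance over the categorical feature fields
    d_cat = sum(0 if x[k] == y[k] else 1 for k in cat_idx)
    # weighted sum of the per-category distances
    return w_cont * d_cont + w_cat * d_cat

# feature fields: [test score, height, gender]
a = [90.0, 1.70, "male"]
b = [70.0, 1.70, "female"]
print(mixed_distance(a, b, cont_idx=[0, 1], cat_idx=[2]))  # 10.5
```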
And S102, constructing a corresponding Huffman tree for each sample point according to the distance distribution condition of each sample point and other sample points to obtain a Huffman forest consisting of a plurality of Huffman trees.
For example, assuming that the target sample set has 99 sample points, 99 Huffman trees may be constructed according to step S102, resulting in a Huffman forest consisting of 99 Huffman trees. A Huffman tree, also called an optimal binary tree, is the binary tree with the shortest weighted path length.
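The patent's figures show such trees but no construction code; below is a standard Huffman-tree sketch using Python's heapq (function name and representation are assumptions), where each leaf would be a rounded distance value weighted by its sample-point count, and the returned code length of each symbol is its depth in the tree — heavier symbols get shorter codes:

```python
import heapq
import itertools

def huffman_code_lengths(weights):
    """weights: dict mapping a symbol (e.g. a rounded distance) to its
    sample-point count. Returns dict symbol -> code length (tree depth)."""
    tiebreak = itertools.count()  # avoids comparing dicts on weight ties
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# classic textbook weights: the heaviest symbol gets the shortest code
lengths = huffman_code_lengths({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45})
print(lengths["f"], lengths["a"])  # 1 4
```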
And S103, scoring each sample point according to the Huffman forest, and identifying abnormal sample points according to the scoring results.
For example, after the scoring result of each sample point is obtained, it may be compared with a first threshold, and sample points whose scoring result is greater than the first threshold may be regarded as abnormal sample points. The first threshold may be preset empirically, or may be set in real time according to the distribution of the scoring results of all sample points.
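One possible way to set the first threshold in real time from the distribution of scoring results (the mean-plus-three-standard-deviations rule below is an illustrative choice, not the patent's; the function name is hypothetical):

```python
import statistics

def flag_anomalies(scores, threshold=None):
    # if no empirical threshold is given, derive one from the score
    # distribution: mean + 3 * (population) standard deviation
    if threshold is None:
        threshold = statistics.mean(scores) + 3 * statistics.pstdev(scores)
    return [i for i, s in enumerate(scores) if s > threshold]

print(flag_anomalies([0.1, 0.2, 0.15, 0.9], threshold=0.5))  # [3]
```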
In the embodiment of the invention, a new anomaly identification method is realized through the above steps, and abnormal sample points can be identified quickly and accurately. Furthermore, since the anomaly identification method of the embodiment of the invention constructs Huffman trees based on distances, it can support anomaly identification on categorical data or mixed data, and it is suitable for processing high-dimensional data.
Fig. 2 is a schematic diagram of main steps of an abnormality identification method according to another embodiment of the present invention. As shown in fig. 2, the method for identifying an abnormality according to the embodiment of the present invention includes:
step S201, preprocessing the characteristic data of each sample point in the target sample set.
Illustratively, when the feature data of the sample points includes continuous data, preprocessing the continuous data includes normalizing the feature data of each sample point. In an alternative embodiment, the normalization may be performed according to the Z-Score (standard score) formula:

x'_k = (x_k − μ) / σ

where x'_k denotes the k-th feature field of sample point x after normalization, x_k denotes the k-th feature field of sample point x before normalization, μ denotes the mean of the k-th feature field over all sample points, and σ denotes the standard deviation of the k-th feature field over all sample points.
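The Z-Score step can be sketched per feature field as follows (the population standard deviation is assumed here; the patent does not specify sample versus population):

```python
import statistics

def z_score(column):
    # normalize one feature field (column) across all sample points
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation
    return [(v - mu) / sigma for v in column]

normalized = z_score([10.0, 20.0, 30.0])
print(normalized[1])  # 0.0 (the mean maps to zero)
```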
It should be noted that step S201 is an optional step. In another embodiment, the preprocessing step may not be performed, and the subsequent processing may be performed directly based on the feature data of the sample points.
Further, before step S201, the method of the embodiment of the present invention may further include the following steps: when the sample capacity of the obtained initial sample set is greater than a second threshold, performing stratified sampling on the initial sample set to obtain a plurality of target sample sets; and when the sample capacity of the initial sample set is less than or equal to the second threshold, taking the initial sample set as the target sample set. These steps effectively mitigate the computational cost that an excessive number of sample points would otherwise incur, improving identification efficiency; they also mitigate the drop in accuracy that clustered abnormal sample points could otherwise cause, improving identification accuracy.
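A minimal sketch of splitting an oversized initial sample set into several target sample sets (plain shuffling and chunking stand in for the patent's stratified sampling; the function name and seed are illustrative):

```python
import random

def split_into_target_sets(initial_samples, second_threshold, seed=0):
    # small enough: use the initial set directly as the single target set
    if len(initial_samples) <= second_threshold:
        return [initial_samples]
    # otherwise shuffle and partition into chunks of at most the threshold
    shuffled = list(initial_samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + second_threshold]
            for i in range(0, len(shuffled), second_threshold)]

sets = split_into_target_sets(list(range(250)), second_threshold=100)
print([len(s) for s in sets])  # [100, 100, 50]
```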
And S202, calculating the distance between each sample point and other sample points based on the preprocessed feature data.
For example, when the feature data of the sample points is continuous data, the distance between sample points may be calculated according to the Euclidean distance formula, the Mahalanobis distance formula, or the Manhattan distance formula.
For example, when the feature data of the sample points is categorical data, the distance between sample points may be user-defined. For example, for the gender field, if the genders of two sample points are the same, the distance may be defined as 0; if the genders of the two sample points are different, the distance may be defined as 1.
For example, when the feature data of the sample points is vector data, the distance between the sample points may be calculated according to a cosine similarity formula.
For example, when the feature data of the sample points is mixed data, the distances may first be calculated separately over the feature field sets of each category (such as the set of continuous feature fields, the set of categorical feature fields, and the set of vector feature fields), and the per-category distances may then be weighted and summed to obtain the distance between the sample points.
Step S203, normalization processing and divisor processing are performed on the distance between each sample point and the other sample points.
In this step, the distances of the sample points may be normalized according to the following formula:

d'_{i,j} = d_{i,j} / Σ_{j=1}^{N} d_{i,j}

where d'_{i,j} denotes the normalized distance between the i-th sample point and the j-th sample point, d_{i,j} denotes the distance between the i-th sample point and the j-th sample point before normalization, and Σ_{j=1}^{N} d_{i,j} denotes the sum of the distances between the i-th sample point and the N other sample points.
Then, the normalized distances are subjected to divisor (rounding) processing, i.e., only a preset number of significant digits is retained for each normalized distance. For example, if the number of sample points in the target sample set is less than 100, only two digits after the decimal point may be retained for each normalized distance; if the number of sample points in the target sample set is between 100 and 999, three digits after the decimal point may be retained.
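The normalization and divisor processing can be sketched together as follows; deriving the retained digit count from the decimal width of the sample count matches the two examples above (fewer than 100 points → 2 digits, 100 to 999 points → 3 digits) but is otherwise an assumption:

```python
def normalize_and_round(distances, n_samples):
    total = sum(distances)
    # digits kept after the decimal point: 2 for fewer than 100 points,
    # 3 for 100-999 points, and so on (assumed pattern)
    digits = max(2, len(str(n_samples)))
    return [round(d / total, digits) for d in distances]

print(normalize_and_round([1.0, 1.0, 2.0], n_samples=99))  # [0.25, 0.25, 0.5]
```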
And S204, counting the number of sample points corresponding to each distance after the divisor processing, and constructing a corresponding Huffman tree for each sample point according to the counting result so as to obtain a Huffman forest consisting of a plurality of Huffman trees.
For example, assuming that the target sample set has 99 sample points, 99 Huffman trees may be constructed according to step S204, resulting in a Huffman forest consisting of 99 Huffman trees.
Step S205, determining a first score value of the sample point to be scored according to the coding length of other sample points in the Huffman tree of the sample point to be scored.
In an alternative embodiment, the first score value of a sample point to be scored may be determined according to the following formula:

Score_{i_1} = Σ_{j=1}^{N} L_{j,i}

where Score_{i_1} denotes the first score value of the i-th sample point, i = 1, 2, …, N+1, N denotes the total number of sample points other than the i-th sample point, L_{j,i} denotes the coding length of the j-th other sample point in the Huffman tree of the sample point to be scored, and Σ_{j=1}^{N} L_{j,i} denotes the summation of L_{j,i} over j = 1, 2, …, N.
And step S206, determining a second scoring value of the sample point to be scored according to the coding length of the sample point to be scored in the Huffman trees of other sample points.
In an alternative embodiment, the second score value of a sample point to be scored may be determined according to the following formula:

Score_{i_2} = Σ_{j=1}^{N} L_{i,j}

where Score_{i_2} denotes the second score value of the i-th sample point, i = 1, 2, …, N+1, N denotes the total number of sample points other than the i-th sample point, L_{i,j} denotes the coding length of the i-th sample point in the Huffman tree of the j-th other sample point, and Σ_{j=1}^{N} L_{i,j} denotes the summation of L_{i,j} over j = 1, 2, …, N.
Step S207, determining the scoring result of the sample point to be scored according to the first score value and the second score value, and identifying abnormal sample points according to the scoring result.
In an alternative embodiment, the scoring result of a sample point to be scored may be determined according to the following formula:

Score_i = log(Score_{i_2} / Score_{i_1})

where Score_i denotes the scoring result of the i-th sample point, Score_{i_2} denotes the second score value of the i-th sample point, Score_{i_1} denotes the first score value of the i-th sample point, and log denotes the logarithm operator.
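Putting the score values together (the log-ratio combination below is inferred from the symbols and the log operator listed in the description, not a confirmed formula from the patent; the function name is hypothetical):

```python
import math

def anomaly_score(lengths_in_own_tree, lengths_in_other_trees):
    # Score_{i_1}: total coding length of the other points in point i's tree
    score_1 = sum(lengths_in_own_tree)
    # Score_{i_2}: total coding length of point i in the other points' trees
    score_2 = sum(lengths_in_other_trees)
    # assumed combination: a larger ratio means more likely abnormal
    return math.log(score_2 / score_1)

print(round(anomaly_score([2, 2, 2], [4, 4, 4]), 4))  # 0.6931
```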
The inventor has found through research that, in general, the depth (or path length) of the Huffman tree constructed for a normal sample point is larger, while the depth of the Huffman tree constructed for an abnormal sample point is smaller; conversely, the coding depth of a normal sample point in the Huffman trees of the other sample points is smaller, while the coding depth of an abnormal sample point in the Huffman trees of the other sample points is larger. In view of this, the larger the scoring result of a sample point, the more likely that sample point is abnormal.
After the scoring results of all the sample points are obtained, the scoring result of each sample point may be compared with a first threshold, and sample points whose scoring result is greater than the first threshold may be taken as abnormal sample points. The first threshold may be preset empirically, or may be set in real time according to the distribution of the scoring results of all sample points.
In the embodiment of the invention, abnormal sample points can be identified quickly and accurately through the above steps. Furthermore, since the anomaly identification method of the embodiment of the invention constructs Huffman trees based on distances, it can support anomaly identification on categorical data or mixed data, and it is suitable for processing high-dimensional data.
The Huffman trees constructed by the present invention will be described in detail with reference to Figs. 3a, 3b, 4a and 4b. In this example, the target sample set has 91 sample points, and the feature data of each sample point has two feature fields.
Fig. 3a is a schematic diagram of a sample point distribution according to an embodiment of the invention. In Fig. 3a, the abscissa of each sample point represents the value of one feature field, and the ordinate represents the value of the other feature field. Fig. 3b is a schematic diagram of a Huffman tree according to an embodiment of the present invention, namely the Huffman tree of the sample point a circled in Fig. 3a. In Fig. 3b, "1(0.00)" in a node indicates 1 sample point whose divisor-processed distance to sample point a is 0.00, and "8(0.02)" indicates 8 sample points whose divisor-processed distance to sample point a is 0.02.
Fig. 4a is a second schematic diagram of a sample point distribution according to an embodiment of the present invention. In Fig. 4a, the abscissa of each sample point represents the value of one feature field, and the ordinate represents the value of the other feature field. Fig. 4b is a second schematic diagram of a Huffman tree according to an embodiment of the present invention, namely the Huffman tree of the sample point b circled in Fig. 4a. In Fig. 4b, "16(0.00)" in a node indicates 16 sample points whose divisor-processed distance to sample point b is 0.00, and "1(0.03)" indicates 1 sample point whose divisor-processed distance to sample point b is 0.03.
Fig. 5 is a schematic diagram of main blocks of an abnormality recognition apparatus according to an embodiment of the present invention. As shown in fig. 5, an abnormality recognition apparatus 500 according to an embodiment of the present invention includes: a determination module 501, a construction module 502 and an identification module 503.
A determining module 501, configured to determine distances between each sample point in the target sample set and other sample points.
Specifically, the determining module 501 may determine the distance between each sample point and the other sample points according to the feature data of the sample points. The feature data may comprise a plurality of feature fields, each of which may be continuous data, categorical data or vector data. For example, "test score" is continuous data, gender (male or female) is categorical data, and a feature field represented by a vector is vector data. Accordingly, the feature data of a sample point may include continuous data, categorical data and/or vector data.
For example, when the feature data of the sample points is continuous data, the determining module 501 may calculate the distance between sample points according to the Euclidean distance formula. Specifically, the Euclidean distance formula may be expressed as:

d(x, y) = sqrt( Σ_{k=1}^{m} (x_k − y_k)² )

where d(x, y) denotes the distance between sample point x and sample point y, x_k denotes the k-th feature field of sample point x, y_k denotes the k-th feature field of sample point y, and m denotes the total number of feature fields. In addition, when the feature data of the sample points is continuous data, the distance between sample points may also be calculated according to the Mahalanobis distance formula or the Manhattan distance formula.
For example, when the feature data of the sample points is classification data, the determining module 501 may customize the distance between the sample points. For example, for the gender field, if the genders of two sample points are the same, the distance may be defined as 0; if the gender of the two sample points is different, the distance may be defined as 1.
For example, when the feature data of the sample points is vector data, the determining module 501 may calculate the distance between the sample points according to a cosine similarity formula.
For example, when the feature data of the sample points is mixed data, the determining module 501 may calculate a distance separately for each category of feature fields (such as the set of feature fields belonging to continuous data, the set belonging to classification data, and the set belonging to vector data), and then perform a weighted summation of the per-category distances to obtain the distance between the sample points.
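The per-category weighted combination described above can be sketched as follows; the 0/1 classification distance matches the gender example, the vector distance is taken as 1 − cosine similarity, and the field grouping and weights are illustrative assumptions:

```python
import math

def categorical_distance(a, b):
    # 0 if the categories match, 1 otherwise (as in the gender example)
    return 0.0 if a == b else 1.0

def cosine_distance(u, v):
    # 1 - cosine similarity, so identically oriented vectors have distance 0
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return 1.0 - dot / norm

def mixed_distance(x, y, weights=(0.5, 0.3, 0.2)):
    """x and y are dicts with 'continuous', 'categorical' and 'vector'
    feature-field groups; the grouping and weights are illustrative."""
    d_cont = math.sqrt(sum((a - b) ** 2
                           for a, b in zip(x["continuous"], y["continuous"])))
    d_cat = sum(categorical_distance(a, b)
                for a, b in zip(x["categorical"], y["categorical"]))
    d_vec = cosine_distance(x["vector"], y["vector"])
    w1, w2, w3 = weights
    return w1 * d_cont + w2 * d_cat + w3 * d_vec

x = {"continuous": [1.0], "categorical": ["male"], "vector": [1.0, 0.0]}
y = {"continuous": [4.0], "categorical": ["female"], "vector": [1.0, 0.0]}
print(mixed_distance(x, y))  # 0.5*3 + 0.3*1 + 0.2*0, i.e. about 1.8
```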
A constructing module 502, configured to construct a corresponding Huffman tree for each sample point according to the distance distribution between each sample point and the other sample points, so as to obtain a Huffman forest composed of multiple Huffman trees.
For example, assuming that the target sample set has 99 sample points, the construction module 502 may construct 99 Huffman trees, resulting in a Huffman forest consisting of 99 Huffman trees. A Huffman tree, also called an optimal binary tree, is the binary tree with the shortest weighted path length: leaves with larger weights sit closer to the root and therefore receive shorter codes.
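A minimal sketch of building one such Huffman tree with Python's `heapq`, using (count, distance) leaves like the 16(0.00) and 1(0.03) nodes of fig. 4b; the leaf values are illustrative:

```python
import heapq
import itertools

def build_huffman_tree(weighted_leaves):
    """weighted_leaves: list of (weight, label) pairs, e.g. the number of
    sample points observed at each divisor-processed distance. Returns the
    tree root as nested (left, right) tuples."""
    counter = itertools.count()  # tie-breaker so heapq never compares subtrees
    heap = [(w, next(counter), label) for w, label in weighted_leaves]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # merge the two lightest nodes,
        w2, _, t2 = heapq.heappop(heap)   # as in standard Huffman coding
        heapq.heappush(heap, (w1 + w2, next(counter), (t1, t2)))
    return heap[0][2]

def code_lengths(tree, depth=0):
    """Map each leaf label to its code length (its depth in the tree)."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    left, right = tree
    out = code_lengths(left, depth + 1)
    out.update(code_lengths(right, depth + 1))
    return out

# Leaves as (number of sample points, divisor-processed distance)
tree = build_huffman_tree([(16, 0.00), (1, 0.03), (3, 0.05)])
print(code_lengths(tree))  # rarer distances receive longer codes
```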
And the identification module 503 is configured to score each sample point according to the huffman forest, and identify an abnormal sample point according to a scoring result.
For example, after the identification module 503 obtains the scoring result of each sample point, it may compare the scoring result of each sample point with a first threshold, and take the sample points whose scoring results are greater than the first threshold as abnormal sample points. The first threshold may be preset according to experience, or set in real time according to the distribution of the scoring results of all the sample points.
In the embodiment of the invention, abnormal sample points can be quickly and accurately identified by the device. Further, because the anomaly identification device of the embodiment of the invention constructs Huffman trees based on distances, it can support anomaly identification on classification data or mixed data and is suitable for processing high-dimensional data. In addition, the device of the embodiment of the invention is sensitive to both global and local sparse points.
Fig. 6 is a schematic diagram of main blocks of an abnormality recognition apparatus according to another embodiment of the present invention. As shown in fig. 6, an abnormality recognition apparatus 600 according to an embodiment of the present invention includes: a sampling module 601, a determination module 602, a construction module 603, and an identification module 604.
The sampling module 601 is configured to perform hierarchical sampling on the initial sample set to obtain a plurality of target sample sets when the sample capacity of the obtained initial sample set is greater than a second threshold. The sampling module 601 is further configured to take the initial sample set as a target sample set when the sample capacity of the initial sample set is less than or equal to a second threshold.
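The capacity check and sampling step above can be sketched as follows; plain random subsets stand in for the patent's hierarchical sampling, and the threshold and subset sizes are illustrative assumptions:

```python
import random

def make_target_sample_sets(initial_samples, second_threshold, num_sets, set_size, seed=0):
    """If the initial sample set is small enough, use it directly as the
    single target sample set; otherwise draw several subsets from it.
    Plain random sampling stands in for the patent's hierarchical sampling."""
    if len(initial_samples) <= second_threshold:
        return [list(initial_samples)]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [rng.sample(initial_samples, set_size) for _ in range(num_sets)]

sets = make_target_sample_sets(list(range(1000)), second_threshold=256,
                               num_sets=4, set_size=256)
print(len(sets), len(sets[0]))  # 4 256
```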
In the embodiment of the invention, providing the sampling module effectively solves the time-consumption problem caused by an excessive number of sample points, improving anomaly identification efficiency; it also mitigates the drop in identification accuracy that can occur when abnormal sample points cluster together, improving anomaly identification accuracy.
A determining module 602, configured to determine distances between each sample point in the target sample set and other sample points. In particular, the determining module 602 may determine the distance of each sample point from other sample points based on the feature data of the sample points.
For example, when the feature data of the sample points is continuous data, the distance between the sample points may be calculated according to the Euclidean distance formula, the Mahalanobis distance formula, or the Manhattan distance formula.
For example, when the feature data of the sample points is classification data, the distance between the sample points may be customized. For example, for the gender field, if the genders of two sample points are the same, the distance may be defined as 0; if the gender of the two sample points is different, the distance may be defined as 1.
For example, when the feature data of the sample points is vector data, the distance between the sample points may be calculated according to a cosine similarity formula.
For example, when the feature data of the sample points is mixed data, a distance may be calculated separately for each category of feature fields (such as the set of feature fields belonging to continuous data, the set belonging to classification data, and the set belonging to vector data), and the per-category distances are then weighted and summed to obtain the distance between the sample points.
Further, before the determining module 602 determines the distance between each sample point in the target sample set and other sample points, the feature data of each sample point in the target sample set may be preprocessed.
Illustratively, when the feature data of the sample points includes continuous data, preprocessing the continuous data includes normalizing the feature data of each sample point. In an alternative embodiment, the normalization may be performed according to the Z-score (standardization) formula:

x'_k = (x_k − μ) / σ

where x'_k represents the k-th feature field of sample point x after normalization, x_k represents the k-th feature field of sample point x before normalization, μ represents the mean of the k-th feature fields of all the sample points, and σ represents the standard deviation of the k-th feature fields of all the sample points.
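A minimal sketch of this Z-score normalization for one feature field (the population standard deviation and the function name are illustrative assumptions):

```python
import statistics

def z_score_normalize(column):
    """Normalize one feature field across all sample points:
    x'_k = (x_k - mu) / sigma."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation
    return [(x - mu) / sigma for x in column]

print(z_score_normalize([2.0, 4.0, 6.0]))  # about [-1.22, 0.0, 1.22]
```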
The constructing module 603 is configured to construct a corresponding huffman tree for each sample point according to a distribution of distances between each sample point and other sample points, so as to obtain a huffman forest composed of multiple huffman trees.
In an alternative embodiment, the constructing module 603 constructing a corresponding Huffman tree for each sample point according to the distance distribution may include: the construction module 603 performs normalization processing and divisor processing on the distances between each sample point and the other sample points, then counts the number of sample points corresponding to each distance after divisor processing, and constructs a corresponding Huffman tree for each sample point according to the statistical result.
In the above alternative embodiment, the construction module 603 may normalize the distances of the sample points according to the following formula:

d'_{i,j} = d_{i,j} / Σ_{j=1..N} d_{i,j}

where d'_{i,j} represents the distance between the i-th sample point and the j-th sample point after normalization, d_{i,j} represents the distance between the i-th sample point and the j-th sample point before normalization, and Σ_{j=1..N} d_{i,j} represents the sum of the distances from the i-th sample point to the N other sample points.
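A sketch of the normalization followed by divisor processing; the patent does not fix the divisor, so the 0.01 grid step here is an illustrative assumption:

```python
def normalize_and_quantize(distances, step=0.01):
    """Normalize the distances from one sample point to the N others so
    they sum to 1, then apply divisor processing by snapping each value
    to a grid of width `step` (the step size is an illustrative assumption)."""
    total = sum(distances)
    normalized = [d / total for d in distances]
    return [round(d / step) * step for d in normalized]

print(normalize_and_quantize([1.0, 1.0, 2.0]))  # about [0.25, 0.25, 0.5]
```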
And the identification module 604 is configured to score each sample point according to the huffman forest, and identify an abnormal sample point according to a scoring result.
In an alternative embodiment, the scoring each sample point according to the huffman forest by the identification module 604 may comprise: the identification module 604 determines a first score value of the sample point to be scored according to the coding length of other sample points in the huffman tree of the sample point to be scored; the identification module 604 determines a second scoring value of the sample point to be scored according to the coding length of the sample point to be scored in the huffman trees of other sample points; the identification module 604 determines the scoring result of the sample point to be scored according to the first scoring value and the second scoring value.
Further, the identification module 604 may determine the first score value, the second score value, and the scoring result according to the following formulas:
where Score_{i_1} represents the first score value of the i-th sample point, i = 1, 2, …, N+1, N represents the total number of sample points other than the i-th sample point, L_{j,i} represents the coding length of the j-th other sample point in the Huffman tree of the sample point to be scored, Score_{i_2} represents the second score value of the i-th sample point, L_{i,j} represents the coding length of the i-th sample point in the Huffman tree of the j-th other sample point, and Score_i represents the scoring result of the i-th sample point.
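The two-sided scoring can be sketched as below. The exact formulas appear only as images in the source, so the aggregation used here (averaging the coding lengths and summing the two score values) is an assumption for illustration:

```python
def score_sample_point(i, code_length):
    """code_length[a][b] = coding length of sample point b in the Huffman
    tree built for sample point a. Averaging and summing below are
    illustrative assumptions, not the patent's exact formulas."""
    others = [j for j in code_length if j != i]
    n = len(others)
    # First score value: coding lengths of the other points in point i's tree.
    score_1 = sum(code_length[i][j] for j in others) / n
    # Second score value: coding lengths of point i in the other points' trees.
    score_2 = sum(code_length[j][i] for j in others) / n
    return score_1 + score_2

def find_anomalies(code_length, first_threshold):
    """Flag sample points whose scoring result exceeds the first threshold."""
    return [i for i in code_length
            if score_sample_point(i, code_length) > first_threshold]

# Toy coding lengths: point 0 is rare in the other points' trees (long codes),
# which is the signature of an abnormal sample point.
code_length = {
    0: {1: 1, 2: 1},
    1: {0: 3, 2: 1},
    2: {0: 3, 1: 1},
}
print(find_anomalies(code_length, first_threshold=3.5))  # [0]
```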
After obtaining the scoring results of all the sample points, the identification module 604 may compare the scoring result of each sample point with the first threshold, and determine the sample points whose scoring results are greater than the first threshold as abnormal sample points. The first threshold may be preset according to experience, or set in real time according to the distribution of the scoring results of all the sample points.
In the embodiment of the invention, abnormal sample points can be quickly and accurately identified by the device. Further, because the anomaly identification device of the embodiment constructs Huffman trees based on distances, it can support anomaly identification on classification data or mixed data and is suitable for processing high-dimensional data. In addition, the anomaly identification device of the embodiment is sensitive to both global and local sparse points.
Fig. 7 shows an exemplary system architecture 700 to which the anomaly identification method or the anomaly identification apparatus of the embodiments of the present invention can be applied.
As shown in fig. 7, the system architecture 700 may include a client 701, a network 702, and a server 703. The network 702 is used to provide a medium for communication links between the clients 701 and the servers 703. The network 702 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The client 701 may be a terminal device, or may be a server running one or more applications to provide an anomaly identification service to a user.
The server 703 may be a server that provides various services, such as a background management server that supports user requests for anomaly identification issued by the client 701. The background management server can analyze and process the received data such as the abnormal recognition request and feed back the processing result to the client.
It should be noted that the anomaly identification method provided by the embodiment of the present invention is generally executed by the server 703, and accordingly, the anomaly identification apparatus is generally disposed in the server 703.
It should be understood that the number of clients, networks, and servers in fig. 7 is merely illustrative. There may be any number of clients, networks, and servers, as desired for implementation.
FIG. 8 illustrates a schematic block diagram of a computer system 800 suitable for use in implementing an electronic device of an embodiment of the invention. The computer system illustrated in FIG. 8 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a determination module, a construction module, and an identification module. Where the names of these modules do not in some cases constitute a limitation of the module itself, for example, a determination module may also be described as a "module that determines the distance of each sample point from other sample points in the target sample set".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: determine the distance between each sample point in the target sample set and other sample points; construct a corresponding Huffman tree for each sample point according to the distance distribution of each sample point and the other sample points, to obtain a Huffman forest consisting of a plurality of Huffman trees; and score each sample point according to the Huffman forest, and identify abnormal sample points according to a scoring result.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (14)
1. An anomaly identification method, characterized in that the method comprises:
determining the distance between each sample point in the target sample set and other sample points;
constructing a corresponding Huffman tree for each sample point according to the distance distribution condition of each sample point and other sample points to obtain a Huffman forest consisting of a plurality of Huffman trees;
and scoring each sample point according to the Huffman forest, and identifying abnormal sample points according to a scoring result.
2. The method of claim 1, wherein the step of identifying abnormal sample points based on the scoring comprises:
and comparing the scoring result with a first threshold value, and taking the sample points with the scoring results larger than the first threshold value as abnormal sample points.
3. The method of claim 1, wherein the step of scoring each sample point according to the huffman forest comprises:
determining a first score value of the sample point to be scored according to the coding length of other sample points in a Huffman tree of the sample point to be scored; determining a second score value of the sample point to be scored according to the coding length of the sample point to be scored in the Huffman trees of other sample points; and determining the scoring result of the sample point to be scored according to the first scoring value and the second scoring value.
4. The method of claim 3, wherein the first score value, the second score value, and the scoring result are determined according to the following formulas;
where Score_{i_1} represents the first score value of the i-th sample point, i = 1, 2, …, N+1, N represents the total number of sample points other than the i-th sample point, L_{j,i} represents the coding length of the j-th other sample point in the Huffman tree of the sample point to be scored, Score_{i_2} represents the second score value of the i-th sample point, L_{i,j} represents the coding length of the i-th sample point in the Huffman tree of the j-th other sample point, and Score_i represents the scoring result of the i-th sample point.
5. The method according to claim 1, wherein the step of constructing a corresponding huffman tree for each sample point according to the distance distribution of each sample point and other sample points comprises: and carrying out normalization processing and divisor processing on the distance between each sample point and other sample points, then counting the number of sample points corresponding to each distance after the divisor processing, and constructing a corresponding Huffman tree for each sample point according to the counting result.
6. The method of any of claims 1 to 5, further comprising:
under the condition that the sample capacity of the obtained initial sample set is larger than a second threshold value, performing hierarchical sampling on the initial sample set to obtain a plurality of target sample sets; and taking the initial sample set as the target sample set when the sample capacity of the initial sample set is less than or equal to a second threshold value.
7. An abnormality recognition apparatus, characterized in that the apparatus comprises:
the determining module is used for determining the distance between each sample point in the target sample set and other sample points;
the construction module is used for constructing a corresponding Huffman tree for each sample point according to the distance distribution condition of each sample point and other sample points so as to obtain a Huffman forest consisting of a plurality of Huffman trees;
and the identification module is used for scoring each sample point according to the Huffman forest and identifying abnormal sample points according to a scoring result.
8. The apparatus of claim 7, wherein the identification module identifying abnormal sample points according to the scoring result comprises:
the identification module compares the scoring result with a first threshold value, and takes the sample points with the scoring results larger than the first threshold value as abnormal sample points.
9. The apparatus of claim 7, wherein the identification module scoring each sample point according to the Huffman forest comprises:
the identification module determines a first score value of the sample point to be scored according to the coding length of other sample points in a Huffman tree of the sample point to be scored; the identification module determines a second scoring value of the sample point to be scored according to the coding length of the sample point to be scored in the Huffman trees of other sample points; and the identification module determines the grading result of the sample point to be graded according to the first grading value and the second grading value.
10. The apparatus of claim 9, wherein the identification module determines the first score value, the second score value, and the scoring result according to the following formula;
where Score_{i_1} represents the first score value of the i-th sample point, i = 1, 2, …, N+1, N represents the total number of sample points other than the i-th sample point, L_{j,i} represents the coding length of the j-th other sample point in the Huffman tree of the sample point to be scored, Score_{i_2} represents the second score value of the i-th sample point, L_{i,j} represents the coding length of the i-th sample point in the Huffman tree of the j-th other sample point, and Score_i represents the scoring result of the i-th sample point.
11. The apparatus of claim 7, wherein the constructing module constructs a Huffman tree for each sample point according to a distribution of distances between each sample point and other sample points, including: the construction module performs normalization processing and divisor processing on the distance between each sample point and other sample points, then counts the number of sample points corresponding to each distance after divisor processing, and constructs a corresponding Huffman tree for each sample point according to the statistical result.
12. The apparatus of any of claims 7 to 11, further comprising:
the sampling module is used for performing hierarchical sampling on the initial sample set to obtain a plurality of target sample sets when the sample capacity of the obtained initial sample set is greater than a second threshold; and the sampling module is further configured to take the initial sample set as the target sample set when the sample capacity of the initial sample set is less than or equal to the second threshold.
13. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811188129.4A CN111046892A (en) | 2018-10-12 | 2018-10-12 | Abnormity identification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111046892A true CN111046892A (en) | 2020-04-21 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170199902A1 (en) * | 2016-01-07 | 2017-07-13 | Amazon Technologies, Inc. | Outlier detection for streaming data |
CN107292350A (en) * | 2017-08-04 | 2017-10-24 | 电子科技大学 | The method for detecting abnormality of large-scale data |
CN107315647A (en) * | 2017-06-26 | 2017-11-03 | 广州视源电子科技股份有限公司 | Outlier detection method and system |
CN107357844A (en) * | 2017-06-26 | 2017-11-17 | 广州视源电子科技股份有限公司 | Outlier detection method and device |
CN107657288A (en) * | 2017-10-26 | 2018-02-02 | 国网冀北电力有限公司 | A kind of power scheduling flow data method for detecting abnormality based on isolated forest algorithm |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111897695A (en) * | 2020-07-31 | 2020-11-06 | 平安科技(深圳)有限公司 | Method and device for acquiring KPI abnormal data sample and computer equipment |
CN111897695B (en) * | 2020-07-31 | 2022-06-17 | 平安科技(深圳)有限公司 | Method and device for acquiring KPI abnormal data sample and computer equipment |
CN114706794A (en) * | 2022-06-06 | 2022-07-05 | 航天亮丽电气有限责任公司 | Data processing system for production management software |
CN114706794B (en) * | 2022-06-06 | 2022-08-30 | 航天亮丽电气有限责任公司 | Data processing system for production management software |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |