CN114417962A

CN114417962A - Abnormal data detection method, system, device and medium based on K nearest neighbor algorithm

Info

Publication number: CN114417962A
Application number: CN202111491483.6A
Authority: CN
Inventors: 韩雅安; 张文宏; 于岗
Original assignee: Aerospace Science And Technology Network Information Development Co ltd
Current assignee: Aerospace Science And Technology Network Information Development Co ltd
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2022-04-29

Abstract

The invention discloses an abnormal data detection method, system, device and medium based on a K-nearest neighbor algorithm. It relates to the technical field of abnormal data detection and aims to solve the poor reusability of existing detection methods. The detection method comprises: calculating the similarity between unclassified data and each classified data in a classified data set based on a comprehensive verification rule; determining the K neighbors of the unclassified data according to the similarity; determining the category of the unclassified data according to the categories of the K neighbors; and, in this way, determining the category of every data item in the original data set to be detected, so that all abnormal data are identified and abnormal data detection is realized. The method can be applied to abnormal data detection in all scenarios and has strong universality and reusability.

Description

Abnormal data detection method, system, device and medium based on K nearest neighbor algorithm

Technical Field

The invention relates to the technical field of abnormal data detection, in particular to a method, a system, a device and a medium for detecting abnormal data based on a K-nearest neighbor algorithm.

Background

With the advent of the data age, demand across industries for managing enterprise data with information technology grows by the day. Having built business systems during an early exploratory phase, many enterprises increasingly recognize that informatization is, in essence, the construction and management of enterprise data. Data quality management, abnormal data detection and data standardization are central to enterprise data governance. Without sound data quality management methods and technologies, it is difficult to use data to drive production.

Existing data quality detection methods require specific, complex check rules to be formulated for abnormal data detection. Such rules are highly targeted and poorly reusable, and they cannot exhaust all abnormal conditions. A general abnormal data detection method is therefore needed.

Disclosure of Invention

The invention aims to provide an abnormal data detection method, system, device and medium based on a K-nearest neighbor algorithm, which are used for determining abnormal data and have strong universality.

In order to achieve the above purpose, the invention provides the following technical scheme:

an abnormal data detection method based on a K-nearest neighbor algorithm comprises the following steps:

determining normal data and abnormal data in the original data set to be detected based on a preset standard; the normal data and the abnormal data constitute a classified data set;

randomly selecting an unclassified data from the original data set to be detected, and calculating the similarity between the unclassified data and each classified data in the classified data set based on a comprehensive verification rule;

sorting according to the sequence of the similarity from big to small, and selecting classified data corresponding to the first K similarities as K neighbors of the unclassified data;

determining the classification of the unclassified data according to the classification of the K neighbors to obtain classified data, and putting the classified data into the classified data set; the categories include normal data and abnormal data;

judging whether unclassified data exists in the original data set to be detected;

if so, returning to the step of randomly selecting one unclassified data in the original data set to be detected until each unclassified data in the data set to be detected is detected;

if not, determining abnormal data according to the classified data set.

Compared with the prior art, the abnormal data detection method based on the K-nearest neighbor algorithm calculates the similarity between the unclassified data and each classified data in the classified data set based on the comprehensive check rule, determines the K neighbors of the unclassified data according to the similarity, determines the category of the unclassified data according to the category of the K neighbors, further determines the category of each data in the original data set to be detected based on the similarity, so as to determine all abnormal data and realize abnormal data detection. The method used by the invention can be suitable for abnormal data detection in all scenes and has strong universality. In addition, the used check rule is simple, a specific and complex check rule does not need to be formulated for abnormal data detection, and the detection efficiency is high.

An abnormal data detection system based on a K-nearest neighbor algorithm, the detection system comprising:

the classified data set determining module is used for determining normal data and abnormal data in the original data set to be detected based on a preset standard; the normal data and the abnormal data constitute a classified data set;

the similarity calculation module is used for randomly selecting an unclassified data in the original data set to be detected and calculating the similarity between the unclassified data and each classified data in the classified data set based on a comprehensive verification rule;

the neighbor determining module is used for sequencing according to the sequence of the similarity from large to small, and selecting classified data corresponding to the first K similarities as K neighbors of the unclassified data;

the category determining module is used for determining the category of the unclassified data according to the categories of the K neighbors to become classified data and putting the classified data into the classified data set; the categories include normal data and abnormal data;

the judging module is used for judging whether the original data set to be detected has unclassified data or not;

a returning module, configured to return to the step of "randomly selecting an unclassified data in the original data set to be detected" if yes, until each unclassified data in the data set to be detected has been detected;

and the abnormal data determining module is used for determining abnormal data according to the classified data set if not, that is, when no unclassified data remains.

An abnormal data detection device based on a K-nearest neighbor algorithm comprises:

a processor; and

a memory having computer-readable program instructions stored therein,

wherein the detection method described above is performed when the computer readable program instructions are executed by the processor.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned detection method.

Compared with the prior art, the beneficial effects of the detection system, the detection device and the detection medium provided by the invention are the same as the beneficial effects of the detection method in the technical scheme, and the detailed description is omitted here.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

fig. 1 is a schematic flow chart of a detection method provided in embodiment 1 of the present invention;

fig. 2 is a schematic diagram of a K-nearest neighbor algorithm provided in embodiment 1 of the present invention;

fig. 3 is a system block diagram of a detection system provided in embodiment 2 of the present invention.

Detailed Description

For the convenience of clearly describing technical solutions of the embodiments of the present invention, in the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.

Example 1:

Many enterprises are following this trend and developing their own data quality software products. For example, the data quality software product of Neusoft Group provides a rich set of verification models, including nearly 100 kinds of field-level verification, 60 kinds of table-level verification and 40 kinds of inter-table rule verification, together with auxiliary functions such as automatic storage of error data. As another example, the data quality management system developed in the state of China improves the accuracy, timeliness and integrity of data by establishing a uniform and standard data quality operation mechanism, a comprehensive abnormal data monitoring mechanism, a process-management-based data monitoring mechanism and a complete data quality assessment mechanism. However, even abundant check rules cannot completely identify abnormal data, because real abnormal data is always more complicated than the rules anticipate, and adding further rules cannot exhaust every kind of abnormal data. Moreover, relying on many verification rules does not yield good data quality: configuring several rules for each field is tedious work, and verification becomes inaccurate as soon as a rule is left out.

This embodiment provides a general abnormal data detection method that solves the abnormal data detection problem with the K-nearest neighbor algorithm. The method has strong universality, uses a simple check rule, and can improve data quality.

The embodiment is used to provide an abnormal data detection method based on a K-nearest neighbor algorithm, as shown in fig. 1, the detection method includes:

s1: determining normal data and abnormal data in the original data set to be detected based on a preset standard; the normal data and the abnormal data constitute a classified data set;

First, the original data set to be detected, whose data quality is to be analyzed, is loaded into a database for storage. Based on a preset standard, standard data and non-standard data are selected from the original data set to be detected, and their categories are set to normal data and abnormal data respectively. In other words, one normal data and one abnormal data are determined from the original data set to be detected, with similarities of 1 and 0 respectively, and together they form the classified data set. Data in the original data set whose category is already known is treated as classified data and stored in the classified data set.

S2: randomly selecting an unclassified data from the original data set to be detected, and calculating the similarity between the unclassified data and each classified data in the classified data set based on a comprehensive verification rule;

That is, one data item that has not yet been classified is randomly selected from the original data set to be detected and used as the unclassified data.

Wherein calculating the similarity between the unclassified data and each classified data in the classified data set based on the comprehensive verification rule may include:

1) for each classified data in the classified data set, calculating the character length matching degree, the special character matching degree and the data type matching degree between the unclassified data and the classified data;

2) carrying out a weighted summation of the character length matching degree, the special character matching degree and the data type matching degree to obtain the similarity between the unclassified data and the classified data. The greater the similarity, the closer the distance between the two.

Specifically, similarity = character length matching degree × A + special character matching degree × B + data type matching degree × C;

wherein A + B + C = 1; A, B and C are the weights of the character length matching degree, the special character matching degree and the data type matching degree respectively, and their specific values are tuned according to the experimental accuracy.
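The weighted summation can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `weighted_similarity` is hypothetical, and the default weights (0.2, 0.4, 0.4) are simply the values that appear in the worked experiment later in this embodiment.

```python
def weighted_similarity(length_match, special_match, type_match,
                        weights=(0.2, 0.4, 0.4)):
    """Weighted sum of the three matching degrees.

    weights = (A, B, C). The defaults are taken from the worked experiment
    in this embodiment; the method itself only requires A + B + C = 1.
    """
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights A, B and C must sum to 1")
    return length_match * a + special_match * b + type_match * c


# Length match 0.5, special character match 1, data type match 0
print(weighted_similarity(0.5, 1, 0))  # 0.5
```

Keeping the weights as a parameter reflects the statement that A, B and C are tuned according to experimental accuracy.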

Character length matching degree: equal to the length of the shorter string divided by the length of the longer string, so the value is ≤ 1.

Special character matching degree: characters other than English letters, numbers, Chinese characters and punctuation marks are treated as special characters (such as !, @, #, % and so on). If only one of the unclassified data and the classified data contains a special character, the special character matching degree is 0; in all other cases it is 1.

Data type matching degree: if the unclassified data and the classified data have the same data type (Chinese characters, numbers or English letters), the data type matching degree is 1; otherwise it is 0.

More specifically, calculating the character length matching degree between the unclassified data and the classified data includes: a first character length of the unclassified data and a second character length of the classified data are calculated, respectively. Judging whether the first character length is smaller than the second character length; if so, taking the ratio of the first character length to the second character length as the character length matching degree between the unclassified data and the classified data, namely taking the quotient of the first character length divided by the second character length as the character length matching degree; if not, the ratio of the second character length to the first character length is used as the character length matching degree between the unclassified data and the classified data, namely, the quotient of the second character length divided by the first character length is used as the character length matching degree.
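A minimal Python sketch of the character length matching degree follows. The function name `length_match` is hypothetical, and returning 1.0 for two empty values is an assumption, since the text does not cover that case.

```python
def length_match(unclassified: str, classified: str) -> float:
    """Ratio of the shorter character length to the longer one (always <= 1)."""
    first_len, second_len = len(unclassified), len(classified)
    longer = max(first_len, second_len)
    if longer == 0:
        # Assumption: two empty values are treated as a perfect match.
        return 1.0
    return min(first_len, second_len) / longer


print(length_match("abc", "abcdef"))  # 0.5
```

Using `min` and `max` gives the same result as the explicit comparison of the first and second character lengths described above, and makes the measure symmetric in its two arguments.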

Calculating the degree of special character matching between the unclassified data and the classified data includes: judging whether special characters exist in the unclassified data and the classified data or not; the special characters are characters except English letters, numbers, Chinese characters and punctuation marks; if the unclassified data and the classified data both have special characters, or the unclassified data and the classified data do not have special characters, the matching degree of the special characters between the unclassified data and the classified data is 1; otherwise, the special character matching degree between the unclassified data and the classified data is 0.
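The special character matching degree might be sketched as below. The text does not list which punctuation marks count as ordinary, so the set `ORDINARY_PUNCTUATION` is a hypothetical choice made only for illustration; treating whitespace as non-special and limiting Chinese characters to the range U+4E00 to U+9FFF are likewise assumptions.

```python
# Hypothetical set of ordinary punctuation; the source does not enumerate it.
ORDINARY_PUNCTUATION = set(",.;:?'\"()-，。；：？、“”‘’（）")


def is_special(ch: str) -> bool:
    """True if ch is not a letter, digit, Chinese character or ordinary punctuation."""
    if ch.isascii() and ch.isalnum():
        return False  # English letter or digit
    if "\u4e00" <= ch <= "\u9fff":
        return False  # Chinese character (assumed CJK Unified Ideographs block)
    if ch in ORDINARY_PUNCTUATION or ch.isspace():
        return False  # assumption: whitespace is not special
    return True


def special_match(unclassified: str, classified: str) -> float:
    """1.0 if both values contain special characters or neither does, else 0.0."""
    has_first = any(is_special(ch) for ch in unclassified)
    has_second = any(is_special(ch) for ch in classified)
    return 1.0 if has_first == has_second else 0.0


print(special_match("a@b", "abc"))  # 0.0
print(special_match("a@b", "#1"))   # 1.0
```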

Calculating the data type matching degree between the unclassified data and the classified data comprises: judging whether the data types of the unclassified data and the classified data are the same; the data types comprise Chinese characters, numbers and English letters; if yes, the data type matching degree between the unclassified data and the classified data is 1; otherwise, the data type matching degree between the unclassified data and the classified data is 0.
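One way to sketch the data type matching degree in Python is shown below. The text does not say how a value that mixes several kinds of characters is typed, so this sketch assumes that a value has a data type only when all of its characters are of one kind; the names `data_type` and `type_match` and the label "other" are hypothetical.

```python
def data_type(text: str) -> str:
    """Return "chinese", "number" or "english"; "other" for empty or mixed values."""
    if not text:
        return "other"
    if all("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "chinese"
    if all(ch.isascii() and ch.isdigit() for ch in text):
        return "number"
    if all(ch.isascii() and ch.isalpha() for ch in text):
        return "english"
    return "other"  # assumption: mixed content has no single data type


def type_match(unclassified: str, classified: str) -> float:
    """1.0 if both values share one of the three data types, else 0.0."""
    first_type = data_type(unclassified)
    if first_type != "other" and first_type == data_type(classified):
        return 1.0
    return 0.0


print(type_match("281927", "张三"))  # 0.0
print(type_match("abc", "XYZ"))      # 1.0
```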

The comprehensive verification rule provided by this embodiment obtains the similarity between the unclassified data and the classified data by calculating the character length matching degree, the special character matching degree and the data type matching degree. The rule is simple and, compared with existing complex check rules, can markedly improve abnormal data detection efficiency.

S3: sorting according to the sequence of the similarity from big to small, and selecting classified data corresponding to the first K similarities as K neighbors of the unclassified data;

The K value is initialized to 2 and kept fixed. Experiments show that if K instead covers all classified samples, the classification results of earlier samples tend to have a large influence on the classification results of later samples, so K takes a fixed value in this embodiment.
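Step S3 can be sketched as a sort followed by a slice. The function name `k_nearest` and the (item, similarity) pair layout are assumptions made for illustration.

```python
def k_nearest(similarities, k=2):
    """Return the k (item, similarity) pairs with the largest similarity.

    similarities: list of (classified_item, similarity) pairs.
    Python's sort is stable, so items with equal similarity keep their
    original relative order.
    """
    ranked = sorted(similarities, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


print(k_nearest([("a", 0.5), ("b", 0.9), ("c", 0.7)], k=2))
# [('b', 0.9), ('c', 0.7)]
```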

S4: determining the classification of the unclassified data according to the classification of the K neighbors to obtain classified data, and putting the classified data into the classified data set; the categories include normal data and abnormal data;

wherein determining the category of the unclassified data according to the categories of the K neighbors comprises: taking the category shared by the majority of the K neighbors as the category of the unclassified data; that is, if more than K/2 neighbors have the same category, that category is the category of the unclassified data. For example, when K is 5 and 3 neighbors are normal data, the unclassified data is classified as normal data. When K/2 of the K neighbors are normal data and the other K/2 are abnormal data, the unclassified data takes the category of the neighbor with the highest similarity among the K neighbors. For example, when K is 6, 3 neighbors are normal data, 3 neighbors are abnormal data and the neighbor with the highest similarity is normal data, the unclassified data is classified as normal data.
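The majority vote with its tie-breaking rule can be sketched as follows. The function name `vote`, the string labels and the (label, similarity) pair layout are assumptions; the similarity values in the usage lines are made up for illustration.

```python
def vote(neighbors):
    """Decide a category from a list of (label, similarity) pairs.

    label is "normal" or "abnormal". The majority wins; on an even split
    the label of the neighbor with the highest similarity is used.
    """
    if not neighbors:
        raise ValueError("at least one neighbor is required")
    normal = sum(1 for label, _ in neighbors if label == "normal")
    abnormal = len(neighbors) - normal
    if normal > abnormal:
        return "normal"
    if abnormal > normal:
        return "abnormal"
    # Tie: fall back to the most similar neighbor.
    return max(neighbors, key=lambda pair: pair[1])[0]


# K = 5 with 3 normal neighbors
print(vote([("normal", 0.9), ("abnormal", 0.8), ("normal", 0.7),
            ("abnormal", 0.6), ("normal", 0.5)]))  # normal
# K = 6 with a 3/3 split; the most similar neighbor is normal
print(vote([("normal", 0.9), ("abnormal", 0.8), ("normal", 0.7),
            ("abnormal", 0.6), ("normal", 0.5), ("abnormal", 0.4)]))  # normal
```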

It should be noted that, after the classification of the unclassified data is determined, the unclassified data becomes classified data, and the classified data is added to the classified data set. When the category of the next unclassified data is determined, the new classified data set is taken as the classified data set.

S5: judging whether unclassified data exists in the original data set to be detected;

s6: if so, returning to the step of randomly selecting one unclassified data in the original data set to be detected until each unclassified data in the data set to be detected is detected;

s7: if not, determining abnormal data according to the classified data set.
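Steps S1 to S7 can be put together in a short loop. The sketch below is one possible reading under stated assumptions: the function name `detect_abnormal`, its parameters and the string labels are hypothetical, and the similarity function is passed in so that any implementation of the comprehensive verification rule can be plugged in.

```python
import random


def detect_abnormal(seed, unclassified, similarity, k=2, rng=None):
    """Classify every pending item and return the abnormal ones.

    seed: list of (item, label) pairs forming the initial classified set (S1).
    unclassified: items whose category is still unknown.
    similarity: function (item_a, item_b) -> float.
    """
    rng = rng or random.Random()
    classified = list(seed)
    pending = list(unclassified)
    while pending:  # S5/S6: repeat while unclassified data remains
        item = pending.pop(rng.randrange(len(pending)))  # S2: random pick
        ranked = sorted(
            ((similarity(item, other), label) for other, label in classified),
            key=lambda pair: pair[0],
            reverse=True,
        )
        neighbors = ranked[:k]  # S3: first K similarities
        normal = sum(1 for _, label in neighbors if label == "normal")
        abnormal = len(neighbors) - normal
        if normal != abnormal:
            label = "normal" if normal > abnormal else "abnormal"
        else:
            label = neighbors[0][1]  # tie: most similar neighbor decides
        classified.append((item, label))  # S4: grow the classified set
    # S7: report abnormal data from the classified set
    return [item for item, label in classified if label == "abnormal"]
```

Because each newly classified item joins the classified set before the next item is processed, the order of random selection can influence later decisions, which is consistent with the remark above about earlier results affecting later ones.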

The abnormal data detection method used in the present embodiment will be further described below by a specific experiment:

1. Initializing data: an original data set to be detected is selected, and normal data and abnormal data are selected from it, as shown in Table 1.

TABLE 1

2. [ somewhere in guo ] is taken as the unclassified data, and the initial K is preset to 2.

3. The similarity between the unclassified data and the classified data is calculated, that is, the similarity of [ somewhere in guo ] to [ Zhangin ] and to [ 281927 ] respectively.

Similarity of [ somewhere in guo ] to [ Zhangin ] = 0.67 × 0.2 + 0 × 0.4 + 1 × 0.4 = 0.534

Similarity of [ somewhere in guo ] to [ 281927 ] = 0.5 × 0.2 + 1 × 0.4 + 0 × 0.4 = 0.5

4. Determining the K neighbors and the category of the unclassified data

Since the number of neighbors K equals 2, the neighbors are [ Zhangin ] and [ 281927 ]. In this case K/2 neighbors are normal data and K/2 neighbors are abnormal data. The maximum similarity is 0.534, which belongs to the neighbor [ Zhangin ], so the unclassified data takes the same category as [ Zhangin ], namely normal data.

5. For the other unclassified data, the category is calculated in the same way; the abnormal data detection results are shown in Table 2.

TABLE 2

As can be seen from Table 2, the experiment contains 5 unclassified data. The algorithm detects one abnormal data, while the sample [ $8 yes ] is abnormal data but is not detected, so the accuracy of abnormal data detection by the method of this embodiment is 75%. The main factor affecting accuracy is semantic difference between data; a semantic matching degree rule could be added to the similarity calculation later to further improve the accuracy of abnormal data detection. However, no suitable semantic matching algorithm is currently available, so this addition is not considered for the time being.

In this embodiment, abnormal data detection is performed with the K-nearest neighbor algorithm, whose principle is described below:

1. Introduction of concept:

For a new input instance, the K-nearest neighbor algorithm finds the K instances in the training data set that are nearest to it (i.e., the K neighbors mentioned above); if the majority of these K instances belong to a certain class, the input instance is assigned to that class.

2. Case introduction:

As shown in fig. 2, there are two different types of sample data, represented by squares and triangles respectively, and the data marked by a circle is the data to be classified. That is, it is not known which class (square or triangle) the circle belongs to, and the problem addressed below is to classify this circle.

If K is 3, the nearest 3 neighbors of the circle are 2 triangles and 1 square. The minority yields to the majority, so on this statistical basis the circle is judged to belong to the triangle class.

If K is 5, the nearest 5 neighbors of the circle are 2 triangles and 3 squares. The minority yields to the majority, so on this statistical basis the circle is judged to belong to the square class.
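The effect of K on the decision can be reproduced with a small Python sketch. The coordinates below are hypothetical and were chosen only so that the neighbor counts match the two cases just described; fig. 2 itself gives no coordinates. Euclidean distance is likewise an assumption made for this illustration.

```python
import math
from collections import Counter


def knn_classify(query, samples, k):
    """Majority vote among the k samples closest to query (Euclidean distance)."""
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical layout: distances from the circle are 1.0, 1.2, 1.5, 2.0, 2.2, ~7.1
samples = [
    ((1.0, 0.0), "triangle"),
    ((0.0, 1.2), "triangle"),
    ((-1.5, 0.0), "square"),
    ((0.0, -2.0), "square"),
    ((2.2, 0.0), "square"),
    ((5.0, 5.0), "triangle"),
]
circle = (0.0, 0.0)

print(knn_classify(circle, samples, 3))  # triangle
print(knn_classify(circle, samples, 5))  # square
```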

Therefore, when it cannot be determined directly which known class the point to be classified belongs to, its position can be examined in a statistical sense: the weights of its surrounding neighbors are measured, and the point is assigned to the class with the larger weight. This is the core idea of the K-nearest neighbor algorithm. In the KNN algorithm (K-nearest neighbor algorithm), the selected neighbors are all objects that have already been correctly classified. In the classification decision, the method determines the category of the sample to be classified only from the category of the nearest sample or samples. The KNN algorithm is simple and effective. It is a lazy-learning algorithm: the classifier does not need to be trained on the training set, so the training time complexity is 0. The computational complexity of KNN classification is proportional to the number of samples in the training set; that is, if the training set contains n samples, the classification time complexity of KNN is O(n). Although the KNN method also relies in principle on the limit theorem, its class decision involves only a very small number of neighboring samples. Because the KNN method determines the class mainly from a limited number of surrounding samples rather than by discriminating class domains, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily. The model used by the K-nearest neighbor algorithm in fact corresponds to a partition of the feature space.

The choice of the K value, the distance metric and the classification decision rule are the three basic elements of the algorithm. When implementing the K-nearest neighbor algorithm, the main consideration is how to perform fast K-nearest neighbor search over the training data, which is especially necessary when the feature space has many dimensions and the training data is large.

The abnormal data detection method based on the K-nearest neighbor algorithm provided by this embodiment follows the principle of the KNN algorithm. The similarity between the unclassified data and each classified data in the classified data set is calculated based on the comprehensive verification rule, the K neighbors of the unclassified data are determined according to the similarity, and the category of the unclassified data is determined according to the categories of the K neighbors. In this way the category of every data item in the original data set to be detected is determined and all abnormal data are identified. The embodiment thus provides a general method for identifying abnormal data that simplifies the verification operation and is highly reusable. It also removes the need to configure verification rules for each field during data quality verification, which reduces manual workload and improves verification efficiency.

Example 2:

the present embodiment is configured to provide an abnormal data detection system based on a K-nearest neighbor algorithm, as shown in fig. 3, the detection system includes:

the classified data set determining module M1 is configured to determine, based on a preset standard, a normal data and an abnormal data in the original data set to be detected; the normal data and the abnormal data constitute a classified data set;

a similarity calculation module M2, configured to randomly select an unclassified data from the original data set to be detected, and calculate a similarity between the unclassified data and each classified data in the classified data set based on a comprehensive verification rule;

the neighbor determining module M3 is configured to sort according to the sequence of the similarity from large to small, and select classified data corresponding to the first K similarities as K neighbors of the unclassified data;

a category determining module M4, configured to determine a category of the unclassified data according to the categories of the K neighbors, to become classified data, and place the classified data into the classified data set; the categories include normal data and abnormal data;

a judging module M5, configured to judge whether there is unclassified data in the original data set to be detected;

a returning module M6, configured to, if yes, return to the step of "randomly selecting one unclassified data in the original data set to be detected", until each unclassified data in the data set to be detected has been detected;

and the abnormal data determining module M7 is used for determining abnormal data according to the classified data set if not, that is, when no unclassified data remains.

Example 3:

the embodiment is used for providing an abnormal data detection device based on a K-nearest neighbor algorithm, and the abnormal data detection device includes:

a processor; and

a memory having computer-readable program instructions stored therein,

wherein the detection method of embodiment 1 is performed when the computer readable program instructions are executed by the processor.

Example 4:

the present embodiment is to provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the detection method described in embodiment 1.

While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

While the invention has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the invention. Accordingly, the specification and figures are merely exemplary of the invention as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. An abnormal data detection method based on a K-nearest neighbor algorithm is characterized by comprising the following steps:

if not, determining abnormal data according to the classified data set.

2. The method according to claim 1, wherein the calculating the similarity between the unclassified data and each classified data in the classified data set based on the comprehensive verification rule specifically comprises:

for each classified data in the classified data set, calculating a character length matching degree, a special character matching degree and a data type matching degree between the unclassified data and the classified data;

and carrying out weighted summation on the character length matching degree, the special character matching degree and the data type matching degree to obtain the similarity between the unclassified data and the classified data.

3. The detection method according to claim 2, wherein said calculating a character length matching degree between the unclassified data and the classified data specifically comprises:

calculating a first character length of the unclassified data and a second character length of the classified data, respectively;

judging whether the first character length is smaller than the second character length;

if so, taking the ratio of the first character length to the second character length as the character length matching degree between the unclassified data and the classified data;

if not, taking the ratio of the second character length to the first character length as the character length matching degree between the unclassified data and the classified data.

4. The detection method according to claim 2, wherein said calculating the degree of special character matching between the unclassified data and the classified data specifically comprises:

judging whether special characters exist in the unclassified data and the classified data or not; the special characters are characters except English letters, numbers, Chinese characters and punctuation marks;

if the unclassified data and the classified data both have special characters, or the unclassified data and the classified data do not have special characters, the special character matching degree between the unclassified data and the classified data is 1;

otherwise, the special character matching degree between the unclassified data and the classified data is 0.

5. The detection method according to claim 2, wherein the calculating of the data type matching degree between the unclassified data and the classified data specifically comprises:

judging whether the data types of the unclassified data and the classified data are the same; the data types comprise Chinese characters, numbers and English letters;

if yes, the data type matching degree between the unclassified data and the classified data is 1;

otherwise, the data type matching degree between the unclassified data and the classified data is 0.

6. The detection method according to claim 1, wherein the determining the classification of the unclassified data according to the classification of the K neighbors comprises:

and taking the category of most neighbors in the K neighbors as the category of the unclassified data.

7. The detection method according to claim 6, wherein when K/2 neighbors of the K neighbors are normal data and K/2 neighbors are abnormal data, the classification of the unclassified data is the same as that of the neighbor with the highest similarity among the K neighbors.

8. An abnormal data detection system based on a K-nearest neighbor algorithm, the detection system comprising:

9. An abnormal data detection device based on a K-nearest neighbor algorithm is characterized by comprising:

a processor; and

a memory having computer-readable program instructions stored therein,

wherein the detection method of any one of claims 1-7 is performed when the computer readable program instructions are executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the detection method according to any one of claims 1 to 7.