CN112926650A - Data anomaly detection method based on feature selection coupling similarity


Info

Publication number: CN112926650A
Application number: CN202110205936.8A
Authority: CN (China)
Inventors: 郭鹏飞 (Guo Pengfei), 周新宇 (Zhou Xinyu)
Assignee (original and current): Liaoning Technical University
Priority/filing date: 2021-02-24
Publication date: 2021-06-08
Prior art keywords: attribute, similarity, data, equal, values
Legal status: Withdrawn after publication

Classifications

    • G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F18/211 — Pattern recognition; design or setup of recognition systems; selection of the most significant subset of features
    • G06F18/23213 — Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering


Abstract

The invention provides a data anomaly detection method based on feature selection coupling similarity, and relates to the technical field of outlier detection. The method first generates candidate feature subsets with a backward search and evaluates each subset by its information gain, selecting the optimal feature subset through an iterative loop and thereby reducing the dimensionality of the original data set; coupled similarities between objects are then computed on the selected features, and the resulting similarity and distance are applied in a specific clustering algorithm to obtain the final clustering result, in which the smallest subclass is regarded as abnormal and its data objects are taken as outliers. The method makes the computed inter-object similarities more accurate, which greatly helps the subsequent classification: it reduces the false positive rate, makes the classification result more accurate, and makes it easier to determine the subclass regarded as abnormal, so the anomaly detection result deviates less and the method is more efficient.

Description

Data anomaly detection method based on feature selection coupling similarity
Technical Field
The invention relates to the technical field of outlier detection, and in particular to a data anomaly detection method based on feature selection coupling similarity.
Background
A proper similarity measure plays a crucial role in data analysis, learning and processing. Measuring the intrinsic similarity of categorical data in unsupervised learning has not been fully solved, and very little work performs anomaly detection based on such similarity. In recent years, similarity analysis has been of great practical significance in several areas, including data mining. By defining a measure of similarity between attribute values, it quantifies the strength of the relationship between two data objects: the more similar two objects are to each other, the greater their similarity.
Meanwhile, as computer and database technologies develop rapidly, data accumulates far faster than humans can process it. Data mining, a multidisciplinary effort combining databases, machine learning and statistics, aims to turn these mountains of piled-up data into nuggets of knowledge. Researchers and practitioners have recognized that data preprocessing is critical to using data mining tools successfully. Feature selection is an important and common technique in data preprocessing for data mining: it reduces the number of features and deletes irrelevant, redundant or noisy data, which brings immediate benefits to applications, speeds up data mining algorithms, and improves mining performance such as prediction accuracy and the understandability of results. The optimality of a feature subset is measured by an evaluation criterion. As the dimensionality of a domain increases, so does the number of features. Finding the best feature subset is often intractable, and many problems related to feature selection have been shown to be NP-hard. A typical feature selection process consists of four basic steps: subset generation, subset evaluation, a stopping criterion, and result validation.
Most previous similarity measures are based on all features and do not reduce the feature dimensionality; the CMS and COS methods are examples. The CMS method computes the coupled similarity of non-IID data over all dimensions; in its experiments, the effectiveness of CMS is verified dimension by dimension by combining CMS and other similarity measures with spectral clustering and K-means. The COS method shows, by comparing the power-set-based IRSP, the universal-set-based IRSU, the join-set-based IRSJ and the intersection-based IRSI, that IRSI has the lowest time complexity at equal effectiveness, because the intersection-based approach involves the fewest objects. There are many ways to compute similarity, including cosine similarity, Jaccard-coefficient-based similarity and Pearson-coefficient-based similarity; each has its own advantages and disadvantages.
Similarity learning for categorical data has received increasing attention in recent years. Compared with similarity learning for numerical data, it is more complex and the research results are limited. Matching-based measures are typical: if the attribute values of two objects are the same, a matching-based measure simply assigns similarity 1; otherwise it assigns 0. Such simple matching-based measures often lead to misleading learning results, because they ignore the hidden similarity between categorical values. In addition, the inverse occurrence frequency (IOF) measure takes the frequency distribution of values into account: IOF is related to the concept of inverse document frequency and assigns lower similarity to mismatches on more frequent values, and vice versa, while the occurrence frequency (OF) measure weights mismatches in the opposite way to IOF. In supervised learning, some methods have studied the similarity between two categorical values in depth. Classical supervised similarity measures are the class-label-based Value Difference Metric (VDM) and the Modified Value Difference Metric (MVDM); both measure the distance between two attribute values in a multidimensional attribute space for supervised learning and adjust the distance with a weighting scheme. The Heterogeneous Value Difference Metric (HVDM) was proposed to accommodate categorical attributes. More and more researchers are also studying similarity analysis for unsupervised learning. One key observation is that the similarity of attribute values also depends on the other attributes. Typical work in this direction applies the Pearson and Jaccard coefficients between values. The Pearson correlation coefficient only reflects the strength of linear correlation in numerical data. The Jaccard similarity coefficient statistically compares the similarity and diversity of sample sets and is widely used in data mining tasks. Various techniques for learning the similarity of categorical data have been explored. The Iterative Context Distance (ICD) algorithm assumes that the similarities of attributes and objects are interdependent; it considers and iterates over attribute similarity, sub-relation similarity and row similarity, but faces a number of problems, including the choice of the starting point, database scan time, iteration, and convergence. Later work takes into account the overall distribution of two attribute values in a data set and their co-occurrence with other attribute values; however, that similarity only considers value co-occurrence and ignores the hierarchical similarity from attribute values up to objects, its computational cost is high, and no theoretical basis or analysis is provided for the measure. The COS method has been applied to classification, recommender systems, text mining, keyword query and video processing.
However, COS is not a metric-based similarity, and no theoretical basis or analysis is provided to verify its metric properties and establish why it works. CMS, in turn, does not consider that irrelevant features only add unnecessary computational cost and time complexity.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a data anomaly detection method based on feature selection coupling similarity: feature selection is first performed on high-dimensional data to remove irrelevant features and retain the features that benefit the subsequent classification. Coupled similarity is then computed on this basis, flexibly capturing the heterogeneous coupling relationships from values to attributes to objects, so the method adapts flexibly to various types of data. Finally, classification is performed with a specific clustering method: the cluster containing the fewest objects is regarded as the abnormal cluster, and the data objects in it are regarded as outliers.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a data anomaly detection method based on feature selection coupling similarity comprises the following steps:
step 1: feature selection, namely removing irrelevant features from the data, i.e., the features irrelevant to the current learning task;
a backward feature-subset search is adopted: starting from the full feature set, one feature is removed at a time to form a number of candidate subsets; given a data set D, let U be the set of data objects in D and C the set of attributes of each data object, i.e., U = {u_1, u_2, ..., u_m} and C = {c_1, c_2, ..., c_n}, where U and C are non-empty and m and n are determined by the actually given data set, and let V be the set of all attribute values; for an attribute subset A, D is partitioned according to the values of A into G data subsets {D_1, D_2, ..., D_G}, and the information gain of A is computed by the following formula;
Gain(A) = Ent(D) − Σ_{g=1}^{G} (|D_g| / |D|) · Ent(D_g)    (1)
wherein the information entropy is defined as:
Ent(D) = −Σ_{r=1}^{|y|} p_r · log2(p_r)    (2)
where p_r (r = 1, 2, ..., |y|) is the proportion of samples of the r-th class in D, |y| is the total number of classes in the data set, and y is the sample label information;
formula (1) is taken as the evaluation criterion for candidate feature subsets: the larger Gain(A) is, the more classification-relevant information the feature subset A contains; for each candidate feature subset, its information gain is therefore computed on the data set D as the evaluation criterion; the optimal feature subset must satisfy the optimization objective:
max Gain(A) (3)
as features are successively removed, once the subset becomes empty or the information gain no longer increases, the attribute subset A' with the largest information gain is selected as the optimal feature subset; it contains o features, with o ≤ n;
step 2: performing the coupled similarity calculation on the optimal feature subset selected in step 1;
the method for converting the distance metric into the similarity metric is as follows:
s(u_x, u_y) = 1/(1 + δ(u_x, u_y))    (4)
where δ(u_x, u_y) is the distance between data object u_x and data object u_y, and s(u_x, u_y) is the similarity between them, with 1 ≤ x ≤ m and 1 ≤ y ≤ m; when x = y, the similarity of the two data objects is 1;
conditional probability of attribute values: given a value v_k of attribute c_k, with c_k ∈ C and 1 ≤ k ≤ n, and the value v_jx of the j-th attribute of object u_x, with u_x ∈ U, the conditional probability p(v_k | v_jx) of v_k given v_jx is defined as:
p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|    (5)
where I(v_jx) denotes the set of data objects whose j-th attribute value is v_jx, and I(v_jx, v_k) the set of data objects taking both v_jx and v_k;
intra-attribute similarity definition: the intra-attribute similarity between the two values v_jx and v_jy taken by data objects u_x and u_y on the same attribute c_j is defined as follows:
SIa_j(v_jx, v_jy) = (log p · log q) / (log p + log q + log p · log q)    (6)
where log is the natural logarithm, p is the number of data objects whose value on attribute c_j is v_jx plus 1, q is the number of data objects whose value on c_j is v_jy plus 1, and 1 ≤ j ≤ n; equation (6) reflects that different occurrence frequencies represent different levels of importance of attribute values; the occurrence counts of v_jx and v_jy lie between 1 and m, and the similarity between the two values lies in (0, 1]; if v_jx and v_jy are distinct values, their similarity reaches its maximum when they occur the same number of times;
definition of the intersection of co-occurring attribute values: for the two values v_jx and v_jy of attribute c_j and another attribute c_k (1 ≤ j, k ≤ n, j ≠ k), the intersection of the sets of values of c_k co-occurring with them is defined as follows:
W_k = v_k(I(v_jx)) ∩ v_k(I(v_jy))    (7)
where v_k(I(v_jx)) is the set of values taken on attribute c_k by all objects in I(v_jx); W_k thus contains all values of attribute c_k that co-occur with both v_jx and v_jy;
based on the Jaccard distance and formula (4), the inter-attribute similarity is defined from IRSI and W_k via the Jaccard similarity;
the Jaccard distance is as follows:
δ_J(u_x, u_y) = 1 − J(u_x, u_y)    (8)
where J(u_x, u_y) is defined as:
J(u_x, u_y) = Σ_f min(u_xf, u_yf) / Σ_f max(u_xf, u_yf)    (9)
where u_x = (u_x1, u_x2, ..., u_xn) and u_y = (u_y1, u_y2, ..., u_yn) are real-valued n-dimensional vectors, and 1 ≤ f ≤ n;
inter-attribute similarity of one attribute relative to another: the similarity between the two values v_jx and v_jy of attribute c_j relative to another attribute c_k is defined as follows:
SIe_{k|j}(v_jx, v_jy) = Σ_{i=1}^{|W_k|} l_i / Σ_{i=1}^{|W_k|} e_i    (10)
where e_i = max(p_xi, p_yi) and l_i = min(p_xi, p_yi); p_xi = p(w_ki | v_jx) and p_yi = p(w_ki | v_jy) are the conditional probabilities of w_ki given the attribute values v_jx and v_jy, computed by equation (5), and w_ki is the i-th element of W_k; if W_k is the empty set, then SIe_{k|j} = a, where a is a positive number approaching 0;
the inter-attribute similarity is then defined as follows: the inter-attribute similarity between the two values v_jx and v_jy of attribute c_j is:
SIe_j(v_jx, v_jy) = Σ_{k=1, k≠j}^{o} γ_{k|j} · SIe_{k|j}(v_jx, v_jy)    (11)
where γ_{k|j} is the weight of attribute c_k with respect to attribute c_j, satisfying
Σ_{k=1, k≠j}^{o} γ_{k|j} = 1,
with γ_{k|j} ∈ [0, 1]; γ_{k|j} expresses the relationship between attributes c_j and c_k;
after the inter-attribute similarity is computed, the coupled attribute-value similarity is defined;
coupled attribute-value similarity definition: the coupled similarity between the values v_jx and v_jy of attribute c_j is defined as:
S_j(v_jx, v_jy) = α · SIa_j + (1 − α) · SIe_j    (12)
where α ∈ [0, 1]; different values of α reflect different proportions of the intra-attribute and inter-attribute similarities in the overall object similarity; a larger α indicates that the intra-attribute coupling plays the more important role in object similarity, while a smaller α indicates that the inter-attribute coupling, i.e., the coupling between attribute c_j and the other attributes, plays the more important role; when α = 0.5, S_j weights SIa_j and SIe_j equally (their arithmetic mean);
coupled object similarity definition: the similarity S(u_x, u_y) between two objects u_x and u_y is defined as:
S(u_x, u_y) = Σ_{j=1}^{o} β_j · S_j(v_jx, v_jy)    (13)
where β_j is the weight of the attribute-value similarity of attribute c_j, with β_j ∈ [0, 1] and
Σ_{j=1}^{o} β_j = 1;
step 3: cluster-based outlier detection;
according to the computed inter-object similarities, a clustering algorithm is trained to obtain the clustering result, and the data points in the smallest subclass obtained by clustering are taken as outliers.
The beneficial effect of the above technical solution is as follows: by adopting feature selection with coupled similarity, the data anomaly detection method provided by the invention makes the computed inter-object similarities more accurate, which greatly helps the subsequent classification; it reduces the false positive rate of classification, makes the classification result more accurate, and makes it easier to determine the subclass regarded as abnormal, so the anomaly detection result deviates less and the method is more efficient.
Drawings
Fig. 1 is a flowchart of a data anomaly detection method based on feature selection coupling similarity according to an embodiment of the present invention;
fig. 2 is a flowchart of a feature selection process according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, not to limit its scope.
Because most existing data sets are high-dimensional, in this work feature selection is first performed on the high-dimensional data to remove irrelevant features and retain the features that benefit the subsequent classification. Coupled similarity is then computed on this basis, flexibly capturing the heterogeneous coupling relationships from values to attributes to objects, so the method adapts flexibly to various types of data. Finally, classification is performed with a specific clustering method: the cluster containing the fewest objects is regarded as the abnormal cluster, and the data objects in it are regarded as outliers.
This embodiment provides a data anomaly detection method based on feature selection coupling similarity; as shown in Fig. 1, the specific method is as follows.
Step 1: feature selection, namely removing irrelevant features from the data, i.e., the features irrelevant to the current learning task. The feature selection process is shown in Fig. 2.
A backward feature-subset search is adopted: starting from the full feature set, one feature is removed at a time to form a number of candidate subsets. Given a data set D, let U be the set of data objects in D and C the set of attributes of each data object, i.e., U = {u_1, u_2, ..., u_m} and C = {c_1, c_2, ..., c_n}, where U and C are non-empty and m and n are determined by the actually given data set, and let V be the set of all attribute values. For an attribute subset A, D is partitioned according to the values of A into G data subsets {D_1, D_2, ..., D_G}, and the information gain of A is computed by the following formula;
Gain(A) = Ent(D) − Σ_{g=1}^{G} (|D_g| / |D|) · Ent(D_g)    (1)
wherein the information entropy is defined as:
Ent(D) = −Σ_{r=1}^{|y|} p_r · log2(p_r)    (2)
where p_r (r = 1, 2, ..., |y|) is the proportion of samples of the r-th class in D, |y| is the total number of classes in the data set, and y is the sample label information;
formula (1) is taken as the evaluation criterion for candidate feature subsets: the larger Gain(A) is, the more classification-relevant information the feature subset A contains. For each candidate feature subset, its information gain is therefore computed on the data set D as the evaluation criterion; the optimal feature subset must satisfy the optimization objective:
max Gain(A) (3)
As features are successively removed, once the subset becomes empty or the information gain no longer increases, the attribute subset A' with the largest information gain is selected as the optimal feature subset; it contains o features, with o ≤ n.
The optimal feature subset is then used for the subsequent coupled similarity calculation and clustering. Feature selection effectively reduces the number of intra-attribute and inter-attribute similarity computations, lowers the overall computational complexity, and improves the effectiveness of detection.
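For illustration only, the following is a minimal Python sketch of step 1 under stated assumptions: the data is a list of categorical records plus class labels for computing the entropy, and the data layout and all function names are illustrative, not part of the original disclosure.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) of equation (2): -sum(p_r * log2(p_r)) over the classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, subset):
    """Gain(A) of equation (1): Ent(D) minus the weighted entropy of the
    partition {D_1, ..., D_G} of D induced by the values of subset A."""
    groups = {}
    for row, label in zip(records, labels):
        groups.setdefault(tuple(row[j] for j in subset), []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def backward_select(records, labels, n_attrs):
    """Backward search: drop one feature per round while Gain(A) improves."""
    current = list(range(n_attrs))
    best = info_gain(records, labels, current)
    while len(current) > 1:
        cands = [[j for j in current if j != drop] for drop in current]
        gain, sub = max((info_gain(records, labels, s), s) for s in cands)
        if gain <= best:          # stop once the information gain stops rising
            break
        best, current = gain, sub
    return current                # indices of the optimal feature subset A'
```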
Step 2: perform the coupled similarity calculation on the optimal feature subset selected in step 1. Restricting the calculation to the selected features reduces the number of inter-attribute coupled similarity computations, avoids unnecessary data coupling, saves computation cost, and lowers the time complexity.
The method for converting the distance metric into the similarity metric is as follows:
s(u_x, u_y) = 1/(1 + δ(u_x, u_y))    (4)
where δ(u_x, u_y) is the distance between data object u_x and data object u_y, and s(u_x, u_y) is the similarity between them, with 1 ≤ x ≤ m and 1 ≤ y ≤ m; when x = y, the similarity of the two data objects is 1.
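Formula (4) and its inverse are one-liners; a sketch (the function names are illustrative):

```python
def dist_to_sim(delta):
    """Equation (4): map a non-negative distance to a similarity in (0, 1];
    identical objects (delta = 0, i.e. x = y) get similarity 1."""
    return 1.0 / (1.0 + delta)

def sim_to_dist(s):
    """Inverse of equation (4), handy later for distance-based clustering."""
    return 1.0 / s - 1.0
```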
Conditional probability of attribute values: given a value v_k of attribute c_k, with c_k ∈ C and 1 ≤ k ≤ n, and the value v_jx of the j-th attribute of object u_x, with u_x ∈ U, the conditional probability p(v_k | v_jx) of v_k given v_jx is defined as:
p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|    (5)
where I(v_jx) denotes the set of data objects whose j-th attribute value is v_jx, and I(v_jx, v_k) the set of data objects taking both v_jx and v_k.
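A minimal sketch of equation (5), assuming the data set is a list of records indexed by attribute position; I(v) is realized as the set of row indices holding value v, and the helper names are assumptions:

```python
def objects_with(records, j, v):
    """I(v): indices of the objects whose attribute j takes value v."""
    return {i for i, row in enumerate(records) if row[j] == v}

def cond_prob(records, j, v_j, k, v_k):
    """Equation (5): p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|."""
    base = objects_with(records, j, v_j)
    both = {i for i in base if records[i][k] == v_k}
    return len(both) / len(base) if base else 0.0
```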
Intra-attribute similarity definition: the intra-attribute similarity between the two values v_jx and v_jy taken by data objects u_x and u_y on the same attribute c_j is defined as follows:
SIa_j(v_jx, v_jy) = (log p · log q) / (log p + log q + log p · log q)    (6)
where log is the natural logarithm, p is the number of data objects whose value on attribute c_j is v_jx plus 1, and q is the number of data objects whose value on c_j is v_jy plus 1, with 1 ≤ j ≤ n. The +1 avoids a zero denominator when only one data object takes the value v_jx or v_jy on c_j (log 1 = 0), and the log function damps the growth of the similarity when the counts increase sharply. Equation (6) reflects that different occurrence frequencies represent different levels of importance of attribute values; the occurrence counts of v_jx and v_jy lie between 1 and m, and the similarity between the two values lies in (0, 1]; if v_jx and v_jy are distinct values, their similarity reaches its maximum when they occur the same number of times.
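Under the reconstruction of equation (6) used above (a log-smoothed frequency measure with the +1 correction; the printed formula is an image in the original publication), a sketch:

```python
import math

def intra_sim(count_x, count_y):
    """Equation (6) as reconstructed: intra-attribute similarity of two values
    of one attribute from their occurrence counts; the +1 keeps log(p) and
    log(q) strictly positive, so the denominator never vanishes."""
    p, q = count_x + 1, count_y + 1
    lp, lq = math.log(p), math.log(q)
    return (lp * lq) / (lp + lq + lp * lq)
```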
Definition of the intersection of co-occurring attribute values: for the two values v_jx and v_jy of attribute c_j and another attribute c_k (1 ≤ j, k ≤ n, j ≠ k), the intersection of the sets of values of c_k co-occurring with them is defined as follows:
W_k = v_k(I(v_jx)) ∩ v_k(I(v_jy))    (7)
where v_k(I(v_jx)) is the set of values taken on attribute c_k by all objects in I(v_jx); W_k thus contains all values of attribute c_k that co-occur with both v_jx and v_jy.
The Jaccard similarity coefficient is widely applied in clustering and classification. The Jaccard distance is a distance measure derived from the Jaccard similarity coefficient; it is given by:
δ_J(u_x, u_y) = 1 − J(u_x, u_y)    (8)
where J(u_x, u_y) is defined as:
J(u_x, u_y) = Σ_f min(u_xf, u_yf) / Σ_f max(u_xf, u_yf)    (9)
where u_x = (u_x1, u_x2, ..., u_xn) and u_y = (u_y1, u_y2, ..., u_yn) are real-valued n-dimensional vectors, and 1 ≤ f ≤ n.
Based on the Jaccard distance and formula (4), the inter-attribute similarity is defined from IRSI and W_k via the Jaccard similarity.
Inter-attribute similarity of one attribute relative to another: the similarity between the two values v_jx and v_jy of attribute c_j relative to another attribute c_k is defined as follows:
SIe_{k|j}(v_jx, v_jy) = Σ_{i=1}^{|W_k|} l_i / Σ_{i=1}^{|W_k|} e_i    (10)
where e_i = max(p_xi, p_yi) and l_i = min(p_xi, p_yi); p_xi = p(w_ki | v_jx) and p_yi = p(w_ki | v_jy) are the conditional probabilities of w_ki given the attribute values v_jx and v_jy, computed by equation (5), and w_ki is the i-th element of W_k; if W_k is the empty set, then SIe_{k|j} = a, where a is a positive number approaching 0.
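A sketch of equation (10) as reconstructed above (a Jaccard-style ratio of element-wise minima to maxima of the two conditional-probability vectors over W_k), reusing the objects_with and cond_prob helpers sketched earlier; the fallback a for an empty W_k is a small positive constant, per the definition:

```python
def inter_rel_sim(records, j, v_jx, v_jy, k, a=1e-6):
    """SIe_{k|j}(v_jx, v_jy): similarity of two values of attribute j
    relative to attribute k via the co-occurring value set W_k (eq. (7))."""
    w_k = ({records[i][k] for i in objects_with(records, j, v_jx)} &
           {records[i][k] for i in objects_with(records, j, v_jy)})
    if not w_k:
        return a                    # W_k empty: a positive number close to 0
    num = den = 0.0
    for w in w_k:
        p_x = cond_prob(records, j, v_jx, k, w)
        p_y = cond_prob(records, j, v_jy, k, w)
        num += min(p_x, p_y)        # l_i
        den += max(p_x, p_y)        # e_i
    return num / den if den else a
```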
The inter-attribute similarity is then defined as follows: the inter-attribute similarity between the two values v_jx and v_jy of attribute c_j is:
SIe_j(v_jx, v_jy) = Σ_{k=1, k≠j}^{o} γ_{k|j} · SIe_{k|j}(v_jx, v_jy)    (11)
where γ_{k|j} is the weight of attribute c_k with respect to attribute c_j, satisfying
Σ_{k=1, k≠j}^{o} γ_{k|j} = 1,
with γ_{k|j} ∈ [0, 1]; γ_{k|j} expresses the relationship between attributes c_j and c_k.
After the inter-attribute similarity is computed, the coupled attribute-value similarity is defined.
Coupled attribute-value similarity definition: the coupled similarity between the values v_jx and v_jy of attribute c_j is defined as:
S_j(v_jx, v_jy) = α · SIa_j + (1 − α) · SIe_j    (12)
where α ∈ [0, 1]; different values of α reflect different proportions of the intra-attribute and inter-attribute similarities in the overall object similarity. A larger α indicates that the intra-attribute coupling plays the more important role in object similarity, while a smaller α indicates that the inter-attribute coupling, i.e., the coupling between attribute c_j and the other attributes, plays the more important role; when α = 0.5, S_j weights SIa_j and SIe_j equally (their arithmetic mean).
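Equation (12) then mixes the two components. A sketch, assuming uniform weights γ_{k|j} = 1/(o − 1) in equation (11), since the weighting scheme itself is given only as an image in the original:

```python
def value_sim(records, j, v_jx, v_jy, attrs, counts_j, alpha=0.5):
    """Equation (12): S_j = alpha * SIa_j + (1 - alpha) * SIe_j, where SIe_j
    (equation (11)) averages SIe_{k|j} over the other selected attributes
    with uniform weights gamma_{k|j} (an illustrative assumption)."""
    s_intra = intra_sim(counts_j[v_jx], counts_j[v_jy])
    others = [k for k in attrs if k != j]
    s_inter = (sum(inter_rel_sim(records, j, v_jx, v_jy, k) for k in others)
               / len(others)) if others else 0.0
    return alpha * s_intra + (1 - alpha) * s_inter
```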
Coupled object similarity definition: the similarity S(u_x, u_y) between two objects u_x and u_y is defined as:
S(u_x, u_y) = Σ_{j=1}^{o} β_j · S_j(v_jx, v_jy)    (13)
where β_j is the weight of the attribute-value similarity of attribute c_j, with β_j ∈ [0, 1] and
Σ_{j=1}^{o} β_j = 1.
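Equation (13) finally aggregates the per-attribute value similarities into an object similarity; uniform weights β_j = 1/o are assumed here purely for illustration:

```python
from collections import Counter

def object_sim(records, x, y, attrs, alpha=0.5):
    """Equation (13): S(u_x, u_y) = sum_j beta_j * S_j(v_jx, v_jy), with
    beta_j = 1/o over the o selected attributes (uniform weighting is an
    illustrative assumption, not prescribed by the text)."""
    counts = {j: Counter(row[j] for row in records) for j in attrs}
    return sum(value_sim(records, j, records[x][j], records[y][j],
                         attrs, counts[j], alpha)
               for j in attrs) / len(attrs)
```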
Step 3: cluster-based outlier detection.
According to the computed inter-object similarities, a clustering algorithm is trained to obtain the clustering result, and the data points in the smallest subclass obtained by clustering are taken as outliers.
Data similarity is essentially a distance between data points: the larger the distance, the lower the similarity. The few data points in the smallest subclass are those far from the other data points and thus have low similarity to them, and points with low similarity to the other data points are treated as anomalies.
Clustering uses the similarity or distance between sample features as the basis for deciding whether samples belong to the same class: similar samples are grouped into one class, and dissimilar samples into different classes. Various common clustering methods can be adopted; this embodiment takes the K-means method as an example. The K-means algorithm first selects the specified cluster centers, where the number K of centers is chosen freely, e.g., K = 3: three objects are selected as cluster centers, each sample is assigned to the cluster of its nearest center, and after the 3 clusters are obtained, the mean vectors are recomputed; this process is repeated until the new clusters are identical to those of the previous round, at which point the algorithm stops and the final cluster partition is obtained. In this embodiment, training with the clustering algorithm on the computed inter-object similarities yields the clustering result; the objects in the resulting smallest subclass have, relative to the other objects, low similarity to most data objects, and are therefore regarded as abnormal.
In the final cluster partition, the class containing the fewest objects is found, and the data points in that cluster are taken as outliers, because the points of the smallest cluster have low similarity to the other points; these are reported as data anomalies.
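A sketch of step 3 under stated assumptions: the pairwise similarities are converted back to distances by inverting equation (4) (δ = 1/s − 1), SciPy's hierarchical clustering stands in for the clustering step (plain K-means needs vector means, which categorical data lacks, and the text leaves the specific clustering method open), and the smallest cluster is flagged as anomalous:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def detect_outliers(sim_matrix, n_clusters=3):
    """Cluster from a precomputed object-similarity matrix and return the
    indices of the smallest cluster as outliers (step 3 of the method)."""
    dist = 1.0 / np.clip(sim_matrix, 1e-12, None) - 1.0  # invert equation (4)
    np.fill_diagonal(dist, 0.0)
    labels = fcluster(linkage(squareform(dist, checks=False), method='average'),
                      t=n_clusters, criterion='maxclust')
    sizes = np.bincount(labels)[1:]                      # labels start at 1
    return np.where(labels == np.argmin(sizes) + 1)[0]
```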
The data anomaly detection method based on feature selection coupling similarity proposed in this embodiment is compared with common similarity measures. The following five state-of-the-art similarity/distance measures are compared with the method of this embodiment: the ALGO distance (ALGO), Coupled Object Similarity (COS), the Distance Metric (DM) method, the Hamming distance (HM), and the Occurrence Frequency (OF) measure. All of these measures, together with the similarity proposed in this embodiment, are plugged into the typical distance-based algorithm K-means, and their clustering performance on categorical data is compared to evaluate which similarity measure gives better results.
8 UCI data sets were used for the experiments. Table 1 lists the detailed characteristics of these 8 data sets: the number of objects, the number of attributes, the number of distinct values over all attributes, the number of classes, and an abbreviation, where the abbreviation is a shortened form of the data set name. All numerical attributes in the data sets are removed, so that only the similarity of categorical data is tested.
As external criteria, some commonly used measures are selected to compare the clustering results of the different similarity measures; each is computed between the cluster label assigned to each object by the clustering algorithm and the ground truth given by the class labels in the source data. Each criterion can be expressed as a score: the larger the score, the better the clustering performance, the more effective the corresponding similarity measure, and the more effective the outlier detection. The F-score is defined as follows:
F1 = 2·TP / (2·TP + FP + FN)    (14)
where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
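Equation (14) in code form, a trivial sketch:

```python
def f_score(tp, fp, fn):
    """Equation (14): F1 = 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```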
TABLE 1 data set characterization
Data set        Objects   Attributes   Distinct values   Classes   Abbrev.
Soybeansmall    47        35           97                4         So
Zoo             101       16           36                7         Zo
DNAPromoter     106       57           228               2         Dp
Hayesroth       132       4            15                3         Ha
Lymphography    148       18           59                4         Ly
Hepatitis       155       13           36                2         He
Housevotes      232       16           32                2         Ho
Spect           267       22           44                2         Sp
The results obtained by combining the proposed method and the 5 other similarity-calculation methods with K-means clustering are shown in Table 2. Compared with the other similarity measures, the proposed method achieves a better clustering effect, which improves the classification performance and benefits the subsequent outlier detection.
TABLE 2 comparison of several similarity calculation methods F-score results
(Table 2 appears as an image in the original document.)
For the method proposed in this embodiment, different values of α produce different classification effects, and the optimal value of α also differs from data set to data set. The clustering performance on the different data sets under different values of α is shown in Table 3 below.
TABLE 3 comparison of alpha values for performance
(Table 3 appears as an image in the original document.)
The experimental comparison shows that the optimal α of the similarity calculation differs across data sets, so in practice a single parameter value cannot simply be adopted for all data sets. However, much previous work indicates that the theoretical optimum of α is 0.5: as described above, when α = 0.5 the object similarity weights the intra-attribute and inter-attribute similarities equally, and in theory this yields the best similarity measure, so the subsequent classification is more accurate and the performance is better.
Most outlier detection methods combine clustering with outlier detection. This embodiment demonstrates that, by using feature selection with coupled similarity, the computed inter-object similarities become more accurate, which greatly helps the subsequent classification; it reduces the false positive rate of classification, makes the classification result more accurate, and makes it easier to determine the subclass regarded as abnormal, so the anomaly detection result deviates less and the method is more efficient.
Finally, it should be noted that the above examples are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions as defined in the appended claims.

Claims (1)

1. A data anomaly detection method based on feature selection coupling similarity, characterized by comprising the following steps:
step 1: feature selection, namely removing irrelevant features from the data, i.e., the features irrelevant to the current learning task;
a backward feature-subset search is adopted: starting from the full feature set, one feature is removed at a time to form a number of candidate subsets; given a data set D, let U be the set of data objects in D and C the set of attributes of each data object, i.e., U = {u_1, u_2, ..., u_m} and C = {c_1, c_2, ..., c_n}, where U and C are non-empty and m and n are determined by the actually given data set, and let V be the set of all attribute values; for an attribute subset A, D is partitioned according to the values of A into G data subsets {D_1, D_2, ..., D_G}, and the information gain of A is computed by the following formula;
Gain(A) = Ent(D) − Σ_{g=1}^{G} (|D_g| / |D|) · Ent(D_g)    (1)
wherein the information entropy is defined as:
Ent(D) = −Σ_{r=1}^{|y|} p_r · log2(p_r)    (2)
where p_r (r = 1, 2, ..., |y|) is the proportion of samples of the r-th class in D, |y| is the total number of classes in the data set, and y is the sample label information;
formula (1) is taken as the evaluation criterion for candidate feature subsets: the larger Gain(A) is, the more classification-relevant information the feature subset A contains; for each candidate feature subset, its information gain is therefore computed on the data set D as the evaluation criterion; the optimal feature subset must satisfy the optimization objective:
max Gain(A) (3)
as features are successively removed, once the subset becomes empty or the information gain no longer increases, the attribute subset A' with the largest information gain is selected as the optimal feature subset; it contains o features, with o ≤ n;
step 2: performing the coupled similarity calculation on the optimal feature subset selected in step 1;
the method for converting the distance metric into the similarity metric is as follows:
s(u_x, u_y) = 1/(1 + δ(u_x, u_y))    (4)
where δ(u_x, u_y) is the distance between data object u_x and data object u_y, and s(u_x, u_y) is the similarity between them, with 1 ≤ x ≤ m and 1 ≤ y ≤ m; when x = y, the similarity of the two data objects is 1;
conditional probability of attribute values: given a value v_k of attribute c_k, with c_k ∈ C and 1 ≤ k ≤ n, and the value v_jx of the j-th attribute of object u_x, with u_x ∈ U, the conditional probability p(v_k | v_jx) of v_k given v_jx is defined as:
p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|    (5)
where I(v_jx) denotes the set of data objects whose j-th attribute value is v_jx, and I(v_jx, v_k) the set of data objects taking both v_jx and v_k;
intra-attribute similarity definition: the intra-attribute similarity between the two values v_jx and v_jy taken by data objects u_x and u_y on the same attribute c_j is defined as follows:
SIa_j(v_jx, v_jy) = (log p · log q) / (log p + log q + log p · log q)    (6)
where log is the natural logarithm, p is the number of data objects whose value on attribute c_j is v_jx plus 1, q is the number of data objects whose value on c_j is v_jy plus 1, and 1 ≤ j ≤ n; equation (6) reflects that different occurrence frequencies represent different levels of importance of attribute values; the occurrence counts of v_jx and v_jy lie between 1 and m, and the similarity between the two values lies in (0, 1]; if v_jx and v_jy are distinct values, their similarity reaches its maximum when they occur the same number of times;
definition of the intersection of co-occurring attribute values: for the two values v_jx and v_jy of attribute c_j and another attribute c_k (1 ≤ j, k ≤ n, j ≠ k), the intersection of the sets of values of c_k co-occurring with them is defined as follows:
W_k = v_k(I(v_jx)) ∩ v_k(I(v_jy))    (7)
where v_k(I(v_jx)) is the set of values taken on attribute c_k by all objects in I(v_jx); W_k thus contains all values of attribute c_k that co-occur with both v_jx and v_jy;
based on the Jaccard distance and formula (4), the inter-attribute similarity is defined from IRSI and W_k via the Jaccard similarity;
the Jaccard distance is as follows:
δ_J(u_x, u_y) = 1 − J(u_x, u_y)    (8)
where J(u_x, u_y) is defined as:
J(u_x, u_y) = Σ_f min(u_xf, u_yf) / Σ_f max(u_xf, u_yf)    (9)
where u_x = (u_x1, u_x2, ..., u_xn) and u_y = (u_y1, u_y2, ..., u_yn) are real-valued n-dimensional vectors, and 1 ≤ f ≤ n;
inter-attribute similarity of one attribute relative to another: the similarity between the two values v_jx and v_jy of attribute c_j relative to another attribute c_k is defined as follows:
SIe_{k|j}(v_jx, v_jy) = Σ_{i=1}^{|W_k|} l_i / Σ_{i=1}^{|W_k|} e_i    (10)
where e_i = max(p_xi, p_yi) and l_i = min(p_xi, p_yi); p_xi = p(w_ki | v_jx) and p_yi = p(w_ki | v_jy) are the conditional probabilities of w_ki given the attribute values v_jx and v_jy, computed by equation (5), and w_ki is the i-th element of W_k; if W_k is the empty set, then SIe_{k|j} = a, where a is a positive number approaching 0;
the inter-attribute similarity is then defined as follows: the inter-attribute similarity between the two values v_jx and v_jy of attribute c_j is:
SIe_j(v_jx, v_jy) = Σ_{k=1, k≠j}^{o} γ_{k|j} · SIe_{k|j}(v_jx, v_jy)    (11)
where γ_{k|j} is the weight of attribute c_k with respect to attribute c_j, satisfying
Σ_{k=1, k≠j}^{o} γ_{k|j} = 1,
with γ_{k|j} ∈ [0, 1]; γ_{k|j} expresses the relationship between attributes c_j and c_k;
after the inter-attribute similarity is computed, the coupled attribute-value similarity is defined;
coupled attribute-value similarity definition: the coupled similarity between the values v_jx and v_jy of attribute c_j is defined as:
S_j(v_jx, v_jy) = α · SIa_j + (1 − α) · SIe_j    (12)
where α ∈ [0, 1]; different values of α reflect different proportions of the intra-attribute and inter-attribute similarities in the overall object similarity; a larger α indicates that the intra-attribute coupling plays the more important role in object similarity, while a smaller α indicates that the inter-attribute coupling, i.e., the coupling between attribute c_j and the other attributes, plays the more important role; when α = 0.5, S_j weights SIa_j and SIe_j equally (their arithmetic mean);
coupled object similarity definition: the similarity S(u_x, u_y) between two objects u_x and u_y is defined as:
S(u_x, u_y) = Σ_{j=1}^{o} β_j · S_j(v_jx, v_jy)    (13)
where β_j is the weight of the attribute-value similarity of attribute c_j, with β_j ∈ [0, 1] and
Σ_{j=1}^{o} β_j = 1;
step 3: cluster-based outlier detection;
according to the computed inter-object similarities, a clustering algorithm is trained to obtain the clustering result, and the data points in the smallest subclass obtained by clustering are taken as outliers.
CN202110205936.8A 2021-02-24 2021-02-24 Data anomaly detection method based on feature selection coupling similarity Withdrawn CN112926650A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052885A * 2023-02-07 2023-05-02 齐鲁工业大学(山东省科学院) System, method, equipment and medium for improving prognosis prediction precision based on improved Relieff cancer histology feature selection algorithm
CN116052885B * 2023-02-07 2024-03-08 齐鲁工业大学(山东省科学院) System, method, equipment and medium for improving prognosis prediction precision based on improved Relieff cancer histology feature selection algorithm
CN116361345A * 2023-06-01 2023-06-30 新华三人工智能科技有限公司 Feature screening and classifying method, device, equipment and medium for data stream
CN116361345B * 2023-06-01 2023-09-22 新华三人工智能科技有限公司 Feature screening and classifying method, device, equipment and medium for data stream


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 2021-06-08)