CN112926650A - Data anomaly detection method based on feature selection coupling similarity


Info

Publication number: CN112926650A
Application number: CN202110205936.8A
Authority: CN (China)
Inventors: 郭鹏飞 (Guo Pengfei), 周新宇 (Zhou Xinyu)
Assignee (original and current): Liaoning Technical University
Priority/filing date: 2021-02-24
Publication date: 2021-06-08
Prior art keywords: attribute, similarity, data, equal, values
Legal status: Withdrawn after publication

Classifications

    • G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F18/211 — Pattern recognition; design or setup of recognition systems; selection of the most significant subset of features
    • G06F18/23213 — Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering


Abstract

The invention provides a data anomaly detection method based on feature selection coupling similarity, and relates to the technical field of outlier detection. The method first generates candidate feature subsets with a backward search and evaluates each subset by its information gain, selecting the optimal feature subset through an iterative loop and thereby reducing the dimensionality of the original data set; coupled similarities between objects are then computed on the selected features, and the resulting similarity and distance are applied in a specific clustering algorithm to obtain the final clustering result, in which the smallest subclass is regarded as abnormal and its data objects are taken as outliers. The method makes the computed inter-object similarities more accurate, which greatly helps the subsequent classification: it reduces the false positive rate, makes the classification result more accurate, and makes it easier to determine the subclass regarded as abnormal, so the anomaly detection result deviates less and the method is more efficient.

Description

Data anomaly detection method based on feature selection coupling similarity
Technical Field
The invention relates to the technical field of outlier detection, and in particular to a data anomaly detection method based on feature selection coupling similarity.
Background
A proper similarity measure plays a crucial role in data analysis, learning and processing. Measuring the intrinsic similarity of categorical data in unsupervised learning has not been fully solved, and very little work performs anomaly detection based on such similarity. In recent years, similarity analysis has been of great practical significance in several areas, including data mining. By defining a measure of similarity between attribute values, it quantifies the strength of the relationship between two data objects: the more similar two objects are to each other, the greater their similarity.
Meanwhile, as computer and database technologies develop rapidly, data accumulates far faster than humans can process it. Data mining, a multidisciplinary effort combining databases, machine learning and statistics, aims to turn these mountains of piled-up data into nuggets of knowledge. Researchers and practitioners have recognized that data preprocessing is critical to using data mining tools successfully. Feature selection is an important and common technique in data preprocessing for data mining: it reduces the number of features and deletes irrelevant, redundant or noisy data, which brings immediate benefits to applications, speeds up data mining algorithms, and improves mining performance such as prediction accuracy and the understandability of results. The optimality of a feature subset is measured by an evaluation criterion. As the dimensionality of a domain increases, so does the number of features. Finding the best feature subset is often intractable, and many problems related to feature selection have been shown to be NP-hard. A typical feature selection process consists of four basic steps: subset generation, subset evaluation, a stopping criterion, and result validation.
Most previous similarity measures are based on all features and do not reduce the feature dimensionality; the CMS and COS methods are examples. The CMS method computes the coupled similarity of non-IID data over all dimensions; in its experiments, the effectiveness of CMS is verified dimension by dimension by combining CMS and other similarity measures with spectral clustering and K-means. The COS method shows, by comparing the power-set-based IRSP, the universal-set-based IRSU, the join-set-based IRSJ and the intersection-based IRSI, that IRSI has the lowest time complexity at equal effectiveness, because the intersection-based approach involves the fewest objects. There are many ways to compute similarity, including cosine similarity, Jaccard-coefficient-based similarity and Pearson-coefficient-based similarity; each has its own advantages and disadvantages.
Similarity learning for categorical data has received increasing attention in recent years. Compared with similarity learning for numerical data, it is more complex and the research results are limited. Matching-based measures are typical: if the attribute values of two objects are the same, a matching-based measure simply assigns similarity 1; otherwise it assigns 0. Such simple matching-based measures often lead to misleading learning results, because they ignore the hidden similarity between categorical values. In addition, the inverse occurrence frequency (IOF) measure takes the frequency distribution of values into account: IOF is related to the concept of inverse document frequency and assigns lower similarity to mismatches on more frequent values, and vice versa, while the occurrence frequency (OF) measure weights mismatches in the opposite way to IOF. In supervised learning, some methods have studied the similarity between two categorical values in depth. Classical supervised similarity measures are the class-label-based Value Difference Metric (VDM) and the Modified Value Difference Metric (MVDM); both measure the distance between two attribute values in a multidimensional attribute space for supervised learning and adjust the distance with a weighting scheme. The Heterogeneous Value Difference Metric (HVDM) was proposed to accommodate categorical attributes. More and more researchers are also studying similarity analysis for unsupervised learning. One key observation is that the similarity of attribute values also depends on the other attributes. Typical work in this direction applies the Pearson and Jaccard coefficients between values. The Pearson correlation coefficient only reflects the strength of linear correlation in numerical data. The Jaccard similarity coefficient statistically compares the similarity and diversity of sample sets and is widely used in data mining tasks. Various techniques for learning the similarity of categorical data have been explored. The Iterative Context Distance (ICD) algorithm assumes that the similarities of attributes and objects are interdependent; it considers and iterates over attribute similarity, sub-relation similarity and row similarity, but faces a number of problems, including the choice of the starting point, database scan time, iteration, and convergence. Later work takes into account the overall distribution of two attribute values in a data set and their co-occurrence with other attribute values; however, that similarity only considers value co-occurrence and ignores the hierarchical similarity from attribute values up to objects, its computational cost is high, and no theoretical basis or analysis is provided for the measure. The COS method has been applied to classification, recommender systems, text mining, keyword query and video processing.
However, COS is not a metric-based similarity, and no theoretical basis or analysis is provided to verify its metric properties and establish why it works. CMS, in turn, does not consider that irrelevant features only add unnecessary computational cost and time complexity.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a data anomaly detection method based on feature selection coupling similarity: feature selection is first performed on high-dimensional data to remove irrelevant features and retain the features that benefit the subsequent classification. Coupled similarity is then computed on this basis, flexibly capturing the heterogeneous coupling relationships from values to attributes to objects, so the method adapts flexibly to various types of data. Finally, classification is performed with a specific clustering method: the cluster containing the fewest objects is regarded as the abnormal cluster, and the data objects in it are regarded as outliers.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a data anomaly detection method based on feature selection coupling similarity comprises the following steps:
step 1: feature selection, namely removing irrelevant features from the data, i.e., the features irrelevant to the current learning task;
a backward feature-subset search is adopted: starting from the full feature set, one feature is removed at a time to form a number of candidate subsets; given a data set D, let U be the set of data objects in D and C the set of attributes of each data object, i.e., U = {u_1, u_2, ..., u_m} and C = {c_1, c_2, ..., c_n}, where U and C are non-empty and m and n are determined by the actually given data set, and let V be the set of all attribute values; for an attribute subset A, D is partitioned according to the values of A into G data subsets {D_1, D_2, ..., D_G}, and the information gain of A is computed by the following formula;
Gain(A) = Ent(D) − Σ_{g=1}^{G} (|D_g| / |D|) · Ent(D_g)    (1)
wherein the information entropy is defined as:
Ent(D) = −Σ_{r=1}^{|y|} p_r · log2(p_r)    (2)
where p_r (r = 1, 2, ..., |y|) is the proportion of samples of the r-th class in D, |y| is the total number of classes in the data set, and y is the sample label information;
formula (1) is taken as the evaluation criterion for candidate feature subsets: the larger Gain(A) is, the more classification-relevant information the feature subset A contains; for each candidate feature subset, its information gain is therefore computed on the data set D as the evaluation criterion; the optimal feature subset must satisfy the optimization objective:
max Gain(A) (3)
as features are successively removed, once the subset becomes empty or the information gain no longer increases, the attribute subset A' with the largest information gain is selected as the optimal feature subset; it contains o features, with o ≤ n;
step 2: performing the coupled similarity calculation on the optimal feature subset selected in step 1;
the method for converting the distance metric into the similarity metric is as follows:
s(u_x, u_y) = 1/(1 + δ(u_x, u_y))    (4)
where δ(u_x, u_y) is the distance between data object u_x and data object u_y, and s(u_x, u_y) is the similarity between them, with 1 ≤ x ≤ m and 1 ≤ y ≤ m; when x = y, the similarity of the two data objects is 1;
conditional probability of attribute values: given a value v_k of attribute c_k, with c_k ∈ C and 1 ≤ k ≤ n, and the value v_jx of the j-th attribute of object u_x, with u_x ∈ U, the conditional probability p(v_k | v_jx) of v_k given v_jx is defined as:
p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|    (5)
where I(v_jx) denotes the set of data objects whose j-th attribute value is v_jx, and I(v_jx, v_k) the set of data objects taking both v_jx and v_k;
intra-attribute similarity definition: the intra-attribute similarity between the two values v_jx and v_jy taken by data objects u_x and u_y on the same attribute c_j is defined as follows:
SIa_j(v_jx, v_jy) = (log p · log q) / (log p + log q + log p · log q)    (6)
where log is the natural logarithm, p is the number of data objects whose value on attribute c_j is v_jx plus 1, q is the number of data objects whose value on c_j is v_jy plus 1, and 1 ≤ j ≤ n; equation (6) reflects that different occurrence frequencies represent different levels of importance of attribute values; the occurrence counts of v_jx and v_jy lie between 1 and m, and the similarity between the two values lies in (0, 1]; if v_jx and v_jy are distinct values, their similarity reaches its maximum when they occur the same number of times;
definition of the intersection of co-occurring attribute values: for the two values v_jx and v_jy of attribute c_j and another attribute c_k (1 ≤ j, k ≤ n, j ≠ k), the intersection of the sets of values of c_k co-occurring with them is defined as follows:
W_k = v_k(I(v_jx)) ∩ v_k(I(v_jy))    (7)
where v_k(I(v_jx)) is the set of values taken on attribute c_k by all objects in I(v_jx); W_k thus contains all values of attribute c_k that co-occur with both v_jx and v_jy;
based on the Jaccard distance and formula (4), the inter-attribute similarity is defined from IRSI and W_k via the Jaccard similarity;
the Jaccard distance is as follows:
δ_J(u_x, u_y) = 1 − J(u_x, u_y)    (8)
where J(u_x, u_y) is defined as:
J(u_x, u_y) = Σ_f min(u_xf, u_yf) / Σ_f max(u_xf, u_yf)    (9)
where u_x = (u_x1, u_x2, ..., u_xn) and u_y = (u_y1, u_y2, ..., u_yn) are real-valued n-dimensional vectors, and 1 ≤ f ≤ n;
inter-attribute similarity of one attribute relative to another: the similarity between the two values v_jx and v_jy of attribute c_j relative to another attribute c_k is defined as follows:
SIe_{k|j}(v_jx, v_jy) = Σ_{i=1}^{|W_k|} l_i / Σ_{i=1}^{|W_k|} e_i    (10)
where e_i = max(p_xi, p_yi) and l_i = min(p_xi, p_yi); p_xi = p(w_ki | v_jx) and p_yi = p(w_ki | v_jy) are the conditional probabilities of w_ki given the attribute values v_jx and v_jy, computed by equation (5), and w_ki is the i-th element of W_k; if W_k is the empty set, then SIe_{k|j} = a, where a is a positive number approaching 0;
the inter-attribute similarity is then defined as follows: the inter-attribute similarity between the two values v_jx and v_jy of attribute c_j is:
SIe_j(v_jx, v_jy) = Σ_{k=1, k≠j}^{o} γ_{k|j} · SIe_{k|j}(v_jx, v_jy)    (11)
where γ_{k|j} is the weight of attribute c_k with respect to attribute c_j, satisfying
Σ_{k=1, k≠j}^{o} γ_{k|j} = 1,
with γ_{k|j} ∈ [0, 1]; γ_{k|j} expresses the relationship between attributes c_j and c_k;
after the inter-attribute similarity is computed, the coupled attribute-value similarity is defined;
coupled attribute-value similarity definition: the coupled similarity between the values v_jx and v_jy of attribute c_j is defined as:
S_j(v_jx, v_jy) = α · SIa_j + (1 − α) · SIe_j    (12)
where α ∈ [0, 1]; different values of α reflect different proportions of the intra-attribute and inter-attribute similarities in the overall object similarity; a larger α indicates that the intra-attribute coupling plays the more important role in object similarity, while a smaller α indicates that the inter-attribute coupling, i.e., the coupling between attribute c_j and the other attributes, plays the more important role; when α = 0.5, S_j weights SIa_j and SIe_j equally (their arithmetic mean);
coupled object similarity definition: the similarity S(u_x, u_y) between two objects u_x and u_y is defined as:
S(u_x, u_y) = Σ_{j=1}^{o} β_j · S_j(v_jx, v_jy)    (13)
where β_j is the weight of the attribute-value similarity of attribute c_j, with β_j ∈ [0, 1] and
Σ_{j=1}^{o} β_j = 1;
step 3: cluster-based outlier detection;
according to the computed inter-object similarities, a clustering algorithm is trained to obtain the clustering result, and the data points in the smallest subclass obtained by clustering are taken as outliers.
The beneficial effect of the above technical solution is as follows: by adopting feature selection with coupled similarity, the data anomaly detection method provided by the invention makes the computed inter-object similarities more accurate, which greatly helps the subsequent classification; it reduces the false positive rate of classification, makes the classification result more accurate, and makes it easier to determine the subclass regarded as abnormal, so the anomaly detection result deviates less and the method is more efficient.
Drawings
Fig. 1 is a flowchart of a data anomaly detection method based on feature selection coupling similarity according to an embodiment of the present invention;
fig. 2 is a flowchart of a feature selection process according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, not to limit its scope.
Because most existing data sets are high-dimensional, in this work feature selection is first performed on the high-dimensional data to remove irrelevant features and retain the features that benefit the subsequent classification. Coupled similarity is then computed on this basis, flexibly capturing the heterogeneous coupling relationships from values to attributes to objects, so the method adapts flexibly to various types of data. Finally, classification is performed with a specific clustering method: the cluster containing the fewest objects is regarded as the abnormal cluster, and the data objects in it are regarded as outliers.
This embodiment provides a data anomaly detection method based on feature selection coupling similarity; as shown in Fig. 1, the specific method is as follows.
Step 1: feature selection, namely removing irrelevant features from the data, i.e., the features irrelevant to the current learning task. The feature selection process is shown in Fig. 2.
A backward feature-subset search is adopted: starting from the full feature set, one feature is removed at a time to form a number of candidate subsets. Given a data set D, let U be the set of data objects in D and C the set of attributes of each data object, i.e., U = {u_1, u_2, ..., u_m} and C = {c_1, c_2, ..., c_n}, where U and C are non-empty and m and n are determined by the actually given data set, and let V be the set of all attribute values. For an attribute subset A, D is partitioned according to the values of A into G data subsets {D_1, D_2, ..., D_G}, and the information gain of A is computed by the following formula;
Gain(A) = Ent(D) − Σ_{g=1}^{G} (|D_g| / |D|) · Ent(D_g)    (1)
wherein the information entropy is defined as:
Ent(D) = −Σ_{r=1}^{|y|} p_r · log2(p_r)    (2)
where p_r (r = 1, 2, ..., |y|) is the proportion of samples of the r-th class in D, |y| is the total number of classes in the data set, and y is the sample label information;
formula (1) is taken as the evaluation criterion for candidate feature subsets: the larger Gain(A) is, the more classification-relevant information the feature subset A contains. For each candidate feature subset, its information gain is therefore computed on the data set D as the evaluation criterion; the optimal feature subset must satisfy the optimization objective:
max Gain(A) (3)
As features are successively removed, once the subset becomes empty or the information gain no longer increases, the attribute subset A' with the largest information gain is selected as the optimal feature subset; it contains o features, with o ≤ n.
The optimal feature subset is then used for the subsequent coupled similarity calculation and clustering. Feature selection effectively reduces the number of intra-attribute and inter-attribute similarity computations, lowers the overall computational complexity, and improves the effectiveness of detection.
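For illustration only, the following is a minimal Python sketch of step 1 under stated assumptions: the data is a list of categorical records plus class labels for computing the entropy, and the data layout and all function names are illustrative, not part of the original disclosure.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) of equation (2): -sum(p_r * log2(p_r)) over the classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, subset):
    """Gain(A) of equation (1): Ent(D) minus the weighted entropy of the
    partition {D_1, ..., D_G} of D induced by the values of subset A."""
    groups = {}
    for row, label in zip(records, labels):
        groups.setdefault(tuple(row[j] for j in subset), []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def backward_select(records, labels, n_attrs):
    """Backward search: drop one feature per round while Gain(A) improves."""
    current = list(range(n_attrs))
    best = info_gain(records, labels, current)
    while len(current) > 1:
        cands = [[j for j in current if j != drop] for drop in current]
        gain, sub = max((info_gain(records, labels, s), s) for s in cands)
        if gain <= best:          # stop once the information gain stops rising
            break
        best, current = gain, sub
    return current                # indices of the optimal feature subset A'
```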
Step 2: perform the coupled similarity calculation on the optimal feature subset selected in step 1. Restricting the calculation to the selected features reduces the number of inter-attribute coupled similarity computations, avoids unnecessary data coupling, saves computation cost, and lowers the time complexity.
The method for converting the distance metric into the similarity metric is as follows:
s(u_x, u_y) = 1/(1 + δ(u_x, u_y))    (4)
where δ(u_x, u_y) is the distance between data object u_x and data object u_y, and s(u_x, u_y) is the similarity between them, with 1 ≤ x ≤ m and 1 ≤ y ≤ m; when x = y, the similarity of the two data objects is 1.
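Formula (4) and its inverse are one-liners; a sketch (the function names are illustrative):

```python
def dist_to_sim(delta):
    """Equation (4): map a non-negative distance to a similarity in (0, 1];
    identical objects (delta = 0, i.e. x = y) get similarity 1."""
    return 1.0 / (1.0 + delta)

def sim_to_dist(s):
    """Inverse of equation (4), handy later for distance-based clustering."""
    return 1.0 / s - 1.0
```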
Conditional probability of attribute values: given a value v_k of attribute c_k, with c_k ∈ C and 1 ≤ k ≤ n, and the value v_jx of the j-th attribute of object u_x, with u_x ∈ U, the conditional probability p(v_k | v_jx) of v_k given v_jx is defined as:
p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|    (5)
where I(v_jx) denotes the set of data objects whose j-th attribute value is v_jx, and I(v_jx, v_k) the set of data objects taking both v_jx and v_k.
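A minimal sketch of equation (5), assuming the data set is a list of records indexed by attribute position; I(v) is realized as the set of row indices holding value v, and the helper names are assumptions:

```python
def objects_with(records, j, v):
    """I(v): indices of the objects whose attribute j takes value v."""
    return {i for i, row in enumerate(records) if row[j] == v}

def cond_prob(records, j, v_j, k, v_k):
    """Equation (5): p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|."""
    base = objects_with(records, j, v_j)
    both = {i for i in base if records[i][k] == v_k}
    return len(both) / len(base) if base else 0.0
```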
Intra-attribute similarity definition: the intra-attribute similarity between the two values v_jx and v_jy taken by data objects u_x and u_y on the same attribute c_j is defined as follows:
SIa_j(v_jx, v_jy) = (log p · log q) / (log p + log q + log p · log q)    (6)
where log is the natural logarithm, p is the number of data objects whose value on attribute c_j is v_jx plus 1, and q is the number of data objects whose value on c_j is v_jy plus 1, with 1 ≤ j ≤ n. The +1 avoids a zero denominator when only one data object takes the value v_jx or v_jy on c_j (log 1 = 0), and the log function damps the growth of the similarity when the counts increase sharply. Equation (6) reflects that different occurrence frequencies represent different levels of importance of attribute values; the occurrence counts of v_jx and v_jy lie between 1 and m, and the similarity between the two values lies in (0, 1]; if v_jx and v_jy are distinct values, their similarity reaches its maximum when they occur the same number of times.
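Under the reconstruction of equation (6) used above (a log-smoothed frequency measure with the +1 correction; the printed formula is an image in the original publication), a sketch:

```python
import math

def intra_sim(count_x, count_y):
    """Equation (6) as reconstructed: intra-attribute similarity of two values
    of one attribute from their occurrence counts; the +1 keeps log(p) and
    log(q) strictly positive, so the denominator never vanishes."""
    p, q = count_x + 1, count_y + 1
    lp, lq = math.log(p), math.log(q)
    return (lp * lq) / (lp + lq + lp * lq)
```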
Definition of the intersection of co-occurring attribute values: for the two values v_jx and v_jy of attribute c_j and another attribute c_k (1 ≤ j, k ≤ n, j ≠ k), the intersection of the sets of values of c_k co-occurring with them is defined as follows:
W_k = v_k(I(v_jx)) ∩ v_k(I(v_jy))    (7)
where v_k(I(v_jx)) is the set of values taken on attribute c_k by all objects in I(v_jx); W_k thus contains all values of attribute c_k that co-occur with both v_jx and v_jy.
The Jaccard similarity coefficient is widely applied in clustering and classification. The Jaccard distance is a distance measure derived from the Jaccard similarity coefficient; it is given by:
δ_J(u_x, u_y) = 1 − J(u_x, u_y)    (8)
where J(u_x, u_y) is defined as:
J(u_x, u_y) = Σ_f min(u_xf, u_yf) / Σ_f max(u_xf, u_yf)    (9)
where u_x = (u_x1, u_x2, ..., u_xn) and u_y = (u_y1, u_y2, ..., u_yn) are real-valued n-dimensional vectors, and 1 ≤ f ≤ n.
Based on the Jaccard distance and formula (4), the inter-attribute similarity is defined from IRSI and W_k via the Jaccard similarity.
Inter-attribute similarity of one attribute relative to another: the similarity between the two values v_jx and v_jy of attribute c_j relative to another attribute c_k is defined as follows:
SIe_{k|j}(v_jx, v_jy) = Σ_{i=1}^{|W_k|} l_i / Σ_{i=1}^{|W_k|} e_i    (10)
where e_i = max(p_xi, p_yi) and l_i = min(p_xi, p_yi); p_xi = p(w_ki | v_jx) and p_yi = p(w_ki | v_jy) are the conditional probabilities of w_ki given the attribute values v_jx and v_jy, computed by equation (5), and w_ki is the i-th element of W_k; if W_k is the empty set, then SIe_{k|j} = a, where a is a positive number approaching 0.
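A sketch of equation (10) as reconstructed above (a Jaccard-style ratio of element-wise minima to maxima of the two conditional-probability vectors over W_k), reusing the objects_with and cond_prob helpers sketched earlier; the fallback a for an empty W_k is a small positive constant, per the definition:

```python
def inter_rel_sim(records, j, v_jx, v_jy, k, a=1e-6):
    """SIe_{k|j}(v_jx, v_jy): similarity of two values of attribute j
    relative to attribute k via the co-occurring value set W_k (eq. (7))."""
    w_k = ({records[i][k] for i in objects_with(records, j, v_jx)} &
           {records[i][k] for i in objects_with(records, j, v_jy)})
    if not w_k:
        return a                    # W_k empty: a positive number close to 0
    num = den = 0.0
    for w in w_k:
        p_x = cond_prob(records, j, v_jx, k, w)
        p_y = cond_prob(records, j, v_jy, k, w)
        num += min(p_x, p_y)        # l_i
        den += max(p_x, p_y)        # e_i
    return num / den if den else a
```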
The inter-attribute similarity is then defined as follows: the inter-attribute similarity between the two values v_jx and v_jy of attribute c_j is:
SIe_j(v_jx, v_jy) = Σ_{k=1, k≠j}^{o} γ_{k|j} · SIe_{k|j}(v_jx, v_jy)    (11)
where γ_{k|j} is the weight of attribute c_k with respect to attribute c_j, satisfying
Σ_{k=1, k≠j}^{o} γ_{k|j} = 1,
with γ_{k|j} ∈ [0, 1]; γ_{k|j} expresses the relationship between attributes c_j and c_k.
After the inter-attribute similarity is computed, the coupled attribute-value similarity is defined.
Coupled attribute-value similarity definition: the coupled similarity between the values v_jx and v_jy of attribute c_j is defined as:
S_j(v_jx, v_jy) = α · SIa_j + (1 − α) · SIe_j    (12)
where α ∈ [0, 1]; different values of α reflect different proportions of the intra-attribute and inter-attribute similarities in the overall object similarity. A larger α indicates that the intra-attribute coupling plays the more important role in object similarity, while a smaller α indicates that the inter-attribute coupling, i.e., the coupling between attribute c_j and the other attributes, plays the more important role; when α = 0.5, S_j weights SIa_j and SIe_j equally (their arithmetic mean).
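Equation (12) then mixes the two components. A sketch, assuming uniform weights γ_{k|j} = 1/(o − 1) in equation (11), since the weighting scheme itself is given only as an image in the original:

```python
def value_sim(records, j, v_jx, v_jy, attrs, counts_j, alpha=0.5):
    """Equation (12): S_j = alpha * SIa_j + (1 - alpha) * SIe_j, where SIe_j
    (equation (11)) averages SIe_{k|j} over the other selected attributes
    with uniform weights gamma_{k|j} (an illustrative assumption)."""
    s_intra = intra_sim(counts_j[v_jx], counts_j[v_jy])
    others = [k for k in attrs if k != j]
    s_inter = (sum(inter_rel_sim(records, j, v_jx, v_jy, k) for k in others)
               / len(others)) if others else 0.0
    return alpha * s_intra + (1 - alpha) * s_inter
```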
Coupled object similarity definition: the similarity S(u_x, u_y) between two objects u_x and u_y is defined as:
S(u_x, u_y) = Σ_{j=1}^{o} β_j · S_j(v_jx, v_jy)    (13)
where β_j is the weight of the attribute-value similarity of attribute c_j, with β_j ∈ [0, 1] and
Σ_{j=1}^{o} β_j = 1.
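Equation (13) finally aggregates the per-attribute value similarities into an object similarity; uniform weights β_j = 1/o are assumed here purely for illustration:

```python
from collections import Counter

def object_sim(records, x, y, attrs, alpha=0.5):
    """Equation (13): S(u_x, u_y) = sum_j beta_j * S_j(v_jx, v_jy), with
    beta_j = 1/o over the o selected attributes (uniform weighting is an
    illustrative assumption, not prescribed by the text)."""
    counts = {j: Counter(row[j] for row in records) for j in attrs}
    return sum(value_sim(records, j, records[x][j], records[y][j],
                         attrs, counts[j], alpha)
               for j in attrs) / len(attrs)
```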
Step 3: cluster-based outlier detection.
According to the computed inter-object similarities, a clustering algorithm is trained to obtain the clustering result, and the data points in the smallest subclass obtained by clustering are taken as outliers.
Data similarity is essentially a distance between data points: the larger the distance, the lower the similarity. The few data points in the smallest subclass are those far from the other data points and thus have low similarity to them, and points with low similarity to the other data points are treated as anomalies.
Clustering uses the similarity or distance between sample features as the basis for deciding whether samples belong to the same class: similar samples are grouped into one class, and dissimilar samples into different classes. Various common clustering methods can be adopted; this embodiment takes the K-means method as an example. The K-means algorithm first selects the specified cluster centers, where the number K of centers is chosen freely, e.g., K = 3: three objects are selected as cluster centers, each sample is assigned to the cluster of its nearest center, and after the 3 clusters are obtained, the mean vectors are recomputed; this process is repeated until the new clusters are identical to those of the previous round, at which point the algorithm stops and the final cluster partition is obtained. In this embodiment, training with the clustering algorithm on the computed inter-object similarities yields the clustering result; the objects in the resulting smallest subclass have, relative to the other objects, low similarity to most data objects, and are therefore regarded as abnormal.
In the final cluster partition, the class containing the fewest objects is found, and the data points in that cluster are taken as outliers, because the points of the smallest cluster have low similarity to the other points; these are reported as data anomalies.
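A sketch of step 3 under stated assumptions: the pairwise similarities are converted back to distances by inverting equation (4) (δ = 1/s − 1), SciPy's hierarchical clustering stands in for the clustering step (plain K-means needs vector means, which categorical data lacks, and the text leaves the specific clustering method open), and the smallest cluster is flagged as anomalous:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def detect_outliers(sim_matrix, n_clusters=3):
    """Cluster from a precomputed object-similarity matrix and return the
    indices of the smallest cluster as outliers (step 3 of the method)."""
    dist = 1.0 / np.clip(sim_matrix, 1e-12, None) - 1.0  # invert equation (4)
    np.fill_diagonal(dist, 0.0)
    labels = fcluster(linkage(squareform(dist, checks=False), method='average'),
                      t=n_clusters, criterion='maxclust')
    sizes = np.bincount(labels)[1:]                      # labels start at 1
    return np.where(labels == np.argmin(sizes) + 1)[0]
```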
The data anomaly detection method based on feature selection coupling similarity proposed in this embodiment is compared with common similarity measures. The following five state-of-the-art similarity/distance measures are compared with the method of this embodiment: the ALGO distance (ALGO), Coupled Object Similarity (COS), the Distance Metric (DM) method, the Hamming distance (HM), and the Occurrence Frequency (OF) measure. All of these measures, together with the similarity proposed in this embodiment, are plugged into the typical distance-based algorithm K-means, and their clustering performance on categorical data is compared to evaluate which similarity measure gives better results.
8 UCI data sets were used for the experiments. Table 1 lists the detailed characteristics of these 8 data sets: the number of objects, the number of attributes, the number of distinct values over all attributes, the number of classes, and an abbreviation, where the abbreviation is a shortened form of the data set name. All numerical attributes in the data sets are removed, so that only the similarity of categorical data is tested.
As external criteria, some commonly used measures are selected to compare the clustering results of the different similarity measures; each is computed between the cluster label assigned to each object by the clustering algorithm and the ground truth given by the class labels in the source data. Each criterion can be expressed as a score: the larger the score, the better the clustering performance, the more effective the corresponding similarity measure, and the more effective the outlier detection. The F-score is defined as follows:
F1 = 2·TP / (2·TP + FP + FN)    (14)
where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
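Equation (14) in code form, a trivial sketch:

```python
def f_score(tp, fp, fn):
    """Equation (14): F1 = 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```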
TABLE 1 data set characterization
Data set        Objects   Attributes   Distinct values   Classes   Abbrev.
Soybeansmall    47        35           97                4         So
Zoo             101       16           36                7         Zo
DNAPromoter     106       57           228               2         Dp
Hayesroth       132       4            15                3         Ha
Lymphography    148       18           59                4         Ly
Hepatitis       155       13           36                2         He
Housevotes      232       16           32                2         Ho
Spect           267       22           44                2         Sp
The results obtained by combining the proposed method and the 5 other similarity-calculation methods with K-means clustering are shown in Table 2. Compared with the other similarity measures, the proposed method achieves a better clustering effect, which improves the classification performance and benefits the subsequent outlier detection.
TABLE 2 comparison of several similarity calculation methods F-score results
(Table 2 appears as an image in the original document.)
For the method proposed in this embodiment, different values of α produce different classification effects, and the optimal value of α also differs from data set to data set. The clustering performance on the different data sets under different values of α is shown in Table 3 below.
TABLE 3 comparison of alpha values for performance
(Table 3 appears as an image in the original document.)
The experimental comparison shows that the optimal α of the similarity calculation differs across data sets, so in practice a single parameter value cannot simply be adopted for all data sets. However, much previous work indicates that the theoretical optimum of α is 0.5: as described above, when α = 0.5 the object similarity weights the intra-attribute and inter-attribute similarities equally, and in theory this yields the best similarity measure, so the subsequent classification is more accurate and the performance is better.
Most outlier detection methods combine clustering with outlier detection. This embodiment demonstrates that, by using feature selection with coupled similarity, the computed inter-object similarities become more accurate, which greatly helps the subsequent classification; it reduces the false positive rate of classification, makes the classification result more accurate, and makes it easier to determine the subclass regarded as abnormal, so the anomaly detection result deviates less and the method is more efficient.
Finally, it should be noted that the above examples are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions as defined in the appended claims.

Claims (1)

1. A data anomaly detection method based on feature selection coupling similarity, characterized by comprising the following steps:
step 1: feature selection, namely removing irrelevant features from the data, i.e., the features irrelevant to the current learning task;
a backward feature-subset search is adopted: starting from the full feature set, one feature is removed at a time to form a number of candidate subsets; given a data set D, let U be the set of data objects in D and C the set of attributes of each data object, i.e., U = {u_1, u_2, ..., u_m} and C = {c_1, c_2, ..., c_n}, where U and C are non-empty and m and n are determined by the actually given data set, and let V be the set of all attribute values; for an attribute subset A, D is partitioned according to the values of A into G data subsets {D_1, D_2, ..., D_G}, and the information gain of A is computed by the following formula;
Gain(A) = Ent(D) − Σ_{g=1}^{G} (|D_g| / |D|) · Ent(D_g)    (1)
wherein the information entropy is defined as:
Ent(D) = −Σ_{r=1}^{|y|} p_r · log2(p_r)    (2)
where p_r (r = 1, 2, ..., |y|) is the proportion of samples of the r-th class in D, |y| is the total number of classes in the data set, and y is the sample label information;
formula (1) is taken as the evaluation criterion for candidate feature subsets: the larger Gain(A) is, the more classification-relevant information the feature subset A contains; for each candidate feature subset, its information gain is therefore computed on the data set D as the evaluation criterion; the optimal feature subset must satisfy the optimization objective:
max Gain(A) (3)
as features are successively removed, once the subset becomes empty or the information gain no longer increases, the attribute subset A' with the largest information gain is selected as the optimal feature subset; it contains o features, with o ≤ n;
step 2: performing the coupled similarity calculation on the optimal feature subset selected in step 1;
the method for converting the distance metric into the similarity metric is as follows:
s(u_x, u_y) = 1/(1 + δ(u_x, u_y))    (4)
where δ(u_x, u_y) is the distance between data object u_x and data object u_y, and s(u_x, u_y) is the similarity between them, with 1 ≤ x ≤ m and 1 ≤ y ≤ m; when x = y, the similarity of the two data objects is 1;
conditional probability of attribute values: given a value v_k of attribute c_k, with c_k ∈ C and 1 ≤ k ≤ n, and the value v_jx of the j-th attribute of object u_x, with u_x ∈ U, the conditional probability p(v_k | v_jx) of v_k given v_jx is defined as:
p(v_k | v_jx) = |I(v_jx, v_k)| / |I(v_jx)|    (5)
where I(v_jx) denotes the set of data objects whose j-th attribute value is v_jx, and I(v_jx, v_k) the set of data objects taking both v_jx and v_k;
intra-attribute similarity definition: the intra-attribute similarity between the two values v_jx and v_jy taken by data objects u_x and u_y on the same attribute c_j is defined as follows:
SIa_j(v_jx, v_jy) = (log p · log q) / (log p + log q + log p · log q)    (6)
where log is the natural logarithm, p is the number of data objects whose value on attribute c_j is v_jx plus 1, q is the number of data objects whose value on c_j is v_jy plus 1, and 1 ≤ j ≤ n; equation (6) reflects that different occurrence frequencies represent different levels of importance of attribute values; the occurrence counts of v_jx and v_jy lie between 1 and m, and the similarity between the two values lies in (0, 1]; if v_jx and v_jy are distinct values, their similarity reaches its maximum when they occur the same number of times;
definition of the intersection of co-occurring attribute values: for the two values v_jx and v_jy of attribute c_j and another attribute c_k (1 ≤ j, k ≤ n, j ≠ k), the intersection of the sets of values of c_k co-occurring with them is defined as follows:
W_k = v_k(I(v_jx)) ∩ v_k(I(v_jy))    (7)
where v_k(I(v_jx)) is the set of values taken on attribute c_k by all objects in I(v_jx); W_k thus contains all values of attribute c_k that co-occur with both v_jx and v_jy;
based on the Jaccard distance and formula (4), the inter-attribute similarity is defined from IRSI and W_k via the Jaccard similarity;
the Jaccard distance is as follows:
δ_J(u_x, u_y) = 1 − J(u_x, u_y)    (8)
where J(u_x, u_y) is defined as:
J(u_x, u_y) = Σ_f min(u_xf, u_yf) / Σ_f max(u_xf, u_yf)    (9)
where u_x = (u_x1, u_x2, ..., u_xn) and u_y = (u_y1, u_y2, ..., u_yn) are real-valued n-dimensional vectors, and 1 ≤ f ≤ n;
inter-attribute similarity of one attribute relative to another: the similarity between the two values v_jx and v_jy of attribute c_j relative to another attribute c_k is defined as follows:
SIe_{k|j}(v_jx, v_jy) = Σ_{i=1}^{|W_k|} l_i / Σ_{i=1}^{|W_k|} e_i    (10)
where e_i = max(p_xi, p_yi) and l_i = min(p_xi, p_yi); p_xi = p(w_ki | v_jx) and p_yi = p(w_ki | v_jy) are the conditional probabilities of w_ki given the attribute values v_jx and v_jy, computed by equation (5), and w_ki is the i-th element of W_k; if W_k is the empty set, then SIe_{k|j} = a, where a is a positive number approaching 0;
the inter-attribute similarity is then defined as follows: the inter-attribute similarity between the two values v_jx and v_jy of attribute c_j is:
SIe_j(v_jx, v_jy) = Σ_{k=1, k≠j}^{o} γ_{k|j} · SIe_{k|j}(v_jx, v_jy)    (11)
where γ_{k|j} is the weight of attribute c_k with respect to attribute c_j, satisfying
Σ_{k=1, k≠j}^{o} γ_{k|j} = 1,
with γ_{k|j} ∈ [0, 1]; γ_{k|j} expresses the relationship between attributes c_j and c_k;
after the inter-attribute similarity is computed, the coupled attribute-value similarity is defined;
coupled attribute-value similarity definition: the coupled similarity between the values v_jx and v_jy of attribute c_j is defined as:
S_j(v_jx, v_jy) = α · SIa_j + (1 − α) · SIe_j    (12)
where α ∈ [0, 1]; different values of α reflect different proportions of the intra-attribute and inter-attribute similarities in the overall object similarity; a larger α indicates that the intra-attribute coupling plays the more important role in object similarity, while a smaller α indicates that the inter-attribute coupling, i.e., the coupling between attribute c_j and the other attributes, plays the more important role; when α = 0.5, S_j weights SIa_j and SIe_j equally (their arithmetic mean);
coupled object similarity definition: the similarity S(u_x, u_y) between two objects u_x and u_y is defined as:
S(u_x, u_y) = Σ_{j=1}^{o} β_j · S_j(v_jx, v_jy)    (13)
where β_j is the weight of the attribute-value similarity of attribute c_j, with β_j ∈ [0, 1] and
Σ_{j=1}^{o} β_j = 1;
step 3: cluster-based outlier detection;
according to the computed inter-object similarities, a clustering algorithm is trained to obtain the clustering result, and the data points in the smallest subclass obtained by clustering are taken as outliers.
CN202110205936.8A 2021-02-24 2021-02-24 Data anomaly detection method based on feature selection coupling similarity Withdrawn CN112926650A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052885A * 2023-02-07 2023-05-02 齐鲁工业大学(山东省科学院) System, method, equipment and medium for improving prognosis prediction precision based on improved Relieff cancer histology feature selection algorithm
CN116052885B * 2023-02-07 2024-03-08 齐鲁工业大学(山东省科学院) System, method, equipment and medium for improving prognosis prediction precision based on improved Relieff cancer histology feature selection algorithm
CN116361345A * 2023-06-01 2023-06-30 新华三人工智能科技有限公司 Feature screening and classifying method, device, equipment and medium for data stream
CN116361345B * 2023-06-01 2023-09-22 新华三人工智能科技有限公司 Feature screening and classifying method, device, equipment and medium for data stream


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 2021-06-08)