CN107341514B - Abnormal point and edge point detection method based on joint density and angle - Google Patents

Abnormal point and edge point detection method based on joint density and angle

Info

Publication number
CN107341514B
CN107341514B CN201710548763.3A CN201710548763A CN107341514B CN 107341514 B CN107341514 B CN 107341514B CN 201710548763 A CN201710548763 A CN 201710548763A CN 107341514 B CN107341514 B CN 107341514B
Authority
CN
China
Prior art keywords
points
point
calculating
data set
sample point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710548763.3A
Other languages
Chinese (zh)
Other versions
CN107341514A (en)
Inventor
李孝杰
吴锡
吕建成
周激流
李莉丽
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN201710548763.3A
Publication of CN107341514A
Application granted
Publication of CN107341514B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an abnormal point and edge point detection method based on joint density and angle. The method rests on the observation that edge points and abnormal points have lower local density and smaller angle variance. It combines the angle and density information of a data set into a joint measure that scores the degree to which each sample point belongs to the abnormal points and edge points, and it determines the special points automatically by setting a threshold. The method is stable, improves the performance of edge point and abnormal point detection, better reflects the characteristics of the data set, and can detect and remove noise data. It overcomes the poor and unstable detection of special points in complex data sets that characterizes the prior art.

Description

Abnormal point and edge point detection method based on joint density and angle
Technical Field
The invention relates to the field of data detection, in particular to an abnormal point and edge point detection method based on joint density and angle.
Background
Traditional clustering, classification, and pattern recognition techniques aim to find general patterns, whereas the detection of special points, including edge points and abnormal points (outliers), is often used to identify valid, interesting, and potentially valuable patterns in the data. Detecting special points is often a more meaningful task than detecting normal points. In addition, most algorithms are affected by outliers. For example, the description of the well-known nonlinear manifold learning dimension reduction algorithm Isomap does not discuss outlier detection, yet the code provided by its authors includes an outlier detection step. Correctly detecting abnormal points in a complex space is therefore an urgent practical problem and an important task in data preprocessing.
Researchers at home and abroad have proposed various detection algorithms from different technical perspectives. According to how outliers are determined, the algorithms can be divided into global models and local models. A global model makes a binary decision for every observation, judging whether or not it is an abnormal point. A local model typically assigns each observation a score (e.g., an angular change factor) that estimates the degree to which the point is an outlier. Depending on whether data label information is required, the algorithms can also be classified as supervised, semi-supervised, or unsupervised. At present, most algorithms fall into five classes of shallow techniques: statistical-based, distance-based, neighborhood- or density-based, clustering-based, and deviation-based.
Distance-based methods are currently the most common because of their geometric clarity. Such a method uses distance as the measure and judges points without "enough" neighbors to be abnormal data, extending the idea of statistical-based discordancy tests. In high-dimensional space, however, distances lose their contrast:
$$\lim_{m \to \infty} \frac{dist_{max} - dist_{min}}{dist_{min}} \to 0$$
where $m$ is the dimension of the data, and $dist_{max}$ and $dist_{min}$ are the distances from the current point to its farthest neighbor and nearest neighbor, respectively. As $m$ increases, the expression above tends to zero, so distance metrics are more or less ill-suited to the high-dimensional data that is now ubiquitous. To address this problem, Professor Hans-Peter Kriegel of the University of Munich, Germany, proposed the Angle-Based Outlier Detection (ABOD) algorithm. This method solves the problem to a certain extent but introduces a new neighbor relation problem.
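The loss of contrast described above can be observed numerically. The following is a minimal sketch and not part of the method of the invention: it assumes, purely for illustration, that points are drawn uniformly at random from the unit hypercube, and it computes the ratio (dist_max - dist_min)/dist_min from one query point. The function name `relative_contrast` and its parameters are hypothetical.

```python
import math
import random


def relative_contrast(m, n_points=200, seed=0):
    """Return (dist_max - dist_min) / dist_min from one random query point
    to n_points random points in the m-dimensional unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(m)]
    dists = [
        math.dist(query, [rng.random() for _ in range(m)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# Print the ratio for a few dimensions m.
for m in (2, 10, 100, 500):
    print(m, round(relative_contrast(m), 3))
```

Under these assumptions the printed ratio becomes much smaller as m grows, which is the behavior the limit above describes.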
Density-based methods usually use the Local Outlier Factor (LOF) to judge abnormal points: the lower the density of the region where a data point lies, the higher the probability that the point is an outlier, which corresponds to a larger LOF value. Classical methods include the LOF method, the INFLO method, and the reverse kNN method.
Kriegel proposed the angle-based outlier detection (ABOD) algorithm in 2008; it avoids the parameter selection problem and alleviates the curse of dimensionality to some extent. However, ABOD considers only the relationship between the current point and its neighbors and takes no further neighbor relationships into account, so it easily identifies the wrong points as outliers. Existing edge point detection determines edge points from label information, that is, prior information; in some application environments such prior information cannot be obtained, which limits the range of application of those algorithms.
Most of the classical outlier detection algorithms above work well only under specific conditions or in specific fields; when the data dimensionality is high, their results are unsatisfactory and their generalization ability is weak. Moreover, most existing algorithms detect abnormality from a single characteristic of the data, such as density or angle, so their performance is unstable on complex data sets.
Therefore, how to improve the performance and stability of the abnormal point and edge point detection algorithm becomes an urgent problem to be solved in the field of data detection.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an abnormal point and edge point detection method based on joint density and angle, which comprises the following steps:
Step 1: input a data set $X \in R^{m \times n}$, where $X$ is an $m \times n$ matrix and each column of $X$ represents one data sample; that is, $X$ comprises $n$ samples, each of dimension $m$, with $x_i \in R^m$, $i \in \{1, 2, \ldots, n\}$, where $m$ is the sample dimension and $n$ is the number of samples in the data set;
Step 2: set the number $k$ of neighbors required for calculating the joint angle information to $k = \mathrm{floor}(5\log_{10}(n))$;
Step 3: calculate the local density information $\rho = [\rho_1, \rho_2, \ldots, \rho_n]$ of the data points; for each sample $x_i$, calculate the corresponding local density value $\rho_i$ according to formula (1):
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \qquad \chi(z) = \begin{cases} 1, & z < 0 \\ 0, & z \ge 0 \end{cases} \tag{1}$$
where $d_{ij}$ is the Euclidean distance between sample $x_i$ and sample $x_j$ and $d_c$ is the cutoff distance; formula (1) compares $d_{ij}$ with $d_c$ and counts the number of neighbors of the current sample point $x_i$, which is taken as the local density value $\rho_i$ of $x_i$;
Step 4: calculate the joint angle information $\zeta = [\zeta_1, \zeta_2, \ldots, \zeta_n]$; for each sample point $x_i$, calculate the corresponding joint angle information value $\zeta_i$, which comprises a first local angle measure and a second local angle measure $\tau_i$; calculating $\zeta_i$ comprises the following steps:
Step 41: data processing; for each data point $x_i$, $i \in \{1, 2, \ldots, n\}$, select its $k$ neighbors and apply normalization and decoupling operations; the result is denoted $X_i$, and the specific operation is shown in formula (2):
Figure GDA0002462577070000032
let
Figure GDA0002462577070000033
then
Figure GDA0002462577070000034
Step 42: calculate the first local angle measure; for sample $x_i$ it is calculated as follows:
Figure GDA0002462577070000035
Step 43: calculate the second local angle measure $\tau_i$; for each sample $x_i$ and its corresponding $X_i$, approximate the normal vector $v_i$ of $X_i$ by the mean method, i.e. $v_i = \mathrm{mean}(X_i)$; then calculate $\tau_i$ according to formula (3), based on the relation between the normal and the angle:
Figure GDA0002462577070000036
Step 44: for each data sample $x_i$, $i \in \{1, 2, \ldots, n\}$, calculate the joint angle value $\zeta_i$ as the ratio of the first local angle measure to the second local angle measure $\tau_i$;
Step 5: determine the special points, comprising the following steps:
Step 51: determine the threshold $\gamma$:
Figure GDA0002462577070000041
Step 52: for sample point $x_i$, if $\rho_i/\gamma < \min(\mathrm{mean}(\rho)/\gamma, \max(\zeta))$, then $x_i$ is judged to be an abnormal point; step 52 is repeated for each sample point to determine whether it is an abnormal point or a normal point.
According to a preferred embodiment, the method of calculating local density values in step 3 comprises:
Step 31: calculate the Euclidean distance between every pair of samples in the data set $X$, obtaining a distance matrix $D \in R^{n \times n}$, where element $d_{ij}$ of $D$ is the Euclidean distance between sample $x_i$ and sample $x_j$;
Step 32: set the cutoff distance $d_c$: sort the $n \times n$ elements of the distance matrix $D$ in ascending order into a vector $sd$ and set $d_c = sd(p)$, the $p$-th element of $sd$, where $p = \mathrm{round}((n-1) \cdot n \cdot percent/100)$;
Step 33: compare $d_{ij}$ with $d_c$ and count the number of neighbors of the current sample point $x_i$, which is taken as the local density value $\rho_i$ of $x_i$; by formula (1), when $d_{ij}$ is less than $d_c$, $x_j$ is considered a neighbor of the current point $x_i$; otherwise it is not.
According to a preferred embodiment, the method further comprises verifying the special points, including:
Step 6: visual display; for a data set $X \in R^{m \times n}$ with $1 \le m \le 3$, mark and display the determined special points in the visual space to judge whether they lie at the edge of the data set; for a data set $X \in R^{m \times n}$ with $m > 3$, first reduce $X$ to the visual space using a classical dimension reduction method and then mark and display the points.
The invention has the beneficial effects that:
1. Based on the observation that edge points and abnormal points have lower local density and smaller angle variance, the invention combines the angle and density information of the data set to provide a stable method for detecting abnormal points and edge points. The method improves detection performance, better reflects the characteristics of the data set, and can detect and remove noise data, overcoming the poor and unstable detection of special points in complex data sets in the prior art.
2. The invention does not need to acquire prior information from an unknown environment in advance, which overcomes the prior art's dependence on prior information and widens the range of application of the algorithm.
Drawings
FIG. 1 is a flow chart of the special point determination method of the present invention;
FIG. 2 shows the effect of the data processing of step 41;
FIG. 3 shows the special points determined by the method of the present invention on the two-dimensional Flame data set;
FIG. 4 shows the special points determined by the method of the present invention on the two-dimensional XOR data set;
FIG. 5 shows the special points determined by the method of the present invention on the two-dimensional Banana data set;
FIG. 6 shows the special points determined by the method of the present invention on the noisy two-dimensional PanelB data set.
Detailed Description
The following detailed description is made with reference to the accompanying drawings.
The data set in the present invention is a matrix of m × n dimensions, in which each column represents a sample.
The abnormal point in the present invention means: an observation that lies so far from the other points that it is suspected of having been produced by a different data generation mechanism.
The edge points in the present invention mean: points located at the edge of a higher-density region of the data set.
The special points in the present invention include edge points and abnormal points.
The neighbors in the present invention are: points that are similar to the current point under some measure.
The cutoff distance in the present invention means: the critical distance below which a point is counted as a neighbor.
The local density value $\rho$ in the present invention means: the number of neighbors whose distance to the current point is smaller than the cutoff distance.
The joint angle value $\zeta_i$ in the present invention means: the ratio of the two angle measures, that is, the first local angle value divided by the second local angle value $\tau_i$.
The first local angle value in the present invention means: the variance of the inner products of the neighbor vectors. The smaller this variance, the more likely the current point is an outlier.
The second local angle value $\tau_i$ in the present invention means: the number of neighbor vectors whose inner product with the normal is positive. A positive inner product means that the angle between the neighbor vector and the normal is acute, and such a neighbor is counted as a neighbor under the angle measure. The larger the second local angle value, the more likely the current point is an outlier.
The joint measure in the present invention means: the current point is judged jointly by a small variance and a large number of neighbors under the angle measure. The smaller the first local angle value and $1/\tau_i$ both are, the more likely the current point is an outlier.
Based on the observation that edge points and abnormal points have lower local density and smaller angle variance, the invention provides a stable abnormal point and edge point detection method. It combines the angle and density characteristics of the data set into a joint measure that scores the degree to which each sample point belongs to the abnormal points and edge points, and it determines the special points automatically by setting a threshold $\gamma$, which improves the performance of edge point and abnormal point detection and better reflects the characteristics of the data set.
FIG. 1 is a flow chart of a method for determining a special point according to the present invention. As shown in fig. 1, the method for detecting outliers and edge points based on joint density and angle of the present invention includes the following steps:
Step 1: input a data set $X \in R^{m \times n}$. Here $X = (x_1, x_2, x_3, \ldots, x_n)$ is an $m \times n$ matrix in which each column represents one data sample $x_i \in R^m$, $i \in \{1, 2, \ldots, n\}$; that is, $X$ comprises $n$ samples, each of dimension $m$. $R$ denotes the data space, $m$ the sample dimension, and $n$ the number of samples in the data set. Each sample corresponds to a point in $m$-dimensional space; for example, a 3-dimensional column vector corresponds to a point in 3-dimensional space that can be represented by $(x, y, z)$ coordinates.
For example, given 10 pictures of size 20 × 20, each image can be flattened into a column vector of size 400 × 1, and the 10 images then form a data set $X \in R^{400 \times 10}$; that is, $X$ is a matrix of size 400 × 10.
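The picture example can be sketched in a few lines. This is only an illustration under stated assumptions: images are represented as nested Python lists, pixels are flattened row by row (the text does not specify the flattening order), and the helper name `images_to_matrix` is hypothetical.

```python
def images_to_matrix(images):
    """Flatten equally sized 2-D images (lists of rows) into the columns of X.

    Returns X as a list of m rows, each holding n values, so that
    column j of X is the flattened j-th image.
    """
    # Assumption: row-major flattening of each image into one column vector.
    columns = [[pixel for row in img for pixel in row] for img in images]
    m = len(columns[0])
    if any(len(col) != m for col in columns):
        raise ValueError("all images must have the same size")
    # Transpose the list of columns into an m x n matrix.
    return [[col[r] for col in columns] for r in range(m)]


# Ten synthetic 20 x 20 "pictures"; picture i is filled with the value i.
images = [[[float(i)] * 20 for _ in range(20)] for i in range(10)]
X = images_to_matrix(images)
print(len(X), len(X[0]))  # 400 10
```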
Step 2: set the number $k$ of neighbors required for calculating the joint angle information to $k = \mathrm{floor}(5\log_{10}(n))$. The floor(·) function rounds down; for example, floor(5.4) gives 5. Compared with setting the number of neighbors by manual experience, setting it automatically takes the characteristics of the data into account and adapts better.
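As a minimal sketch of step 2, assuming that "5log10(n)" denotes five times the base-10 logarithm of n, the neighbor count can be computed as follows. The function name `neighbor_count` is hypothetical.

```python
import math


def neighbor_count(n):
    """Number of neighbors k = floor(5 * log10(n)) for a data set of n samples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.floor(5 * math.log10(n))


print(neighbor_count(100))  # 10
print(neighbor_count(178))  # 11
```

For very small data sets the formula yields few or no neighbors (for example k = 0 when n = 1), so a practical implementation would probably need a lower bound; the text does not specify one.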
Step 3: calculate the local density information $\rho = [\rho_1, \rho_2, \ldots, \rho_n]$ of the data points; for each sample point $x_i$, calculate the corresponding local density value $\rho_i$ according to formula (1):
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \qquad \chi(z) = \begin{cases} 1, & z < 0 \\ 0, & z \ge 0 \end{cases} \tag{1}$$
where $d_{ij}$ is the Euclidean distance between sample $x_i$ and sample $x_j$, $d_c$ is the cutoff distance, and $z = d_{ij} - d_c$. Formula (1) compares $d_{ij}$ with $d_c$ and counts the number of neighbors of the current sample point $x_i$, which is taken as the local density value $\rho_i$ of $x_i$.
Existing density-based detection algorithms depend heavily on several parameters, such as the number of nearest neighbors and the radius of the local region. When the densities of the clusters in a data set differ greatly, and particularly when high-dimensional data are sparsely distributed, the performance of most density-based methods deteriorates. In calculating the local density information, the technical scheme of the invention uses a single parameter, $d_c$, which reduces external influence as far as possible and further improves detection performance.
Step 31: calculate the Euclidean distance between every pair of samples in the data set $X$, obtaining a distance matrix $D \in R^{n \times n}$, where element $d_{ij}$ of $D$ is the Euclidean distance between sample $x_i$ and sample $x_j$, with $i$ and $j$ each ranging from 1 to $n$.
Step 32: set the cutoff distance $d_c$. Sort the $n \times n$ elements of the distance matrix $D$ in ascending order into a vector $sd$ and set $d_c = sd(p)$, the $p$-th element of $sd$, where $p = \mathrm{round}((n-1) \cdot n \cdot percent/100)$ and percent is typically set to 0.2. The round(·) function rounds to the nearest integer.
Step 33: compare $d_{ij}$ with $d_c$ and count the number of neighbors of the current sample point $x_i$, which is taken as the local density value $\rho_i$ of $x_i$ according to formula (1). By formula (1), when $d_{ij}$ is less than $d_c$, $x_j$ is considered a neighbor of the current point $x_i$; otherwise it is not.
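Steps 31 to 33 can be sketched as follows. This is an illustrative reading rather than a definitive implementation, and it makes three assumptions that the text does not state: a point is not counted as its own neighbor, round(·) rounds halves upward, and the position p is clamped to a valid 1-based index (for small n and percent = 0.2 the formula gives p = 0). All function names are hypothetical.

```python
import math


def distance_matrix(points):
    """Step 31: pairwise Euclidean distances d_ij as an n x n nested list."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def cutoff_distance(D, percent=0.2):
    """Step 32: d_c = sd(p), with sd the ascending sort of all n*n entries of D."""
    n = len(D)
    sd = sorted(d for row in D for d in row)
    p = math.floor((n - 1) * n * percent / 100 + 0.5)  # assumption: round half up
    p = min(max(p, 1), len(sd))  # assumption: clamp p to a valid 1-based position
    return sd[p - 1]


def local_density(D, d_c):
    """Step 33: rho_i = number of other points x_j with d_ij < d_c."""
    n = len(D)
    # Assumption: the point itself (j == i, distance 0) is excluded.
    return [sum(1 for j in range(n) if j != i and D[i][j] < d_c) for i in range(n)]


# Four corners of a unit square plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
D = distance_matrix(points)
print(cutoff_distance(D, percent=40))  # 1.0
print(local_density(D, 1.5))           # [3, 3, 3, 3, 0]
```

In the example, the distant point has no neighbor within the cutoff and therefore receives the lowest local density.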
Step 4: calculate the joint angle information $\zeta = [\zeta_1, \zeta_2, \ldots, \zeta_n]$; for each sample point $x_i$, calculate the corresponding joint angle value $\zeta_i$, which comprises a first local angle value and a second local angle value $\tau_i$. Calculating $\zeta_i$ comprises the following steps:
Step 41: data processing. For each data point $x_i \in R^m$, $i \in \{1, 2, \ldots, n\}$, select its $k$ neighbors and apply normalization and decoupling operations; the result is denoted $X_i$. The specific operation is shown in the following formula:
Figure GDA0002462577070000071
Let
Figure GDA0002462577070000072
Then
Figure GDA0002462577070000073
Most existing detection methods rely heavily on distance measures, and the separability of distance-based measures is poor, especially in high-dimensional space. A normalization and decoupling preprocessing step is therefore used to remove, as far as possible, the influence of distance on the angle measures.
After step 41, the neighbors of $x_i$ are pulled onto a unit circle; that is, the distance from any
Figure GDA0002462577070000074
to $x_i$ is 1, which removes the influence of distance.
FIG. 2 shows the effect of the data processing of step 41. As can be seen from FIG. 2, the distances from all neighbors of $x_i$ to $x_i$ are 1.
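The effect of step 41 can be sketched as follows. Because formula (2) is not reproduced in this text, the sketch rests on assumptions: the k neighbors are taken to be the k nearest points by Euclidean distance, and each neighbor is replaced by the unit vector pointing from $x_i$ to it, which places every neighbor at distance 1 from $x_i$ as the description requires. The function name `unit_neighbor_vectors` is hypothetical.

```python
import math


def unit_neighbor_vectors(points, i, k):
    """Return unit direction vectors from points[i] to its k nearest neighbors."""
    xi = points[i]
    # Assumption: neighbors are the k nearest points by Euclidean distance.
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(xi, points[j]),
    )
    vectors = []
    for j in order[:k]:
        diff = [a - b for a, b in zip(points[j], xi)]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm == 0.0:
            continue  # duplicate of x_i: the direction is undefined, so skip it
        vectors.append([c / norm for c in diff])
    return vectors


points = [(0, 0), (3, 4), (0, 2), (10, 0)]
print(unit_neighbor_vectors(points, 0, 2))  # [[0.0, 1.0], [0.6, 0.8]]
```

After this normalization only the directions of the neighbors remain, so the subsequent angle measures no longer depend on how far away the neighbors are.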
Step 42: calculate the first local angle value. For sample $x_i$, its value is calculated as follows:
Figure GDA0002462577070000075
Because
Figure GDA0002462577070000081
and
Figure GDA0002462577070000082
it follows that
Figure GDA0002462577070000083
Thus, the first local angle value can be converted into an inner product measure, where $\theta_{ij}$ denotes the angle between the $i$-th sample
Figure GDA0002462577070000084
and the $j$-th sample
Figure GDA0002462577070000085
Here
Figure GDA0002462577070000086
refers to $X_i$ in formula (2), and var(·) is the function that takes the variance.
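A hedged sketch of an inner-product variance of this kind is given below. The exact formula is not reproduced in this text, so the sketch assumes that the variance is the population variance taken over the inner products of all unordered pairs of distinct unit neighbor vectors; for unit vectors each inner product equals the cosine of the angle between the two vectors. The function name `first_angle_measure` is hypothetical.

```python
from itertools import combinations
from statistics import pvariance


def first_angle_measure(unit_vectors):
    """Variance of the pairwise inner products (cosines) of unit neighbor vectors."""
    # Assumption: all unordered pairs of distinct neighbor vectors are used.
    cosines = [
        sum(a * b for a, b in zip(u, v))
        for u, v in combinations(unit_vectors, 2)
    ]
    return pvariance(cosines) if cosines else 0.0


# Neighbors spread in all directions around the current point.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
# Neighbors that all lie on one side of the current point.
one_sided = [[1.0, 0.0], [0.6, 0.8], [0.8, 0.6], [0.8, -0.6]]

print(first_angle_measure(spread) > first_angle_measure(one_sided))  # True
```

In this toy example the one-sided neighborhood yields the smaller variance, which matches the statement that edge points and abnormal points show smaller angle variance.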
In general, the smaller the first local angle value, the greater the probability that $x_i$ is an edge point. However, some special edge points, such as points lying between two classes, have a higher first local angle value: when $k$ is large, the neighbors of a point between two classes are drawn from both classes rather than from a single one, so its angles vary more.
To reduce the misjudgment of such points between two data classes as normal points, the invention proposes detection with a joint angle measure.
Step 43: calculate the second local angle value $\tau_i$. For each sample $x_i$ and its corresponding $X_i$, approximate the normal vector $v_i$ of $X_i$ by the mean method, i.e. $v_i = \mathrm{mean}(X_i)$, and calculate $\tau_i$ according to formula (3), based on the relation between the normal and the angle:
Figure GDA0002462577070000087
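Formula (3) is not reproduced in this text. The sketch below therefore follows the verbal definition of the second local angle value given earlier in the description, namely the number of neighbor vectors whose inner product with the normal is positive, with the normal approximated by the mean of the neighbor vectors. The function name `second_angle_measure` is hypothetical, and treating an inner product of exactly zero as not positive is an assumption.

```python
def second_angle_measure(unit_vectors):
    """Count neighbor vectors with a positive inner product with v = mean(X_i)."""
    k = len(unit_vectors)
    dim = len(unit_vectors[0])
    # Approximate normal vector: component-wise mean of the neighbor vectors.
    v = [sum(u[d] for u in unit_vectors) / k for d in range(dim)]
    # Assumption: an inner product of exactly zero is not counted.
    return sum(1 for u in unit_vectors if sum(a * b for a, b in zip(u, v)) > 0)


# Neighbors that all lie on one side of the current point.
one_sided = [[1.0, 0.0], [0.6, 0.8], [0.8, 0.6], [0.8, -0.6]]
# Neighbors spread over a half plane and its boundary.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(second_angle_measure(one_sided))  # 4
print(second_angle_measure(spread))     # 1
```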
Step 44: for each data sample point $x_i \in R^m$, $i \in \{1, 2, \ldots, n\}$, calculate the joint angle value $\zeta_i$ as the ratio of the first local angle value to the second local angle value $\tau_i$.
The invention uses two angle measures to make the algorithm more robust: the first local angle value is based on the variance of the inner products of the neighbor vectors of the current point, and $\tau_i$ is defined by the angles between the neighbor vectors and the normal.
Step 5: determine the abnormal points and edge points, comprising the following steps:
Step 51: determine the threshold $\gamma$:
Figure GDA0002462577070000088
Step 52: for sample point $x_i$, if $\rho_i/\gamma < \min(\mathrm{mean}(\rho)/\gamma, \max(\zeta))$, then $x_i$ is judged to be an abnormal point; step 52 is repeated for each sample point to determine whether it is an abnormal point or a normal point.
The special points are verified as follows:
Step 6: visual display. For a data set $X \in R^{m \times n}$ with $1 \le m \le 3$, mark and display the determined special points in the visual space to judge whether they lie at the edge of the data set. For a data set $X \in R^{m \times n}$ with $m > 3$, the data set is high-dimensional; it is first reduced to the visual space using a classical dimension reduction method and then marked and displayed.
Current dimension reduction algorithms mainly include Principal Component Analysis (PCA), Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LE).
The feasibility of the special point determination method can also be verified by checking whether the detected special points improve the performance of a clustering or classification algorithm.
After the special points have been found, data analysis can be performed to uncover potentially valuable patterns in the data set, that is, characteristics of the data. For example, people who lie between healthy people and patients with liver disease, who are about to become ill but are not yet ill, form a special population; analyzing this population helps in studying the characteristics of liver disease, and people on the edge of liver disease deserve attention. The analysis of special points therefore has important research significance.
To further illustrate the effect of the BPDAD algorithm proposed by the invention in detecting special points, experiments were carried out on different data sets.
Points in the data set are shown as small circles, and the detected special points are marked as filled dots.
FIG. 3 shows the result of detecting special points with the BPDAD algorithm on the two-dimensional Flame data set; the points filled in solid gray are the special points detected by the BPDAD algorithm proposed by the invention.
FIG. 4 shows the result of detecting special points with the BPDAD algorithm on the two-dimensional XOR data set; the points filled in solid gray are the special points detected by the BPDAD algorithm. As can be seen from FIG. 4, the detected special points all lie at the edge of the data set, so the detection effect is good.
FIG. 5 shows the result of detecting special points with the BPDAD algorithm on the two-dimensional Banana data set; the points filled in solid gray are the special points detected by the BPDAD algorithm. As can be seen from FIG. 5, the detected special points all lie at the edge of the data set, so the detection effect is good.
FIG. 6 shows the result of detecting special points with the BPDAD algorithm on the noisy two-dimensional PanelB data set. The points filled in gray are the special points and noise data detected by the BPDAD algorithm. As can be seen from FIG. 6, the BPDAD algorithm not only detects the special points at the edge of the data set but also effectively removes the noise data in the data set.
To show objectively that the BPDAD (boundary-edge Detection with Angle analysis) algorithm proposed by the invention identifies abnormal points and special points more accurately, special point detection was also carried out on the same data with the BEPS (boundary-edge pattern selection) algorithm, the IDD algorithm, and the IDDWhole algorithm, and the effectiveness of the BPDAD algorithm in detecting special points was further verified with the K-Means and SMCE (Sparse Manifold Clustering and Embedding) clustering algorithms. The experimental data are shown in Table 1.
TABLE 1
Method      #REMOVED BOUNDARY POINTS   CLUSTER   P(%)
-           -                          KMEANS    70.22
BEPS        43                         KMEANS    68.89
IDD         0                          KMEANS    70.22
IDDWHOLE    0                          -         53.37
BPDAD       95                         KMEANS    73.49
-           -                          SMCE      70.22
IDD         0                          SMCE      69.10
BEPS        43                         SMCE      62.96
BPDAD       95                         SMCE      63.86
The test data in Table 1 are the high-dimensional Wine data set. In Table 1, "REMOVED BOUNDARY POINTS" is the number of special points detected by the corresponding method, "CLUSTER" is the clustering algorithm used (K-Means or SMCE), and "P(%)" is the clustering precision; the higher the value of P(%), the more effective the special point detection.
It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims (2)

1. An outlier and edge point detection method based on joint density and angle, characterized in that the method comprises the following steps:
Step 1: input a data set $X \in R^{m \times n}$, where $X$ is an $m \times n$ matrix and each column of $X$ represents one data sample; that is, $X$ comprises $n$ samples, each of dimension $m$, with $x_i \in R^m$, $i \in \{1, 2, \ldots, n\}$, where $m$ is the sample dimension and $n$ is the number of samples in the data set, the data set $X$ being text or digital images;
Step 2: set the number $k$ of neighbors required for calculating the joint angle information to $k = \mathrm{floor}(5\log_{10}(n))$, where floor(·) is the function that rounds down; this automatic setting of the number of neighbors takes the distribution characteristics of the data into account;
Step 3: calculate the local density information $\rho = [\rho_1, \rho_2, \ldots, \rho_n]$ of the sample points; for each sample point $x_i$, calculate the corresponding local density value $\rho_i$ according to formula (1):
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \qquad \chi(z) = \begin{cases} 1, & z < 0 \\ 0, & z \ge 0 \end{cases} \tag{1}$$
where $d_{ij}$ is the Euclidean distance between sample point $x_i$ and sample point $x_j$ and $d_c$ is the cutoff distance; formula (1) compares $d_{ij}$ with $d_c$ and counts the number of neighbors of the current sample point $x_i$, which is taken as the local density value $\rho_i$ of $x_i$;
step 4: calculating the joint angle information $\zeta = [\zeta_1, \zeta_2, \dots, \zeta_n]$: for each sample point $x_i$, calculating the corresponding joint angle information value $\zeta_i$, which comprises a first local angle measure and a second local angle measure $\tau_i$; calculating the joint angle information value $\zeta_i$ comprises the following steps:
step 41: data processing: for each sample point $x_i$, $i \in \{1, 2, \dots, n\}$, selecting its k neighbors and applying normalization and decoupling operations, the result being denoted $X_i$, as shown in formula (2):
Figure FDA0002462577060000012
let
Figure FDA0002462577060000013
then
Figure FDA0002462577060000014
that is, the neighbors of $x_i$ are pulled onto a unit circle, i.e. for any
Figure FDA0002462577060000015
the distance to $x_i$ is 1, which eliminates the influence of distance;
step 42: calculating the first local angle measure, whose value for sample point $x_i$ is calculated as follows:
Figure FDA0002462577060000016
where
Figure FDA0002462577060000017
refers to $X_i$ in formula (2), and Var(·) is the variance function;
step 43: calculating the second local angle measure $\tau_i$: for each sample point $x_i$ and its corresponding $X_i$, calculating an approximate normal vector $v_i$ of $X_i$ by the mean method, i.e. $v_i = \mathrm{mean}(X_i)$, and calculating $\tau_i$ according to formula (3) based on the relation between the normal and the angle:
Figure FDA0002462577060000021
step 44: for each sample point $x_i$, $i \in \{1, 2, \dots, n\}$, calculating the joint angle information value $\zeta_i$ by the formula:
Figure FDA0002462577060000022
step 5: judging the special points, comprising the following steps:
step 51: determining the threshold value y:
Figure FDA0002462577060000023
step 52: for sample point $x_i$, if
Figure FDA0002462577060000024
then $x_i$ is judged to be an abnormal point; step 52 is repeated for each sample point $x_i$ to determine whether it is an abnormal point or a normal point;
step 6: verifying the special points as follows:
visually displaying the data set: for a data set $X \in \mathbb{R}^{m \times n}$ with $1 \le m \le 3$, marking and displaying the judged special points in the visual space to check whether they lie on the edge of the data set; for a data set $X \in \mathbb{R}^{m \times n}$ with $m > 3$, reducing the data set X to a visual space by a classical dimension reduction method and then marking and displaying the points.
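For illustration only, steps 2, 41 and 43 of claim 1 can be sketched in Python with NumPy. Because the formula images are not reproduced in this text, the sketch assumes that formula (2) subtracts x_i from each of its k nearest neighbors and scales each difference to unit length, which is consistent with the statement that every neighbor then lies at distance 1 from x_i. It also assumes that mean(X_i) is the column-wise mean of those unit vectors. The function names `neighbor_count`, `unit_neighbors` and `mean_direction` are hypothetical, and the data set is assumed to contain no duplicate points.

```python
import math
import numpy as np

def neighbor_count(n):
    # Step 2: k = floor(5 * log10(n)).
    return math.floor(5 * math.log10(n))

def unit_neighbors(X, i, k):
    """Assumed reading of formula (2): centre the k nearest neighbors of
    column i on x_i and scale each to unit length. Returns shape (m, k)."""
    x_i = X[:, i]
    dist = np.linalg.norm(X - x_i[:, None], axis=0)
    dist[i] = np.inf                     # x_i is not its own neighbor
    idx = np.argsort(dist)[:k]           # indices of the k nearest samples
    U = X[:, idx] - x_i[:, None]         # "decoupling": move x_i to the origin
    # Normalization: every neighbor ends up at distance 1 from x_i.
    # Assumes no neighbor coincides with x_i (no division by zero).
    return U / np.linalg.norm(U, axis=0, keepdims=True)

def mean_direction(U):
    # Step 43: v_i = mean(X_i), taken here as the mean over the k columns.
    return U.mean(axis=1)
```

A point deep inside a cluster has neighbors in all directions, so the unit vectors largely cancel and the mean vector is short; a point on the edge has neighbors mostly on one side, so the mean vector is longer. The angle measures of steps 42 to 44 are not sketched because their formulas are not available here.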
2. The detection method of claim 1, wherein the method of calculating the local density values in step 3 comprises:
step 31: calculating the Euclidean distance between every two samples in the data set X, thereby obtaining for the data set X a distance matrix $D \in \mathbb{R}^{n \times n}$, whose element $d_{ij}$ represents the Euclidean distance between sample point $x_i$ and sample point $x_j$;
step 32: setting the cutoff distance $d_c$: sorting the n × n elements of the distance matrix D in ascending order into a vector sd, and setting $d_c$ to the p-th element of sd, where p = round((n − 1) · n · percent / 100), round(·) is the rounding function, and percent = 0.2;
step 33: according to the relation between $d_{ij}$ and $d_c$, counting the neighbors of the current sample point $x_i$ and taking that count as the local density value $\rho_i$ of $x_i$; by formula (1), when $d_{ij}$ is less than $d_c$, $x_j$ is considered a neighbor of the current sample point $x_i$, and otherwise it is not.
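Steps 31 to 33 can be sketched as follows, again in Python with NumPy and only as an illustration under stated assumptions. The function name `local_density` is hypothetical. The sketch treats p as a 1-based position in sd, clamps it to a valid index, uses Python's built-in `round` for round(·), and does not count x_i as its own neighbor; none of these details is fixed by the claim text.

```python
import numpy as np

def local_density(X, percent=0.2):
    """X has shape (m, n): one sample per column, as in claim 1."""
    n = X.shape[1]
    # Step 31: pairwise Euclidean distance matrix D of shape (n, n).
    diff = X[:, :, None] - X[:, None, :]
    D = np.sqrt((diff ** 2).sum(axis=0))
    # Step 32: sort all n*n entries in ascending order and take the p-th one.
    sd = np.sort(D, axis=None)
    p = int(round((n - 1) * n * percent / 100))
    # Assumption: p is a 1-based position, clamped to the range 1..n*n.
    d_c = sd[min(max(p, 1), sd.size) - 1]
    # Step 33: x_j is a neighbor of x_i when d_ij < d_c (j != i assumed).
    is_neighbor = D < d_c
    np.fill_diagonal(is_neighbor, False)
    rho = is_neighbor.sum(axis=1)
    return rho, d_c, D

# Five 1-D samples; percent=60 is chosen only to keep this tiny example readable.
X = np.array([[0.0, 1.0, 2.0, 3.0, 10.0]])
rho, d_c, D = local_density(X, percent=60)
print(rho, d_c)
```

Read literally, the vector sd also contains the n zero diagonal entries of D. With percent = 0.2, p does not exceed n for data sets of up to roughly 500 samples, so d_c would be 0 and every density would be 0. An implementation may therefore prefer to sort only the off-diagonal distances, which would be a deviation from the claim wording.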

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710548763.3A CN107341514B (en) 2017-07-07 2017-07-07 Abnormal point and edge point detection method based on joint density and angle

Publications (2)

Publication Number Publication Date
CN107341514A CN107341514A (en) 2017-11-10
CN107341514B true CN107341514B (en) 2020-07-21

Family

ID=60219167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710548763.3A Active CN107341514B (en) 2017-07-07 2017-07-07 Abnormal point and edge point detection method based on joint density and angle

Country Status (1)

Country Link
CN (1) CN107341514B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921192B (en) * 2018-05-25 2020-01-21 成都信息工程大学 Abnormal point detection method based on geodesic distance
CN108563217A (en) * 2018-05-29 2018-09-21 济南浪潮高新科技投资发展有限公司 The robust method for detecting abnormality analyzed based on part and global statistics
CN108921202A (en) * 2018-06-12 2018-11-30 成都信息工程大学 A kind of abnormal point detecting method based on data structure
CN110046665A (en) * 2019-04-17 2019-07-23 成都信息工程大学 Based on isolated two abnormal classification point detecting method of forest, information data processing terminal

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318241A (en) * 2014-09-25 2015-01-28 东莞电子科技大学电子信息工程研究院 Local density spectral clustering similarity measurement algorithm based on Self-tuning
CN105930862A (en) * 2016-04-13 2016-09-07 江南大学 Density peak clustering algorithm based on density adaptive distance

Also Published As

Publication number Publication date
CN107341514A (en) 2017-11-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant