CN110929801A - Improved Euclid distance KNN classification method and system - Google Patents

Improved Euclid distance KNN classification method and system

Info

Publication number
CN110929801A
CN110929801A
Authority
CN
China
Prior art keywords
sample
training
projection
class
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911215801.9A
Other languages
Chinese (zh)
Other versions
CN110929801B (en)
Inventor
徐承俊
朱国宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201911215801.9A priority Critical patent/CN110929801B/en
Publication of CN110929801A publication Critical patent/CN110929801A/en
Application granted granted Critical
Publication of CN110929801B publication Critical patent/CN110929801B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classification method and a classification system based on an improved Euclid distance KNN. The method comprises: first obtaining a data set from a database and dividing it into a test set and a training set; setting the neighbor parameter K; calculating a projection vector w according to the LDA (Linear Discriminant Analysis) algorithm; constructing a neighbor graph G(V, E) from the training set; for each data sample x_test in the test set, finding the K neighbors of x_test in the training set from the neighbor graph; and returning the estimated value f̂(x_test) of the data sample x_test and determining the sample class. The invention has the following advantages: (1) it has good noise resistance and overcomes the sensitivity of traditional KNN to noise; (2) it adopts an improved Euclid distance in place of the Euclid distance metric used by traditional KNN, which distinguishes samples better and improves classification accuracy without increasing computational complexity.

Description

Improved Euclid distance KNN classification method and system
Technical Field
The invention relates to the technical field of data classification, in particular to a classification method and system based on an improved Euclid distance KNN.
Background
In the current big data era, data of all kinds are large in scale and wide in range, and need to be classified so as to facilitate further analysis and processing. The KNN algorithm is used to classify data; its basic idea is: find the K nearest neighbors of any given sample to be classified, and then determine its category by a vote over the classification attributes of those K neighbors. The distance metric of the KNN algorithm is mainly the Euclid distance (Euclidean distance) between the sample to be measured and the training samples. The KNN algorithm assumes that all samples correspond to points in the n-dimensional space R^n, and the nearest neighbors of a sample are defined according to the standard Euclid distance. When judging the class, the KNN algorithm depends only on a very small number of adjacent samples: the class is determined mainly by the limited neighboring samples rather than by a class-domain discrimination method, so the KNN algorithm is more suitable than other classification methods for sample sets whose class domains overlap or cross heavily.
The KNN algorithm is a lazy learning method, so it suffers from slow classification and strong dependence on the capacity of the sample library. Moreover, the KNN algorithm adopts the Euclid metric, whose distance calculation is sensitive to noisy features; when the sample data size is large, and especially when the samples contain noise, this easily leads to inaccurate classification and low data-processing efficiency.
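For reference, a minimal sketch of the conventional KNN classifier described above, using the standard Euclid distance and a simple majority vote, might look as follows (an illustrative Python outline, not code from the patent; all names are assumptions):

```python
# Illustrative baseline: conventional KNN with the standard Euclid distance.
import numpy as np
from collections import Counter

def euclid_distance(a, b):
    # Standard Euclidean distance between two feature vectors.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def knn_predict(x_query, X_train, y_train, k=5):
    # Distances from the query sample to every training sample.
    dists = [euclid_distance(x_query, x) for x in X_train]
    # Indices of the K nearest training samples.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the classes of the K neighbors.
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```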
Disclosure of Invention
The invention provides an improved Euclid distance KNN classification method to solve the problem, noted in the background, that the distance metric is sensitive to noisy features.
In order to achieve the above object, the present invention provides a classification method based on the improved Euclid distance KNN, which comprises the following specific steps:
step1, acquiring a data set from the database, and dividing the data set into a test set and a training set;
step2, setting a neighbor parameter K value;
step3, solving a projection vector w of the training set according to the Linear Discriminant Analysis algorithm;
step4, constructing a neighbor graph G(V, E) from the training set, wherein G denotes the neighbor graph, V denotes the nodes, namely each training sample in the training set, and E denotes the edges connecting the training samples;
step5, for each data sample x_test in the test set, finding the K neighbors of x_test in the training set from the neighbor graph;
step6, returning the estimated value f̂(x_test) of the data sample x_test, wherein
f̂(x_test) ← argmax_{v∈V} Σ_{i=1}^{K} δ(v, f(x_i)),
f(x_i) denotes the classification problem function, x_i denotes the i-th training sample, v denotes the class corresponding to a training sample, V = {v_1, v_2, …, v_s} denotes the set of data classes, f̂(x_test) is the final class of the data sample x_test, and δ(a, b) = 1 if a = b, otherwise δ(a, b) = 0.
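A minimal sketch of the Step6 vote (the argmax-over-δ estimator above), assuming the K neighbors have already been found in Step5; the function and variable names are illustrative, not the patent's code:

```python
# Illustrative Step6 vote: f_hat(x_test) = argmax_{v in V} sum_i delta(v, f(x_i)).
def delta(a, b):
    # delta(a, b) = 1 if a == b, otherwise 0.
    return 1 if a == b else 0

def estimate_class(neighbor_labels, classes):
    # neighbor_labels: the classes f(x_i) of the K nearest training samples.
    # classes: the set V = {v_1, ..., v_s} of data categories.
    return max(classes, key=lambda v: sum(delta(v, f_xi) for f_xi in neighbor_labels))
```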
further, Step2 sets K to 1,3,5,7,9,11,13, 15.
Further, the projection vector w in Step3 is calculated as follows.
Taking binary classification as an example, the optimal projection vector w is solved by quantitative analysis:
given N training samples with d-dimensional features {x_1, x_2, …, x_N}, first find the mean value, namely the center point, of each class of training samples, where i = 1, 2:
μ_i = (1/N_i) Σ_{x∈ω_i} x
specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i denotes the mean of the i-th class of training samples;
the projection of a training sample x onto w is computed as y = w^T x, and the mean of the sample points after the training samples are projected onto w is:
m_i = (1/N_i) Σ_{y∈ω_i} y = (1/N_i) Σ_{x∈ω_i} w^T x = w^T μ_i
so the projected mean is the projection of the class center point;
the best straight line is the one that separates the projected center points of the two classes as much as possible, expressed quantitatively as:
J(w) = |m_1 - m_2| = |w^T (μ_1 - μ_2)|
the scatter value of each projected class is:
s_i² = Σ_{y∈ω_i} (y - m_i)²
and the projection vector w is finally measured by the criterion:
J(w) = |m_1 - m_2|² / (s_1² + s_2²)
According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.
Expanding the scatter value formula:
s_i² = Σ_{x∈ω_i} (w^T x - w^T μ_i)² = Σ_{x∈ω_i} w^T (x - μ_i)(x - μ_i)^T w = w^T S_i w
where S_i = Σ_{x∈ω_i} (x - μ_i)(x - μ_i)^T is the scatter matrix;
then let S_W = S_1 + S_2, where S_W is called the within-class scatter matrix, and S_B = (μ_1 - μ_2)(μ_1 - μ_2)^T, where S_B is called the between-class scatter matrix;
J(w) is then expressed as:
J(w) = (w^T S_B w) / (w^T S_W w)
Take the derivative, normalizing the denominator beforehand: let ||w^T S_W w|| = 1, add a Lagrange multiplier, and differentiate to obtain:
S_B w = λ S_W w, i.e. S_W^{-1} S_B w = λ w
so w is an eigenvector of the matrix S_W^{-1} S_B;
in particular, because S_B w = (μ_1 - μ_2)(μ_1 - μ_2)^T w, where the product of the latter two terms, (μ_1 - μ_2)^T w, is a constant denoted λ_w, it follows that
S_W^{-1} S_B w = λ_w S_W^{-1} (μ_1 - μ_2) = λ w
since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides are dropped for simplicity, giving
w = S_W^{-1} (μ_1 - μ_2)
Therefore, only the means and the scatter (variance) of the original training samples are required to calculate the optimal w.
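A minimal sketch of the closed-form result above, w = S_W^{-1}(μ_1 - μ_2), for two classes (an assumed NumPy illustration, not the patent's implementation):

```python
# Illustrative two-class LDA projection vector: w = S_W^{-1}(mu_1 - mu_2).
import numpy as np

def lda_projection_vector(X1, X2):
    # X1, X2: arrays of shape (N1, d) and (N2, d), one training sample per row.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrices S_1 and S_2, summed into S_W.
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    # Optimal projection direction (any positive rescaling is equivalent);
    # a pseudo-inverse could be used if S_W happens to be singular.
    return np.linalg.solve(Sw, mu1 - mu2)
```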
Further, in Step4, the magnitude of an edge in the neighbor graph is determined by the formula
Figure BDA0002299461620000035
wherein x^l denotes the l-th feature vector of a training sample x, x_i and x_j denote the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained above.
Further, m takes the value 5, the feature vectors being the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image.
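The exact improved-Euclid edge weight is published only as an image in the original document; purely as an illustration of how such a weighted neighbor graph could be built from the m projected feature differences and the constant t, one possible (assumed, heat-kernel style) form is sketched below. It is not the patent's formula.

```python
# Hypothetical edge weight for the neighbor graph G(V, E); the patent's exact
# improved-Euclid formula is given as an image, so this heat-kernel form over
# w-projected feature differences is only an assumed illustration.
import numpy as np

def edge_weight(xi_feats, xj_feats, w, t=1.0):
    # xi_feats, xj_feats: lists of m feature vectors of two samples
    # (m = 5 in the embodiment: stroke, contour, intersection points,
    # end points and gray level), each of the same dimension as w.
    s = sum(float(w @ (fi - fj)) ** 2 for fi, fj in zip(xi_feats, xj_feats))
    return float(np.exp(-s / t))
```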
The invention also provides an improved Euclid distance KNN classification system, which comprises the following modules:
a data set acquisition module, used for acquiring a data set from a database and dividing the data set into a test set and a training set;
a parameter setting module, used for setting the neighbor parameter K value;
a projection vector w solving module, used for solving the projection vector w of the training set according to the Linear Discriminant Analysis algorithm;
a neighbor graph constructing module, used for constructing a neighbor graph G(V, E) from the training set, wherein G denotes the neighbor graph, V denotes the nodes, namely each training sample in the training set, and E denotes the edges connecting the training samples;
a K-neighbor search module, used for finding, for each data sample x_test in the test set, the K neighbors of x_test in the training set;
a sample class determination module, used for returning the estimated value f̂(x_test) of the data sample x_test, wherein
f̂(x_test) ← argmax_{v∈V} Σ_{i=1}^{K} δ(v, f(x_i)),
f(x_i) denotes the classification problem function, x_i denotes the i-th training sample, v denotes the class corresponding to a training sample, V denotes the set of data classes, f̂(x_test) is the final class of the data sample x_test, and δ(a, b) = 1 if a = b, otherwise δ(a, b) = 0.
further, the setting K in the parameter setting module is 1,3,5,7,9,11,13, 15.
Furthermore, the projection vector w in the projection vector w solving module is calculated as follows.
Taking binary classification as an example, the optimal projection vector w is solved by quantitative analysis:
given N training samples with d-dimensional features {x_1, x_2, …, x_N}, first find the mean value, namely the center point, of each class of training samples, where i = 1, 2:
μ_i = (1/N_i) Σ_{x∈ω_i} x
specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i denotes the mean of the i-th class of training samples;
the projection of a training sample x onto w is computed as y = w^T x, and the mean of the sample points after the training samples are projected onto w is:
m_i = (1/N_i) Σ_{y∈ω_i} y = (1/N_i) Σ_{x∈ω_i} w^T x = w^T μ_i
so the projected mean is the projection of the class center point;
the best straight line is the one that separates the projected center points of the two classes as much as possible, expressed quantitatively as:
J(w) = |m_1 - m_2| = |w^T (μ_1 - μ_2)|
the scatter value of each projected class is:
s_i² = Σ_{y∈ω_i} (y - m_i)²
and the projection vector w is finally measured by the criterion:
J(w) = |m_1 - m_2|² / (s_1² + s_2²)
According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.
Expanding the scatter value formula:
s_i² = Σ_{x∈ω_i} (w^T x - w^T μ_i)² = Σ_{x∈ω_i} w^T (x - μ_i)(x - μ_i)^T w = w^T S_i w
where S_i = Σ_{x∈ω_i} (x - μ_i)(x - μ_i)^T is the scatter matrix;
then let S_W = S_1 + S_2, where S_W is called the within-class scatter matrix, and S_B = (μ_1 - μ_2)(μ_1 - μ_2)^T, where S_B is called the between-class scatter matrix;
J(w) is then expressed as:
J(w) = (w^T S_B w) / (w^T S_W w)
Take the derivative, normalizing the denominator beforehand: let ||w^T S_W w|| = 1, add a Lagrange multiplier, and differentiate to obtain:
S_B w = λ S_W w, i.e. S_W^{-1} S_B w = λ w
so w is an eigenvector of the matrix S_W^{-1} S_B;
in particular, because S_B w = (μ_1 - μ_2)(μ_1 - μ_2)^T w, where the product of the latter two terms, (μ_1 - μ_2)^T w, is a constant denoted λ_w, it follows that
S_W^{-1} S_B w = λ_w S_W^{-1} (μ_1 - μ_2) = λ w
since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides are dropped for simplicity, giving
w = S_W^{-1} (μ_1 - μ_2)
Therefore, only the means and the scatter (variance) of the original training samples are required to calculate the optimal w.
Further, in the neighbor graph constructing module, the magnitude of an edge in the neighbor graph is determined by the formula
Figure BDA0002299461620000055
wherein x^l denotes the l-th feature vector of a training sample x, x_i and x_j denote the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained above.
Further, m takes the value 5, the feature vectors being the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image.
Compared with the prior art, the invention has the following beneficial effects. The invention provides an improved Euclid distance KNN classification method: the neighbor parameter K is set in advance, the projection vector w is calculated according to the LDA (Linear Discriminant Analysis) algorithm, and the training data set is constructed into a neighbor graph G(V, E), where G denotes the neighbor graph, V denotes the nodes, namely each data sample, and E denotes the edges connecting the data samples, the magnitude of an edge being given by the formula:
Figure BDA0002299461620000056
where x^l denotes the value of the l-th feature of sample x, x_i and x_j denote the i-th and j-th samples respectively, t denotes an arbitrary constant, and w denotes the projection vector. For each data sample x_test in the test set, its K neighbors in the training set are found, and the return value f̂(x_test) of the KNN algorithm is an estimate of the class of the data sample x_test, namely the class is judged by the most common f value among the K training samples nearest to x_test. Since the traditional KNN algorithm adopts the Euclid metric, whose distance calculation is sensitive to noisy features, the method replaces the traditional Euclid distance with the improved Euclid distance to improve the KNN algorithm. The method has the good distinguishability, noise immunity and robustness of the LDA projection vector, can distinguish and classify multidimensional data well, maintains high resolution and good computational performance, and can serve as a reference for similar KNN research.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the present invention is further described below with reference to the accompanying drawings and the embodiments.
FIG. 1 is a simplified flow chart of the classification method based on the improved Euclid distance KNN according to the present invention;
FIG. 2 is a schematic view of a training sample of the present invention projected onto a straight line;
FIG. 3 is a schematic view of a sample center projection of the present invention;
FIG. 4 is a diagram illustrating the present invention using LDA to solve the optimal projection vector w;
FIG. 5 is a graphical representation of the classification performance of the USPS data set in accordance with the present invention;
fig. 6 is a diagram illustrating the classification performance of the MNIST data set according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The detailed description of the embodiments of the present invention generally described and illustrated in the figures herein is not intended to limit the scope of the invention, which is claimed, but is merely representative of selected embodiments of the invention.
It should be noted that: like reference symbols in the following drawings indicate like items, and thus, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
Referring to fig. 1, fig. 1 is a simplified flow chart of the classification method based on the improved Euclid distance KNN according to the present invention. The embodiment is particularly applicable to the classification of data, and the embodiment of the invention is executed in a Lie group machine learning development environment.
Step1, the present embodiment downloads the USPS data set over the network; the data set includes 10 categories, namely the digits 0-9, and contains 20,000 pictures in total, each picture being a 32 × 32 image (unit: pixel). The MNIST data set is also downloaded from the network; it includes 10 categories, namely the digits 0-9, and contains 70,000 pictures in total, each picture being a 28 × 28 image (unit: pixel). The classification test is carried out on both data sets, and each is divided into a training data set and a test data set by a program written in the matlab language.
It should be noted that the picture data in this embodiment has the following advantages: (1) the data volume is large and the categories are many, which is necessary for Lie group machine learning; (2) the sample images are diverse: the standard data sets used in this embodiment cover a wide variety of handwriting, and the images in the data sets are strictly screened for different angles, illumination and definition, so that the observation angles and other aspects of the images of each category differ considerably.
Step2, setting the neighbor parameter K value, wherein K in the method takes the values 1, 3, 5, 7, 9, 11, 13 and 15;
Step3, calculating the projection vector w of the training set according to the LDA (Linear Discriminant Analysis) algorithm;
given N training samples with d-dimensional features {x_1, x_2, …, x_N}, of which N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2.
The d-dimensional features are reduced in dimension while ensuring that the feature information of the data is not lost, i.e. the class of each sample can still be determined after the dimension reduction. The optimal vector is denoted w (d-dimensional), and the projection of a training sample x (d-dimensional) onto w can be computed as y = w^T x.
For simplicity and ease of understanding, the invention first considers the case where the training sample x is two-dimensional. Intuitively, as shown in fig. 2, the circular and triangular markers represent two different classes of training samples; the training sample x is two-dimensional and contains two feature values, x1 and x2. The straight line obtained is one capable of separating the two classes of training samples: the straight line y = w^T x in fig. 2 separates training samples of different classes well. This is in fact the idea of LDA: maximize the between-class variance and minimize the within-class variance, i.e. reduce the differences within each class and broaden the differences between different classes.
The specific process of the quantitative analysis to find the optimal w is described below.
First, find the mean (center point) of each class of training samples, where i takes only two values (i = 1, 2):
μ_i = (1/N_i) Σ_{x∈ω_i} x
Specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i denotes the mean of the i-th class of training samples.
The mean of the sample points after projecting x onto w is given by:
m_i = (1/N_i) Σ_{y∈ω_i} y = (1/N_i) Σ_{x∈ω_i} w^T x = w^T μ_i
The meaning of each symbol is the same as above; therefore, the projected mean is the projection of the center point of the training samples.
The best straight line is the one that separates the projected center points of the two classes as much as possible, expressed quantitatively as:
J(w) = |m_1 - m_2| = |w^T (μ_1 - μ_2)|
and the larger J(w), the better.
In practical applications, however, a large J(w) alone is not sufficient. As shown in FIG. 3, when the sample points are distributed uniformly within ellipses, projecting onto the horizontal axis x1 gives a larger center-point separation J(w), but the sample points overlap on that axis and cannot be separated; projecting onto the vertical axis x2 gives a smaller J(w), yet the sample points can be separated. Therefore, the variance between sample points must also be considered: the larger the variance, the more difficult it is to separate the sample points.
The dispersion of each projected class is measured using another quantity, called the scatter value, specifically:
s_i² = Σ_{y∈ω_i} (y - m_i)²
The geometric meaning of the scatter value is the density of the sample points: the larger the value, the more dispersed the points, and vice versa the more concentrated.
In the present invention, different classes of sample points need to be separated as well as possible while samples of the same class are gathered as tightly as possible, i.e. the larger the difference between the means and the smaller the scatter values, the better. This is measured jointly using J(w) and s, with the criterion:
J(w) = |m_1 - m_2|² / (s_1² + s_2²)
according to the above formula, it is necessary to find w that maximizes J (w).
Expanding the hash value formula:
Figure BDA0002299461620000083
wherein order
Figure BDA0002299461620000084
I.e. a hash matrix.
Then, let Sw=S1+S2,SwCalled the Within-class dispersion degree matrix (Within-class scatter matrix). SB=(μ12)(μ12)T,SBReferred to as the inter-class dispersion matrix (Between-class scanner matrix).
J (w) is finally expressed as:
Figure BDA0002299461620000085
and (4) carrying out derivation on the derivative, carrying out normalization processing on the denominator before derivation, and if the normalization processing is not carried out, w is expanded by any multiple, and the formula is established, the w cannot be determined. Therefore, in the present invention, let | | wTSWW | | | 1, after adding lagrange multiplier, the derivation:
Figure BDA0002299461620000086
it follows that w is a matrix
Figure BDA0002299461620000087
The feature vector of (2).
In particular, because of SBw=(μ12)(μ12)Tw, where the product of the latter two terms is a constant, denoted λwThen, then
Figure BDA0002299461620000088
Since any expansion or reduction of w by a factor does not affect the result, the unknown constants λ, λ on both sides can be reduced for simplicitywTo obtain
Figure BDA0002299461620000089
We need only find the mean and equation of the original samples to find the best w, as shown in fig. 4.
The above conclusions, although coming from 2 dimensions, are true for multi-dimensions as well. The feature vector segmentation performance corresponding to the large feature value is the best.
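As a sketch of the multi-dimensional remark above, w can be taken as the eigenvector of S_W^{-1} S_B with the largest eigenvalue (an assumed NumPy illustration; in the two-class case it reduces to the closed form already given):

```python
# Illustrative: w as the dominant eigenvector of S_W^{-1} S_B.
import numpy as np

def lda_direction(Sw, Sb):
    # Sw: within-class scatter matrix, Sb: between-class scatter matrix (both d x d).
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    # Eigenvector corresponding to the largest (real part of the) eigenvalue.
    return np.real(eigvecs[:, np.argmax(np.real(eigvals))])
```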
Step4, constructing a neighbor graph G (V, E) according to the training set;
constructing a neighbor graph G (V, E) according to the training set, wherein G represents the neighbor graph, V represents a node, namely each data sample, E represents an edge connecting each data sample, and the size of the edge is specifically represented by a formula:
Figure BDA0002299461620000091
wherein x^l denotes the l-th feature vector of a training sample x; m refers to the number of feature vectors, and its value depends on the chosen data set; the feature vectors are mainly the five features of an image, namely strokes, contours, intersection points, end points and gray levels, whose computation is prior art and is not described in the invention; x_i and x_j denote the i-th and j-th samples respectively, t denotes an arbitrary constant, and w denotes the projection vector.
Step5, for each data sample x in the test settextFinding data sample x from neighbor maptextK neighbors in the training set;
step6, return data sample xtextIs estimated value of
Figure BDA0002299461620000092
And the determination of the sample class is made.
The present invention addresses the case where the objective function takes discrete values (a classification problem), i.e. the classification problem function can be described as f: R^n → V, where V = {v_1, v_2, …, v_s} denotes the set of data categories, corresponding to s classes. The estimate f̂(x_test) of the KNN algorithm is an estimate of the class of the data sample x_test, namely the most common value of f among the K training samples nearest to x_test:
f̂(x_test) ← argmax_{v∈V} Σ_{i=1}^{K} δ(v, f(x_i))
wherein f̂(x_test) is the final class of the data sample x_test, f(x_i) denotes the classification problem function, x_i denotes the i-th training sample, v denotes the class corresponding to a training sample, and δ(a, b) = 1 if a = b, otherwise δ(a, b) = 0.
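Putting the pieces together, a hedged end-to-end sketch of classifying one test sample, reusing the illustrative edge_weight and estimate_class helpers sketched earlier (all names are assumptions, not the patent's code):

```python
# Illustrative end-to-end classification of one test sample x_test.
import numpy as np

def classify(x_test_feats, train_feats, train_labels, w, classes, k=5, t=1.0):
    # Edge weights from x_test to every training sample (larger = closer).
    weights = [edge_weight(x_test_feats, f, w, t) for f in train_feats]
    # The K neighbors are the K largest edge weights in the neighbor graph.
    nearest = np.argsort(weights)[-k:]
    # Step6 vote over the neighbors' classes.
    return estimate_class([train_labels[i] for i in nearest], classes)
```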
table 1 shows the comparison of classification performance of the inventive method with the conventional KNN classification method on USPS datasets. As can be seen from the table, the classification accuracy of the method is obviously higher than that of the traditional KNN classification method.
TABLE 1 comparison of classification Performance of the inventive method with other methods on USPS datasets
Figure BDA0002299461620000101
Table 2 shows the comparison of the classification performance of the inventive method with the conventional KNN classification method on the MNIST dataset. As can be seen from the table, the classification accuracy of the method is obviously higher than that of the traditional KNN classification method.
TABLE 2 comparison of classification performance of the method of the present invention on MNIST datasets with other methods
Figure BDA0002299461620000102
With reference to fig. 5 and fig. 6, fig. 5 is a classification performance diagram on the USPS data set according to the embodiment of the present invention, and fig. 6 is a classification performance diagram on the MNIST data set. On the USPS data set (fig. 5) the average classification accuracy of the proposed method is 96%, while that of traditional KNN is 72%, i.e. the proposed method is 24 percentage points higher; on the MNIST data set (fig. 6) the average classification accuracy of the proposed method is 95%, while that of traditional KNN is 88%, i.e. the proposed method is 7 percentage points higher. The statistical results show that the method of the invention is clearly superior to the traditional KNN method and has strong practicability.
The invention also provides an improved Euclid distance KNN classification system, which comprises the following modules:
a data set acquisition module, used for acquiring a data set from a database and dividing the data set into a test set and a training set;
a parameter setting module, used for setting the neighbor parameter K value;
a projection vector w solving module, used for solving the projection vector w of the training set according to the Linear Discriminant Analysis algorithm;
a neighbor graph constructing module, used for constructing a neighbor graph G(V, E) from the training set, wherein G denotes the neighbor graph, V denotes the nodes, namely each training sample in the training set, and E denotes the edges connecting the training samples;
a K-neighbor search module, used for finding, for each data sample x_test in the test set, the K neighbors of x_test in the training set from the neighbor graph;
a sample class determination module, used for returning the estimated value f̂(x_test) of the data sample x_test, wherein
f̂(x_test) ← argmax_{v∈V} Σ_{i=1}^{K} δ(v, f(x_i)),
f(x_i) denotes the classification problem function, x_i denotes the i-th training sample, v denotes the class corresponding to a training sample, V = {v_1, v_2, …, v_s} denotes the set of data classes, f̂(x_test) is the final class of the data sample x_test, and δ(a, b) = 1 if a = b, otherwise δ(a, b) = 0.
and setting K in the parameter setting module to be 1,3,5,7,9,11,13 and 15.
Wherein, the calculation mode of the projection vector w in the projection vector w solving module is as follows,
Taking binary classification as an example, the optimal projection vector w is solved by quantitative analysis:
given N training samples with d-dimensional features {x_1, x_2, …, x_N}, first find the mean value, namely the center point, of each class of training samples, where i = 1, 2:
μ_i = (1/N_i) Σ_{x∈ω_i} x
specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i denotes the mean of the i-th class of training samples;
the projection of a training sample x onto w is computed as y = w^T x, and the mean of the sample points after the training samples are projected onto w is:
m_i = (1/N_i) Σ_{y∈ω_i} y = (1/N_i) Σ_{x∈ω_i} w^T x = w^T μ_i
so the projected mean is the projection of the class center point;
the best straight line is the one that separates the projected center points of the two classes as much as possible, expressed quantitatively as:
J(w) = |m_1 - m_2| = |w^T (μ_1 - μ_2)|
the scatter value of each projected class is:
s_i² = Σ_{y∈ω_i} (y - m_i)²
and the projection vector w is finally measured by the criterion:
J(w) = |m_1 - m_2|² / (s_1² + s_2²)
According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.
Expanding the scatter value formula:
s_i² = Σ_{x∈ω_i} (w^T x - w^T μ_i)² = Σ_{x∈ω_i} w^T (x - μ_i)(x - μ_i)^T w = w^T S_i w
where S_i = Σ_{x∈ω_i} (x - μ_i)(x - μ_i)^T is the scatter matrix;
then let S_W = S_1 + S_2, where S_W is called the within-class scatter matrix, and S_B = (μ_1 - μ_2)(μ_1 - μ_2)^T, where S_B is called the between-class scatter matrix;
J(w) is then expressed as:
J(w) = (w^T S_B w) / (w^T S_W w)
Take the derivative, normalizing the denominator beforehand: let ||w^T S_W w|| = 1, add a Lagrange multiplier, and differentiate to obtain:
S_B w = λ S_W w, i.e. S_W^{-1} S_B w = λ w
so w is an eigenvector of the matrix S_W^{-1} S_B;
in particular, because S_B w = (μ_1 - μ_2)(μ_1 - μ_2)^T w, where the product of the latter two terms, (μ_1 - μ_2)^T w, is a constant denoted λ_w, it follows that
S_W^{-1} S_B w = λ_w S_W^{-1} (μ_1 - μ_2) = λ w
since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides are dropped for simplicity, giving
w = S_W^{-1} (μ_1 - μ_2)
Therefore, only the means and the scatter (variance) of the original training samples are required to calculate the optimal w.
In the neighbor graph constructing module, the magnitude of an edge in the neighbor graph is determined by the formula
Figure BDA00022994616200001211
wherein x^l denotes the l-th feature vector of a training sample x, x_i and x_j denote the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained above.
The specific implementation of each module corresponds to the respective method step and is not repeated here.
The above description is only a part of the embodiments of the present invention, and is not intended to limit the present invention, and it will be apparent to those skilled in the art that various modifications can be made in the present invention. Any changes, equivalent substitutions or improvements made within the spirit and principle of the present invention should be included within the scope of the present invention. Note that like reference numerals and letters denote like items in the following drawings. Thus, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.

Claims (10)

1. A classification method based on an improved Euclid distance KNN, characterized by comprising the following steps:
step1, acquiring a data set from the database, and dividing the data set into a test set and a training set;
step2, setting a neighbor parameter K value;
step3, solving a projection vector w of the training set according to the Linear Discriminant Analysis algorithm;
step4, constructing a neighbor graph G(V, E) from the training set, wherein G denotes the neighbor graph, V denotes the nodes, namely each training sample in the training set, and E denotes the edges connecting the training samples;
step5, for each data sample x_test in the test set, finding the K neighbors of x_test in the training set from the neighbor graph;
step6, returning the estimated value f̂(x_test) of the data sample x_test, wherein
f̂(x_test) ← argmax_{v∈V} Σ_{i=1}^{K} δ(v, f(x_i)),
f(x_i) denotes the classification problem function, x_i denotes the i-th training sample, v denotes the class corresponding to a training sample, V denotes the set of data classes, f̂(x_test) is the final class of the data sample x_test, and δ(a, b) = 1 if a = b, otherwise δ(a, b) = 0.
2. the improved Euclid distance KNN classification method according to claim 1, characterized in that: step2 sets K to 1,3,5,7,9,11,13, 15.
3. The improved Euclid distance KNN classification method according to claim 1, characterized in that: the projection vector w in Step3 is calculated as follows,
taking binary classification as an example, the optimal projection vector w is solved by quantitative analysis:
given N training samples with d-dimensional features {x_1, x_2, …, x_N}, first find the mean value, namely the center point, of each class of training samples, where i = 1, 2:
μ_i = (1/N_i) Σ_{x∈ω_i} x
specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i denotes the mean of the i-th class of training samples;
the projection of a training sample x onto w is computed as y = w^T x, and the mean of the sample points after the training samples are projected onto w is:
m_i = (1/N_i) Σ_{y∈ω_i} y = (1/N_i) Σ_{x∈ω_i} w^T x = w^T μ_i
so the projected mean is the projection of the class center point;
the best straight line is the one that separates the projected center points of the two classes as much as possible, expressed quantitatively as:
J(w) = |m_1 - m_2| = |w^T (μ_1 - μ_2)|
the scatter value of each projected class is:
s_i² = Σ_{y∈ω_i} (y - m_i)²
and the projection vector w is finally measured by the criterion:
J(w) = |m_1 - m_2|² / (s_1² + s_2²)
according to the above formula, it suffices to find the w that maximizes J(w), which is solved as follows:
expanding the scatter value formula:
s_i² = Σ_{x∈ω_i} (w^T x - w^T μ_i)² = Σ_{x∈ω_i} w^T (x - μ_i)(x - μ_i)^T w = w^T S_i w
where S_i = Σ_{x∈ω_i} (x - μ_i)(x - μ_i)^T is the scatter matrix;
then let S_W = S_1 + S_2, where S_W is called the within-class scatter matrix, and S_B = (μ_1 - μ_2)(μ_1 - μ_2)^T, where S_B is called the between-class scatter matrix;
J(w) is then expressed as:
J(w) = (w^T S_B w) / (w^T S_W w)
taking the derivative and normalizing the denominator beforehand, i.e. letting ||w^T S_W w|| = 1, adding a Lagrange multiplier and differentiating gives:
S_B w = λ S_W w, i.e. S_W^{-1} S_B w = λ w
so w is an eigenvector of the matrix S_W^{-1} S_B;
in particular, because S_B w = (μ_1 - μ_2)(μ_1 - μ_2)^T w, where the product of the latter two terms, (μ_1 - μ_2)^T w, is a constant denoted λ_w, it follows that
S_W^{-1} S_B w = λ_w S_W^{-1} (μ_1 - μ_2) = λ w
since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides are dropped for simplicity, giving
w = S_W^{-1} (μ_1 - μ_2)
therefore, only the means and the scatter (variance) of the original training samples are required to calculate the optimal w.
4. The improved Euclid distance KNN classification method according to claim 1, characterized in that: in Step4, the magnitude of an edge in the neighbor graph is determined by the formula
Figure FDA00022994616100000211
wherein x^l denotes the l-th feature vector of a training sample x, x_i and x_j denote the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained above.
5. The improved Euclid distance KNN classification method according to claim 4, characterized in that: m takes the value 5, and the feature vectors are the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image.
6. An improved Euclid distance KNN classification system, characterized by comprising the following modules:
a data set acquisition module, used for acquiring a data set from a database and dividing the data set into a test set and a training set;
a parameter setting module, used for setting the neighbor parameter K value;
a projection vector w solving module, used for solving the projection vector w of the training set according to the Linear Discriminant Analysis algorithm;
a neighbor graph constructing module, used for constructing a neighbor graph G(V, E) from the training set, wherein G denotes the neighbor graph, V denotes the nodes, namely each training sample in the training set, and E denotes the edges connecting the training samples;
a K-neighbor search module, used for finding, for each data sample x_test in the test set, the K neighbors of x_test in the training set from the neighbor graph;
a sample class determination module, used for returning the estimated value f̂(x_test) of the data sample x_test, wherein
f̂(x_test) ← argmax_{v∈V} Σ_{i=1}^{K} δ(v, f(x_i)),
f(x_i) denotes the classification problem function, x_i denotes the i-th training sample, v denotes the class corresponding to a training sample, V denotes the set of data classes, f̂(x_test) is the final class of the data sample x_test, and δ(a, b) = 1 if a = b, otherwise δ(a, b) = 0.
7. the improved Euclid distance KNN classification system according to claim 6, characterized in that: and setting K in the parameter setting module to be 1,3,5,7,9,11,13 and 15.
8. The improved Euclid distance KNN classification system according to claim 6, characterized in that: the projection vector w in the projection vector w solving module is calculated as follows,
taking binary classification as an example, the optimal projection vector w is solved by quantitative analysis:
given N training samples with d-dimensional features {x_1, x_2, …, x_N}, first find the mean value, namely the center point, of each class of training samples, where i = 1, 2:
μ_i = (1/N_i) Σ_{x∈ω_i} x
specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i denotes the mean of the i-th class of training samples;
the projection of a training sample x onto w is computed as y = w^T x, and the mean of the sample points after the training samples are projected onto w is:
m_i = (1/N_i) Σ_{y∈ω_i} y = (1/N_i) Σ_{x∈ω_i} w^T x = w^T μ_i
so the projected mean is the projection of the class center point;
the best straight line is the one that separates the projected center points of the two classes as much as possible, expressed quantitatively as:
J(w) = |m_1 - m_2| = |w^T (μ_1 - μ_2)|
the scatter value of each projected class is:
s_i² = Σ_{y∈ω_i} (y - m_i)²
and the projection vector w is finally measured by the criterion:
J(w) = |m_1 - m_2|² / (s_1² + s_2²)
according to the above formula, it suffices to find the w that maximizes J(w), which is solved as follows:
expanding the scatter value formula:
s_i² = Σ_{x∈ω_i} (w^T x - w^T μ_i)² = Σ_{x∈ω_i} w^T (x - μ_i)(x - μ_i)^T w = w^T S_i w
where S_i = Σ_{x∈ω_i} (x - μ_i)(x - μ_i)^T is the scatter matrix;
then let S_W = S_1 + S_2, where S_W is called the within-class scatter matrix, and S_B = (μ_1 - μ_2)(μ_1 - μ_2)^T, where S_B is called the between-class scatter matrix;
J(w) is then expressed as:
J(w) = (w^T S_B w) / (w^T S_W w)
taking the derivative and normalizing the denominator beforehand, i.e. letting ||w^T S_W w|| = 1, adding a Lagrange multiplier and differentiating gives:
S_B w = λ S_W w, i.e. S_W^{-1} S_B w = λ w
so w is an eigenvector of the matrix S_W^{-1} S_B;
in particular, because S_B w = (μ_1 - μ_2)(μ_1 - μ_2)^T w, where the product of the latter two terms, (μ_1 - μ_2)^T w, is a constant denoted λ_w, it follows that
S_W^{-1} S_B w = λ_w S_W^{-1} (μ_1 - μ_2) = λ w
since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides are dropped for simplicity, giving
w = S_W^{-1} (μ_1 - μ_2)
therefore, only the means and the scatter (variance) of the original training samples are required to calculate the optimal w.
9. The improved Euclid distance KNN classification system according to claim 6, characterized in that: in the neighbor graph constructing module, the magnitude of an edge in the neighbor graph is determined by the formula
Figure FDA0002299461610000049
wherein x^l denotes the l-th feature vector of a training sample x, x_i and x_j denote the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained above.
10. The improved Euclid distance KNN classification system according to claim 9, characterized in that: m takes the value 5, and the feature vectors are the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image.
CN201911215801.9A 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system Active CN110929801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911215801.9A CN110929801B (en) 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911215801.9A CN110929801B (en) 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system

Publications (2)

Publication Number Publication Date
CN110929801A true CN110929801A (en) 2020-03-27
CN110929801B CN110929801B (en) 2022-05-13

Family

ID=69848393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911215801.9A Active CN110929801B (en) 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system

Country Status (1)

Country Link
CN (1) CN110929801B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613184A (en) * 2020-12-29 2021-04-06 煤炭科学研究总院 Artificial intelligence method for judging distance of side slope collapse rockfall during earthquake occurrence
CN113162926A (en) * 2021-04-19 2021-07-23 西安石油大学 KNN-based network attack detection attribute weight analysis method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20030051554A (en) * 2003-06-07 2003-06-25 전명근 Face Recognition using fuzzy membership value
CN101673348A (en) * 2009-10-20 2010-03-17 哈尔滨工程大学 Human face recognition method based on supervision isometric projection
CN102073799A (en) * 2011-01-28 2011-05-25 重庆大学 Tumor gene identification method based on gene expression profile
CN102208020A (en) * 2011-07-16 2011-10-05 西安电子科技大学 Human face recognition method based on optimal dimension scale cutting criterion
CN103679207A (en) * 2014-01-02 2014-03-26 苏州大学 Handwriting number identification method and system
CN103854645A (en) * 2014-03-05 2014-06-11 东南大学 Speech emotion recognition method based on punishment of speaker and independent of speaker
CN107045621A (en) * 2016-10-28 2017-08-15 北京联合大学 Facial expression recognizing method based on LBP and LDA
CN107463920A (en) * 2017-08-21 2017-12-12 吉林大学 A kind of face identification method for eliminating partial occlusion thing and influenceing

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AMAN SINGH,BABITA PANDEY: "An Euclidean Distance based KNN Computational Method for Assessing Degree of Liver Damage", 《2016 INTERNATIONAL CONFERENCE ON INVENTIVE COMPUTATION TECHNOLOGIES(ICICT)》 *
PRIYA SAHA ET AL.: "Expressions Recognition of North-East Indian (NEI) Faces", 《SPRINGER SCIENCE+BUSINESS MEDIA NEW YORK 2015》 *
唐晓培, 李力争: "基于核主量和线性鉴别分析的人脸识别算法研究" (Research on a face recognition algorithm based on kernel principal components and linear discriminant analysis), 《微型机与应用》 (Microcomputer & Its Applications) *
苟建平: "模式分类的k-近邻方法" (k-nearest neighbor methods for pattern classification), 《中国优秀博硕士学位论文全文数据库(博士) 信息科技辑》 (China Doctoral Dissertations Full-text Database, Information Science and Technology) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613184A (en) * 2020-12-29 2021-04-06 煤炭科学研究总院 Artificial intelligence method for judging distance of side slope collapse rockfall during earthquake occurrence
CN113162926A (en) * 2021-04-19 2021-07-23 西安石油大学 KNN-based network attack detection attribute weight analysis method
CN113162926B (en) * 2021-04-19 2022-08-26 西安石油大学 KNN-based network attack detection attribute weight analysis method

Also Published As

Publication number Publication date
CN110929801B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
US11416710B2 (en) Feature representation device, feature representation method, and program
CN107368807B (en) Monitoring video vehicle type classification method based on visual word bag model
US8718380B2 (en) Representing object shapes using radial basis function support vector machine classification
Niu et al. Meta-metric for saliency detection evaluation metrics based on application preference
CN110942091B (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
JP4376145B2 (en) Image classification learning processing system and image identification processing system
EP2948877A1 (en) Content based image retrieval
Ling et al. How many clusters? A robust PSO-based local density model
CN107633065B (en) Identification method based on hand-drawn sketch
CN110008844B (en) KCF long-term gesture tracking method fused with SLIC algorithm
CN110929801B (en) Improved Euclid distance KNN classification method and system
Duin et al. Mode seeking clustering by KNN and mean shift evaluated
CN111259808A (en) Detection and identification method of traffic identification based on improved SSD algorithm
CN111738319B (en) Clustering result evaluation method and device based on large-scale samples
Chebbout et al. Comparative study of clustering based colour image segmentation techniques
CN111027609B (en) Image data weighted classification method and system
CN115147632A (en) Image category automatic labeling method and device based on density peak value clustering algorithm
CN113378620A (en) Cross-camera pedestrian re-identification method in surveillance video noise environment
CN114202694A (en) Small sample remote sensing scene image classification method based on manifold mixed interpolation and contrast learning
Wang et al. Image matching via the local neighborhood for low inlier ratio
Bakheet et al. Content-based image retrieval using BRISK and SURF as bag-of-visual-words for naïve Bayes classifier
CN113283469A (en) Graph embedding unsupervised feature learning method for three-dimensional model retrieval based on view
García-Ordás et al. Evaluation of different metrics for shape based image retrieval using a new contour points descriptor
Baruque et al. WeVoS scale invariant map
Dawood et al. Combining the contrast information with LPQ for texture classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant