CN115131610A - Robust semi-supervised image classification method based on data mining - Google Patents

Robust semi-supervised image classification method based on data mining

Info

Publication number
CN115131610A
CN115131610A (application CN202210718517.9A)
Authority
CN
China
Prior art keywords
data
matrix
sample
membership
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210718517.9A
Other languages
Chinese (zh)
Other versions
CN115131610B (en)
Inventor
王靖宇
陈城
聂飞平
李学龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210718517.9A priority Critical patent/CN115131610B/en
Publication of CN115131610A publication Critical patent/CN115131610A/en
Application granted granted Critical
Publication of CN115131610B publication Critical patent/CN115131610B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a robust semi-supervised image classification method based on data mining. A data set containing n images of a×b pixels is stretched into an image data matrix, and on the basis of the dimensionality-reduced original data matrix X a robust semi-supervised image classification model based on data mining is proposed. The constructed objective function is optimized by alternating iteration to train the model, yielding a classifier W and the membership F of the unlabeled data. The trained W is substituted into the prediction formula [formula image] to obtain the membership of each sample in the test set to each class; the column index of a sample's largest membership value is the class to which the sample belongs, which completes the classification of the test set data. The invention makes full use of the data and obtains results that better match reality, and it markedly improves the efficiency of image data processing, so it has strong practicability in practical engineering applications. Moreover, because the class of a sample is represented by membership degrees, the method is little affected by boundary points and is highly robust.

Description

Robust semi-supervised image classification method based on data mining
Technical Field
The invention belongs to the field of image classification and pattern recognition, and relates to a robust semi-supervised image classification method based on data mining.
Background
In most data mining applications, massive amounts of data are easy to obtain, but the labels of the data must be annotated manually and are therefore difficult to obtain. Data labeling is a tedious task that consumes a great deal of time and money. In this situation, making full use of the abundant unlabeled data is important. Semi-supervised learning, which uses both labeled and unlabeled data to learn a predictive model, is exactly the learning paradigm suited to such situations. Semi-supervised learning models fall into two types: transductive models and inductive models. Transductive semi-supervised learning methods learn the labels of unlabeled data by propagating labels from the labeled data to the unlabeled data. The disadvantage of such methods is that they cannot be used for out-of-sample testing, since new test data are not part of the unlabeled data. Therefore, when new data need to be annotated, a transductive semi-supervised learning method has to merge the new test data into the existing data and then rebuild the whole model on the merged data, which is very inefficient for testing out-of-sample data. Inductive semi-supervised learning methods use labeled and unlabeled data to learn a classifier, and the learned classifier can be applied both to the unlabeled data and to new out-of-sample test data. Because of the convenience of out-of-sample testing, inductive semi-supervised learning methods are attractive in practice.
Wang et al. (Semi-supervised classification algorithm based on smooth representation [J]. Computer Science, 2021, 48(03): 124-129) proposed a semi-supervised classification algorithm based on smooth representation. The method mines the information implicit in the data by constructing a graph, applies a low-pass filter to smooth the data, and finally uses the structural information of the graph to classify the unlabeled samples. Although the algorithm accounts for performing graph construction and label propagation synchronously and for the adverse effect of high-frequency data information on classification, the model must build a graph and mine the information implicit in the data, so its time complexity is high, it runs slowly, and it is difficult to apply in engineering practice.
Currently, in the field of image classification, the high cost of label annotation greatly hinders image processing and retrieval and causes a sharp drop in processing efficiency. Semi-supervised classification methods can mine the important classification information contained in unlabeled data and use the information of the labeled data to classify the unlabeled data. Among the many semi-supervised classification methods, graph-based methods have been a research hotspot in machine learning and data mining in recent years; on large-sample data, however, constructing a graph makes the computation complex and slow. How to improve classification efficiency and classification accuracy at the same time therefore remains a challenge for semi-supervised classification methods.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a robust semi-supervised image classification method based on data mining. In existing semi-supervised learning, labeled and unlabeled data are learned through a classification model to induce a model; however, a classification model does not mine the information in unlabeled data as well as a clustering model does, and, in addition, graph-based semi-supervised classification algorithms suffer from slow computation.
Technical scheme
A robust semi-supervised image classification method based on data mining is characterized by comprising the following steps:
step 1: stretching a data set comprising n images of a x b pixel size into an image data matrix
Figure BDA0003692146440000021
Normalizing the image data matrix by rows to make the average value of each row zero and the standard deviation 1, performing normalization processing, and then obtaining the original data matrix
Figure BDA0003692146440000022
Wherein n is the number of images,
Figure BDA0003692146440000023
is a pixel of a single image;
carrying out dimensionality reduction on the normalized image data matrix by using PCA, recording the dimensionality after dimensionality reduction as d, and obtaining a processed image matrix as
Figure BDA0003692146440000024
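As a concrete illustration of this preprocessing step, the following Python sketch flattens the images, normalizes each row (pixel position) to zero mean and unit standard deviation, and applies PCA retaining a prescribed contribution rate. The function name `preprocess_images` and the use of scikit-learn's PCA are illustrative assumptions, not part of the patent.

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess_images(images, contribution_rate=0.95):
    """Flatten, row-normalize, and PCA-reduce a stack of images.

    images: array of shape (n, a, b) -- n images of a x b pixels.
    Returns X of shape (d, n): d PCA dimensions, one column per image.
    """
    n = images.shape[0]
    A = images.reshape(n, -1).T              # (a*b, n), one flattened image per column
    # Normalize each row so that its mean is 0 and its standard deviation is 1.
    mean = A.mean(axis=1, keepdims=True)
    std = A.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                      # guard against constant rows
    A_norm = (A - mean) / std
    # PCA keeping the prescribed contribution rate (fraction of explained variance).
    pca = PCA(n_components=contribution_rate)
    X = pca.fit_transform(A_norm.T).T        # (d, n)
    return X, pca
```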
Step 2: constructing a robust semi-supervised image classification model based on data mining:
[formula image: objective function of the model]
s.t. P ≥ 0, P1 = 1, F_l = Y_l, F ≥ 0, F1 = 1
where: m is the number of clusters and is a model parameter; p_ij is the element in row i, column j of the matrix P and represents the membership of the i-th data point x_i to the j-th cluster; α is a fuzziness parameter; W is the classifier; x_i is the i-th column of the matrix X and represents the i-th sample; z_j is the j-th column vector of the matrix Z and represents the center of the j-th cluster; c is the number of classes, which must be given in advance according to the data set; f_ij is the element in row i, column j of the matrix F and represents the membership of the i-th data point x_i to the j-th class; F_l = Y_l means that l of the samples in F are labeled while the remaining n - l are not, and the memberships of the labeled samples must be given in advance; r is a fuzziness parameter; t_j is the j-th column vector of the matrix T, and every entry of t_j is 0 except the j-th, which is 1; 1 denotes a vector whose elements are all 1;
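The objective function itself appears in this text only as an image. Purely as an inferred sketch (not the patent's verbatim formula), the variable definitions above and the sub-problems solved in step 3 are consistent with an objective of the form

$$ \min_{P,Z,W,F}\;\sum_{i=1}^{n}\sum_{j=1}^{m} p_{ij}^{\alpha}\,\lVert W^{\top}x_i - z_j\rVert_2^{2} \;+\; \sum_{i=1}^{n}\sum_{j=1}^{c} f_{ij}^{\,r}\,\lVert W^{\top}x_i - t_j\rVert_2^{2}, $$

subject to the stated constraints. Two caveats: no trade-off weight between the two terms is assumed here because none appears in the parameter settings given later (α = 2, m = 20, r = 2), and the F-update step later defines d_ij with the cluster centers z_j rather than the indicators t_j, so the exact target of the second term cannot be confirmed from this text.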
Step 3: substitute the matrix X obtained in step 1 [formula image] into the classification model constructed in step 2, and optimize the classification model by alternating iteration to obtain the classifier W and the membership F of the unlabeled data.
The alternating iterative optimization comprises the following steps:
1. Initialize the indicator matrix T:
[formula image]
(by the definition above, each column t_j has a 1 in its j-th entry and 0 elsewhere)
2. Fix W and F and solve for P and Z:
When W and F are fixed, the classification model is equivalent to the following problem, which is solved by constructing a Lagrangian function:
[formula image]
s.t. P ≥ 0, P1 = 1
The Lagrangian function is constructed as:
[formula image]
Solving gives Z and P as:
[formula image]
[formula image]
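Because the closed-form updates above are only shown as images, the following Python sketch implements the standard fuzzy-c-means-style updates that the derivation describes (weighted mean for the cluster centers, inverse-distance memberships normalized to sum to one per sample). The function name, variable shapes, and exact expressions are assumptions consistent with the text, not the patent's verbatim formulas.

```python
import numpy as np

def update_Z_P(X, W, P, alpha, eps=1e-12):
    """One alternating update of cluster centers Z and memberships P.

    X: (d, n) data, W: (d, c) classifier, P: (n, m) memberships, alpha > 1.
    Returns Z of shape (c, m) and the updated P.
    """
    G = W.T @ X                                              # projected data, (c, n)
    Pa = P ** alpha                                          # (n, m)
    # Cluster centers: weighted means of the projected samples.
    Z = (G @ Pa) / (Pa.sum(axis=0, keepdims=True) + eps)     # (c, m)
    # Squared distances of each projected sample to each center.
    D = ((G[:, :, None] - Z[:, None, :]) ** 2).sum(axis=0)   # (n, m)
    # Membership update: inverse-distance weighting with exponent 1/(alpha-1),
    # normalized so that each row of P sums to one (P >= 0, P1 = 1).
    inv = (D + eps) ** (-1.0 / (alpha - 1.0))
    P_new = inv / inv.sum(axis=1, keepdims=True)
    return Z, P_new
```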
3. Fix P, Z, F, T and solve for W:
When P, Z, F and T are fixed, the classification model is equivalent to the following problem:
[formula image]
Let
[formula image]
and let S be the diagonal matrix with
[formula image]
The problem above then becomes:
[formula image]
Rewriting it in functional form:
[formula image]
and setting the partial derivative to zero:
[formula image]
the solution is obtained:
[formula image]
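The closed-form solution for W is likewise shown only as an image. Assuming the quadratic structure described (a diagonal weight matrix S and a target matrix Y built from the memberships, the cluster centers, and the class indicators), the update reduces to a weighted least-squares solve of the form W = (X S Xᵀ)⁻¹ X Y. The sketch below is an assumption-based illustration; the small ridge term `reg` is an added numerical safeguard not mentioned in the text.

```python
import numpy as np

def update_W(X, P, Z, F, alpha, r, reg=1e-8):
    """Weighted least-squares update of the classifier W (assumed form).

    X: (d, n) data; P: (n, m) and Z: (c, m) from the clustering term;
    F: (n, c) class memberships; the t_j are the standard basis vectors of R^c.
    """
    Pa, Fr = P ** alpha, F ** r
    T = np.eye(F.shape[1])                      # class indicator vectors t_j as columns
    s = Pa.sum(axis=1) + Fr.sum(axis=1)         # diagonal of S, one entry per sample
    Y = Pa @ Z.T + Fr @ T.T                     # (n, c): per-sample weighted targets
    S = np.diag(s)
    A = X @ S @ X.T + reg * np.eye(X.shape[0])  # (d, d)
    W = np.linalg.solve(A, X @ Y)               # (d, c)
    return W
```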
4. Fix W, P, Z and solve for F:
When W, P and Z are fixed, the classification model is equivalent to:
[formula image]
s.t. F_l = Y_l, F ≥ 0, F1 = 1
Let d_ij = ||W^T x_i - z_j||^2; the Lagrangian function is constructed as follows:
[formula image]
To find the optimal F, the partial derivative of the function L_3(F) with respect to F must be zero:
[formula image]
According to the given constraint
[formula image]
the solution is obtained:
[formula image]
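The F update is again an image; the text only states the constraints F_l = Y_l, F ≥ 0, F1 = 1 and the definition d_ij = ||W^T x_i - z_j||^2. A sketch of the corresponding constrained update is shown below: labeled rows are kept at their given memberships and unlabeled rows follow the usual inverse-distance rule with exponent 1/(r-1). It follows the text's definition of d_ij with the cluster centers z_j (whose number matches the class count in the later Coil20 example, where m = c = 20); it is an assumption consistent with the derivation, not the patent's exact expression.

```python
import numpy as np

def update_F(X, W, Z, Y_l, labeled_idx, r, eps=1e-12):
    """Update class memberships F under F_l = Y_l, F >= 0, F 1 = 1 (assumed form).

    X: (d, n); W: (d, c); Z: (c, m) cluster centers (as in d_ij = ||W^T x_i - z_j||^2);
    Y_l: (l, m) given memberships of the labeled samples (m = c in the Coil20 example);
    labeled_idx: indices of the labeled samples.
    """
    G = W.T @ X                                              # (c, n)
    D = ((G[:, :, None] - Z[:, None, :]) ** 2).sum(axis=0)   # (n, m) squared distances
    inv = (D + eps) ** (-1.0 / (r - 1.0))
    F = inv / inv.sum(axis=1, keepdims=True)                 # rows sum to one
    F[labeled_idx] = Y_l                                     # enforce F_l = Y_l
    return F
```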
5. Repeat steps 2 to 4; W and F are obtained after convergence. F gives the membership of each sample in the training set to each class, and the column index of a sample's largest membership value is the class to which the sample belongs.
Step 4: substitute the trained W into the formula
[formula image]
to obtain the membership F of each sample in the test set to each class; each column of F represents the membership of one sample to the classes, and the row index of a sample's largest membership value is the class to which the sample belongs, which completes the classification of the test set data.
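The prediction formula for test data is also available only as an image. A sketch consistent with the surrounding description is given below: project each test sample with the trained W, compute fuzzy memberships to the class indicator vectors t_j, and take the class with the largest membership. The use of t_j as reference points and the exponent 1/(r-1) are assumptions, and the function name `predict` is hypothetical.

```python
import numpy as np

def predict(X_test, W, r, eps=1e-12):
    """Membership-based prediction for test samples (assumed formula).

    X_test: (d, n_test); W: (d, c). Returns the membership matrix F_test (n_test, c)
    and the predicted class index of each sample.
    """
    G = W.T @ X_test                                         # (c, n_test)
    T = np.eye(G.shape[0])                                   # class indicators t_j
    D = ((G[:, :, None] - T[:, None, :]) ** 2).sum(axis=0)   # (n_test, c)
    inv = (D + eps) ** (-1.0 / (r - 1.0))
    F_test = inv / inv.sum(axis=1, keepdims=True)
    labels = F_test.argmax(axis=1)                           # class with largest membership
    return F_test, labels
```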
Advantageous effects
The invention provides a robust semi-supervised image classification method based on data mining. A data set containing n images of a×b pixels is stretched into an image data matrix, and on the basis of the dimensionality-reduced original data matrix X a robust semi-supervised image classification model based on data mining is proposed. The constructed objective function is optimized by alternating iteration to train the model, yielding the classifier W and the membership F of the unlabeled data. The trained W is substituted into the prediction formula [formula image] to obtain the membership of each sample in the test set to each class; the column index of a sample's largest membership value is the class to which it belongs, which completes the classification of the test set data.
The beneficial effects of the invention include:
(1) The invention provides a semi-supervised learning framework that mines the hidden information of unlabeled data with a clustering method, so the data are used more fully and results that better match reality are obtained.
(2) The computational complexity of the method is linear in the number of images n, which markedly improves the efficiency of image data processing. The invention therefore has strong practicability in practical engineering applications.
(3) The invention represents the class of a sample by membership degrees, so it is little affected by boundary points and is highly robust.
Drawings
FIG. 1: Flow chart of the semi-supervised image classification method
FIG. 2: Flow chart of the detailed implementation on the Coil20 object data set
Detailed Description
The invention will now be further described with reference to the following examples, and the accompanying drawings:
The basic data-processing flow of the invention is shown in FIG. 1, and the specific steps are as follows:
step 1: stretching a data set comprising n images of a x b pixel size into an image data matrix
Figure BDA0003692146440000061
Wherein n is the number of the images,
Figure BDA0003692146440000062
as pixels of a single image. Because the collected data sizes are not uniform, the image data matrix needs to be normalized according to rows before operation, so that the mean value of each row is zero, the standard deviation is 1, and the normalized original data matrix is obtained
Figure BDA0003692146440000063
Considering that the data in the image is sparse and the subsequent inversion operation is inconvenient, and meanwhile, in order to improve the operation speed, the normalized image needs to be subjected to normalizationThe data matrix is subjected to dimensionality reduction by PCA (principal component analysis), dimensionality reduction is carried out in a mode of reserving a certain contribution rate, the dimensionality after dimensionality reduction is recorded as d, and the processed image matrix is
Figure BDA0003692146440000064
Step 2: on the basis of the dimensionality-reduced original data matrix X, the following robust semi-supervised image classification model based on data mining is proposed:
[formula image: objective function, problem (1)]
where m is the number of clusters and is a model parameter; p_ij is the element in row i, column j of the matrix P and represents the membership of the i-th data point x_i to the j-th cluster; α is a fuzziness parameter; W is the classifier; x_i is the i-th column of the matrix X and represents the i-th sample; z_j is the j-th column vector of the matrix Z and represents the center of the j-th cluster; c is the number of classes, which must be given in advance according to the data set; f_ij is the element in row i, column j of the matrix F and represents the membership of the i-th data point x_i to the j-th class; F_l = Y_l means that l of the samples in F are labeled while the remaining n - l are not, and the memberships of the labeled samples must be given in advance; r is a fuzziness parameter; t_j is the j-th column vector of the matrix T, and every entry of t_j is 0 except the j-th, which is 1.
Step 3: optimize the objective function constructed in step 2 by alternating iteration and train the model to obtain the classifier W and the membership F of the unlabeled data.
First, initialize the indicator matrix T:
[formula image]
Fixing W and F, solving P and Z
When W and F are fixed, problem (1) is equivalent to:
Figure BDA0003692146440000072
the optimization problem is a constrained optimization problem, and can be solved by constructing a Lagrangian function, wherein the Lagrangian function is constructed as follows:
Figure BDA0003692146440000073
note that in the formula (4), the matrix P does not participate in the operation in the form of a matrix, but participates in the operation in the form of elements, Z participates in the operation in the form of vectors, and each element and vector of the same matrix are independent from each other, so that each element in the matrix can be calculated respectively, that is, the optimal solution of each element and vector is obtained first, and the set of the optimal solutions of all elements and vectors is the optimal solution of the matrix. To find the optimum p ij And z j Function L 1 (P, Z) to P ij And z j The partial derivatives of both variables need to be zero, thus yielding a series of equations:
Figure BDA0003692146440000074
Figure BDA0003692146440000075
it is noted that
Figure BDA0003692146440000076
Combined with formula (5)
Figure BDA0003692146440000077
Figure BDA0003692146440000078
Obtaining Z and P by iteration to convergence by using the formulas (7) and (8);
Third, fix P, Z, F, T and solve for W.
When P, Z, F and T are fixed, problem (1) is equivalent to:
[formula image] (9)
Noting that the columns of T are unit vectors, let
[formula image]
and let S be the diagonal matrix with
[formula image]
Using the matrices Y and S, formula (9) can be converted to
[formula image]
This is an unconstrained optimization problem and can be solved with partial derivatives. Rewriting the above formula in functional form:
[formula image]
To find the optimal W, the partial derivative of the function L_2(W) with respect to W must be zero:
[formula image]
Solving gives:
[formula image]
Fourth, fix W, P, Z and solve for F.
When W, P and Z are fixed, problem (1) is equivalent to:
[formula image]
This is a constrained optimization problem and can be solved by constructing a Lagrangian function. Let d_ij = ||W^T x_i - z_j||^2; the Lagrangian function is constructed as follows:
[formula image]
[formula image]
To find the optimal F, the partial derivative of the function L_3(F) with respect to F must be zero:
[formula image]
Noting that
[formula image]
the solution is obtained:
[formula image]
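The closed-form solution is again shown only as an image. With d_ij = ||W^T x_i - z_j||^2 as defined above, the usual solution of this constrained problem for the unlabeled samples (given here as an inferred sketch, not the patent's verbatim expression) is

$$ f_{ij}=\frac{d_{ij}^{-\frac{1}{r-1}}}{\sum_{k} d_{ik}^{-\frac{1}{r-1}}}\quad\text{for unlabeled } x_i, \qquad F_l = Y_l \text{ for the labeled samples.} $$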
Fifth, repeat the second through fourth steps; W and F are obtained after convergence. F gives the membership of each sample in the training set to each class, and the column index of a sample's largest membership value is the class to which the sample belongs.
Step 4: substitute the trained W into the formula
[formula image]
to obtain the membership of each sample in the test set to each class; the column index of a sample's largest membership value is the class to which the sample belongs, which completes the classification of the test set data.
The specific embodiment is as follows:
The invention provides a robust semi-supervised image classification method based on data mining. The specific implementation steps of the proposed classification method are described by taking the object image data set Coil20 as an example, but the technical content of the invention is not limited to the described scope. The object image data set Coil20 contains 1440 object images of 32 × 32 pixels, covering 20 objects in total. The data set was obtained by photographing each object every 5 degrees while it rotated horizontally through a full circle, i.e. 72 images per object and 1440 images in total.
Implementation step 1: take 64 images of each object as the training set and the remaining 8 images as the test set, and stretch the 1280 training images into an image data matrix [formula image], where 1024 = 32 × 32 is the total number of pixels of a single Coil20 image.
Implementation step 2: normalize the image data matrix obtained in the previous step [formula image] so that the row means of the data matrix are 0 and the standard deviations are 1; record the centered image data matrix as [formula image]. Then reduce the dimensionality of the normalized image data matrix with PCA, retaining a 95% contribution rate; experiments show the target dimensionality to be 88. Record the processed matrix as [formula image].
Implementation step 3: on the basis of the image data matrix X, randomly initialize a membership matrix [formula image] whose rows each sum to 1; for labeled data, the membership is the corresponding label. Randomly initialize [formula image]. Set the parameters α = 2, m = 20 and r = 2.
Implementation step 4: initialize the indicator matrix T:
[formula image]
Implementation step 5: fix W and F, and update the matrices P and Z by the following expressions:
[formula image]
[formula image]
Implementation step 6: compute the matrices Y and S:
[formula image]
where S is a diagonal matrix with
[formula image]
Implementation step 7: fix P, Z, F, and update the matrix W by the following expression:
[formula image]
Implementation step 8: fix P, Z, W, and update the classification membership matrix F by the following expression:
[formula image]
Implementation step 9: repeat implementation steps 4 to 8 until the value of the objective function converges, and output the classifier W and the classification membership matrix [formula image], where each row of F corresponds to one sample and the column index of the largest value in each row indicates the class to which the sample belongs.
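Putting the implementation steps together, the following sketch shows the alternating optimization loop for the Coil20 example (α = 2, m = 20, r = 2, c = 20), reusing the hypothetical update functions sketched earlier (`update_Z_P`, `update_W`, `update_F`). The initialization details and the convergence test on successive membership matrices are assumptions; the original text stops when the objective value converges.

```python
import numpy as np

def train(X, y_labeled, labeled_idx, n_classes=20, m=20, alpha=2.0, r=2.0,
          n_iter=100, tol=1e-6, seed=0):
    """Alternating optimization of P, Z, W, F (sketch of implementation steps 3-9).

    X: (d, n) training data; y_labeled: integer class labels of the labeled samples;
    labeled_idx: their column indices in X. Assumes m == n_classes, as in Coil20.
    Requires update_Z_P, update_W, update_F from the earlier sketches.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Random membership matrices with rows summing to one (implementation step 3).
    F = rng.random((n, n_classes))
    F /= F.sum(axis=1, keepdims=True)
    Y_l = np.eye(n_classes)[y_labeled]            # one-hot memberships of labeled samples
    F[labeled_idx] = Y_l
    P = rng.random((n, m))
    P /= P.sum(axis=1, keepdims=True)
    W = rng.random((d, n_classes))
    for _ in range(n_iter):
        Z, P = update_Z_P(X, W, P, alpha)               # implementation step 5
        W = update_W(X, P, Z, F, alpha, r)              # implementation steps 6-7
        F_new = update_F(X, W, Z, Y_l, labeled_idx, r)  # implementation step 8
        if np.abs(F_new - F).max() < tol:               # simple convergence test
            F = F_new
            break
        F = F_new
    return W, F
```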
Implementation step 10: substitute the trained W into the formula
[formula image]
to obtain the membership of each sample in the test set to each class; the column index of a sample's largest membership value is the class to which it belongs, which completes the classification of the test set data.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.

Claims (1)

1. A robust semi-supervised image classification method based on data mining is characterized by comprising the following steps:
step 1: stretching a data set comprising n a x b pixel scale images into an image data matrix
Figure FDA0003692146430000011
Normalizing the image data matrix according to rows to make the average value of each row zero and the standard deviation 1, and then normalizing the original data matrix
Figure FDA0003692146430000012
Wherein n is the number of the images,
Figure FDA0003692146430000013
is a pixel of a single image;
carrying out dimensionality reduction on the normalized image data matrix by using PCA, recording the dimensionality after dimensionality reduction as d, and obtaining a processed image matrix as
Figure FDA0003692146430000014
Step 2: construct a robust semi-supervised image classification model based on data mining:
[formula image: objective function of the model]
s.t. P ≥ 0, P1 = 1, F_l = Y_l, F ≥ 0, F1 = 1
where: m is the number of clusters and is a model parameter; p_ij is the element in row i, column j of the matrix P and represents the membership of the i-th data point x_i to the j-th cluster; α is a fuzziness parameter; W is the classifier; x_i is the i-th column of the matrix X and represents the i-th sample; z_j is the j-th column vector of the matrix Z and represents the center of the j-th cluster; c is the number of classes, which must be given in advance according to the data set; f_ij is the element in row i, column j of the matrix F and represents the membership of the i-th data point x_i to the j-th class; F_l = Y_l means that l of the samples in F are labeled while the remaining n - l are not, and the memberships of the labeled samples must be given in advance; r is a fuzziness parameter; t_j is the j-th column vector of the matrix T, and every entry of t_j is 0 except the j-th, which is 1; 1 denotes a vector whose elements are all 1;
Step 3: substitute the matrix X obtained in step 1 [formula image] into the classification model constructed in step 2, and optimize the classification model by alternating iteration to obtain the classifier W and the membership F of the unlabeled data:
the alternating iterative optimization process is as follows:
1. Initialize the indicator matrix T:
[formula image]
2. Fix W and F and solve for P and Z:
When W and F are fixed, the classification model is equivalent to the following problem, which is solved by constructing a Lagrangian function:
[formula image]
s.t. P ≥ 0, P1 = 1
The Lagrangian function is constructed as:
[formula image]
Solving gives Z and P as:
[formula image]
[formula image]
3. Fix P, Z, F, T and solve for W:
When P, Z, F and T are fixed, the classification model is equivalent to the following problem:
[formula image]
Let
[formula image]
and let S be the diagonal matrix with
[formula image]
The problem above then becomes:
[formula image]
Rewriting it in functional form:
[formula image]
and setting the partial derivative to zero:
[formula image]
the solution is obtained:
[formula image]
4. Fix W, P, Z and solve for F:
When W, P and Z are fixed, the classification model is equivalent to:
[formula image]
s.t. F_l = Y_l, F ≥ 0, F1 = 1
Let d_ij = ||W^T x_i - z_j||^2; the Lagrangian function is constructed as follows:
[formula image]
To find the optimal F, the partial derivative of the function L_3(F) with respect to F must be zero:
[formula image]
According to the given constraint
[formula image]
the solution is obtained:
[formula image]
5. Repeat steps 2 to 4; W and F are obtained after convergence; F gives the membership of each sample in the training set to each class, and the column index of a sample's largest membership value is the class to which the sample belongs;
Step 4: substitute the trained W into the formula
[formula image]
to obtain the membership F of each sample in the test set to each class; each column of F represents the membership of one sample to the classes, and the row index of a sample's largest membership value is the class to which the sample belongs, thereby completing the classification of the test set data.
CN202210718517.9A 2022-06-13 2022-06-13 Robust semi-supervised image classification method based on data mining Active CN115131610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210718517.9A CN115131610B (en) 2022-06-13 2022-06-13 Robust semi-supervised image classification method based on data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210718517.9A CN115131610B (en) 2022-06-13 2022-06-13 Robust semi-supervised image classification method based on data mining

Publications (2)

Publication Number Publication Date
CN115131610A true CN115131610A (en) 2022-09-30
CN115131610B CN115131610B (en) 2024-02-27

Family

ID=83379819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210718517.9A Active CN115131610B (en) 2022-06-13 2022-06-13 Robust semi-supervised image classification method based on data mining

Country Status (1)

Country Link
CN (1) CN115131610B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446806A (en) * 2016-09-08 2017-02-22 山东师范大学 Semi-supervised face identification method and system based on fuzzy membership degree sparse reconstruction
CN110263804A (en) * 2019-05-06 2019-09-20 杭州电子科技大学 A kind of medical image dividing method based on safe semi-supervised clustering
WO2022001159A1 (en) * 2020-06-29 2022-01-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Latent low-rank projection learning based unsupervised feature extraction method for hyperspectral image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446806A (en) * 2016-09-08 2017-02-22 山东师范大学 Semi-supervised face identification method and system based on fuzzy membership degree sparse reconstruction
CN110263804A (en) * 2019-05-06 2019-09-20 杭州电子科技大学 A kind of medical image dividing method based on safe semi-supervised clustering
WO2022001159A1 (en) * 2020-06-29 2022-01-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Latent low-rank projection learning based unsupervised feature extraction method for hyperspectral image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱乐为; 胡恩良: "Semi-supervised balanced fuzzy C-means clustering" (半监督平衡化模糊C-means聚类), Journal of Yunnan Minzu University (Natural Sciences Edition), no. 03, 30 June 2019 (2019-06-30), pages 278-284 *

Also Published As

Publication number Publication date
CN115131610B (en) 2024-02-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant