Disclosure of Invention
The inventor finds that the related art is a non-end-to-end solution: feature processing and emotion classification are treated as several independent steps, each an independent task, and the quality of each step's result affects the next step and thus the overall training effect.
The method fuses the feature processing part and the image classification part into one objective function, so that the two parts can be trained together. A prediction result is obtained from the original-feature input end to the classification-label output end; comparing the prediction result with the real result yields an error, which is propagated through every part of the objective function, and the representation of each part is adjusted according to the error until the whole objective function converges or reaches the expected effect. All intermediate operations are contained in the objective function, so the overall training effect of the feature processing part and the image classification part is improved.
Some embodiments of the present disclosure provide a training method of an image classification model, including:
constructing an objective function for training an image classification model, wherein the objective function comprises a feature conversion part from an original feature representation of an image to a potential feature representation, and an image classification model from the potential feature representation, serving as an input feature representation, to an output label representation;
and training the objective function by using the original features of each image in the training set and the labeled classification labels, so as to simultaneously determine the values of the projection matrix representation, from the original feature representation of the image to the potential feature representation of the image, in the feature conversion part, and the values of the regression coefficient representation, from the potential feature representation to the output label representation, in the image classification model.
In some embodiments, the raw feature representation of the image comprises raw feature representations of a plurality of perspectives of the image, and the projection matrix representation comprises a respective plurality of projection matrix representations of the raw feature representation of each perspective of the image to a potential feature representation of the image.
In some embodiments, the feature transformation portion includes a relational representation of the original feature representation to the potential feature representation of the image constructed based on the projection matrix representation.
In some embodiments, the relationship of the original feature representation to the potential feature representation of the image is represented as: a function of the norm of the difference between the product of the original feature representation of the image with the projection matrix representation, and the potential feature representation.
In some embodiments, the feature conversion section further includes: one or more of redundant constraints of the original feature representation of the different view angles, low rank constraints of the projection matrix representation, and regularization constraints of the potential feature representation.
In some embodiments, the redundancy constraint of the original feature representations of the different views is a product of the transpose of a first projection matrix representation corresponding to one view, a covariance matrix of the original feature representations of the respective views, and a second projection matrix representation corresponding to another view.
In some embodiments, the low rank constraint of the projection matrix representation is a nuclear norm of the projection matrix representation.
In some embodiments, the regularization constraint of the potential feature representation is a function of a norm of the potential feature representation.
In some embodiments, the image classification model is a lasso regression image classification model or a logistic regression image classification model.
In some embodiments, the lasso regression image classification model includes a first relational representation of the potential feature representation to the output label representation constructed based on a regression coefficient representation and a regularization constraint of the regression coefficient representation.
In some embodiments, the logistic regression image classification model includes a second relational representation of the potential feature representation to the output label representation constructed based on regression coefficient representations.
In some embodiments, the raw feature representations for the multiple perspectives of the image comprise a foreground feature representation and a background feature representation of the image.
In some embodiments, the raw features of each image in the training set include foreground features and background features of the respective image, wherein the foreground features of an image are extracted using a VGG neural network model, and/or the background features of an image are extracted using an AlexNet model.
In some embodiments, the method further comprises: for an image to be classified, processing the original features of the image to be classified using the values of the trained projection matrix representation to obtain the potential features of the image to be classified, and inputting the potential features of the image to be classified into the trained image classification model to output a classification label corresponding to the image to be classified.
In some embodiments, the classification label labeled by each image in the training set and the classification label corresponding to the image to be classified are emotion labels.
Some embodiments of the present disclosure provide a training apparatus for an image classification model, including:
a memory; and a processor coupled to the memory, the processor configured to perform the training method of any of the embodiments based on instructions stored in the memory.
Some embodiments of the disclosure propose a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training method described in any of the embodiments.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure.
Fig. 1 is a schematic flow chart illustrating some embodiments of a training method for an image classification model according to the present disclosure. The image classification model may, for example, classify the emotion of an image, such as whether the image is positive or negative, or whether it expresses happiness, anger, awe, contentment, disgust, fear, etc., but is not limited to these examples.
As shown in fig. 1, the training method of this embodiment includes: steps 11-12 (step 12 includes steps 121-123).
In step 11, an objective function for training an image classification model is constructed, the objective function including a feature conversion part of the original feature representation of the image into a potential feature representation and an image classification model of the potential feature representation serving as an input feature representation into an output label representation.
The original feature representation of the image may include an original feature representation of one perspective of the image, or may include original feature representations of multiple perspectives of the image for a more complete description of the image. The raw feature representation of each view of the image corresponds to one projection matrix representation to the potential feature representation of the image, and thus, in the case of raw feature representations for multiple views, to multiple projection matrix representations. For example, the raw feature representations of the plurality of perspectives of the image comprise a foreground feature representation and a background feature representation of the image, the foreground feature representation to potential feature representation corresponding to the first projection matrix representation and the background feature representation to potential feature representation corresponding to the second projection matrix representation, respectively.
The feature conversion section and the image classification model are explained separately below.
The feature transformation part comprises a relationship representation of an original feature representation to a potential feature representation of the image constructed based on the projection matrix representation, and can optionally further comprise: one or more of redundant constraints of the original feature representation of the different view angles, low rank constraints of the projection matrix representation, and regularization constraints of the potential feature representation.
The formula is expressed as follows:

min_(P1,P2,Z) ||X1P1 + X2P2 - Z||_F^2 + φ·tr(P1^T Σ P2) + λ1||Z||_F^2 + β1||P1||_* + β2||P2||_*    (4-1)

By the above formula, the overall correlation of different viewing angles is maximized while the redundant parts of the features are partially discarded, thereby fully utilizing the complementary properties of different viewing angles to eliminate the heterogeneity between them. The meaning of each symbol is specifically described in the description of each component term below. The method is also called the low-rank multi-view learning method and aims to learn a potential (shared) feature representation Z ∈ R^(N×D) and a set of projection matrices with low-rank constraints. Taking the extraction of the original features of two views as an example, X1 and X2 represent the foreground feature representation and the background feature representation of the image respectively, and P1 and P2 are the (low-rank) projection matrices from the original feature representations to the potential feature representation.
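The matrix shapes implied by this notation can be illustrated with a minimal NumPy sketch (the dimensions below are hypothetical, chosen only to show how the views and projections fit together; this is not part of the disclosed method itself):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d1, d2, D = 8, 5, 7, 3           # hypothetical: N samples, view dims d1/d2, latent dim D

X1 = rng.normal(size=(N, d1))       # view 1 (e.g. foreground) original features
X2 = rng.normal(size=(N, d2))       # view 2 (e.g. background) original features
P1 = rng.normal(size=(d1, D))       # projection matrix of view 1 into the shared space
P2 = rng.normal(size=(d2, D))       # projection matrix of view 2 into the shared space

# Both views project into the same N x D latent space, so they can be compared
# against a single shared representation Z.
assert (X1 @ P1).shape == (N, D)
assert (X2 @ P2).shape == (N, D)
```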
The relationship representation of the original feature representation to the potential feature representation of the image is, for example: a function of the norm of the difference between the product of the original feature representations of the image (foreground feature representation X1 and background feature representation X2) with their projection matrix representations (first projection matrix representation P1 from X1 to the potential feature representation Z, and second projection matrix representation P2 from X2 to Z), and the potential feature representation Z. The formula is expressed as follows:

||X1P1 + X2P2 - Z||_F^2    (4-2)

The function of the norm shown in equation (4-2) is the square of the norm, and may also be an even power of the norm, such as the 4th power, the 6th power, etc.
Equation (4-2) is intended to seek the group (P1, P2, Z) that minimizes ||X1P1 + X2P2 - Z||_F^2; when this term is minimal, Z represents X1 and X2 to the maximum extent.
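As a check on the reconstruction term of equation (4-2), it can be evaluated directly; a minimal sketch (shapes are illustrative):

```python
import numpy as np

def reconstruction_term(X1, P1, X2, P2, Z):
    """Squared Frobenius norm ||X1 P1 + X2 P2 - Z||_F^2 of equation (4-2)."""
    R = X1 @ P1 + X2 @ P2 - Z
    return float(np.sum(R ** 2))

# The term vanishes exactly when Z equals the sum of the projected views.
X1, P1 = np.ones((4, 2)), np.ones((2, 3))
X2, P2 = np.ones((4, 5)), np.ones((5, 3))
Z = X1 @ P1 + X2 @ P2
print(reconstruction_term(X1, P1, X2, P2, Z))  # 0.0
```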
If the original feature has only one perspective, then equation (4-2) can be replaced with equation (4-3).
Equation (4-3) is a relational representation between the original feature representation X and the potential feature representation Z of the image, i.e., a function of the norm of the difference between the product of the original feature representation X with the projection matrix representation P, and the potential feature representation Z:

||XP - Z||_F^2    (4-3)

(The function of the norm shown in equation (4-3) is the square of the norm, and may also be an even power of the norm, such as the 4th power, the 6th power, etc.)
Equation (4-3) is intended to seek the group (P, Z) that minimizes ||XP - Z||_F^2; when this term is minimal, Z represents X to the maximum extent.
The redundancy constraint for the original feature representations of different views is the product of the transpose of the first projection matrix representation corresponding to one view, the covariance matrix of the original feature representations of the respective views, and the second projection matrix representation corresponding to the other view. The formula is expressed as follows:

φ·tr(P1^T Σ P2)    (4-4)
where Σ is defined as the covariance matrix of X1 and X2, φ is a configurable parameter, and tr denotes the trace of a matrix. Equation (4-4) is a redundancy constraint on the original feature representations of different viewing angles, intended to remove the redundant information of X1 and X2 as much as possible.
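A sketch of the redundancy term of equation (4-4); here Σ is computed as the cross-covariance of the two views' centered features, an assumption consistent with the definition above:

```python
import numpy as np

def redundancy_term(X1, X2, P1, P2, phi=1.0):
    """phi * tr(P1^T Sigma P2), Sigma being the covariance of views X1 and X2."""
    Xc1 = X1 - X1.mean(axis=0)                 # center each view
    Xc2 = X2 - X2.mean(axis=0)
    Sigma = Xc1.T @ Xc2 / (X1.shape[0] - 1)    # d1 x d2 cross-covariance
    return phi * float(np.trace(P1.T @ Sigma @ P2))

# A view is maximally redundant with itself, so the term is positive there.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
P = np.eye(4)
print(redundancy_term(X, X, P, P) > 0)  # True
```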
The regularization constraint of the potential feature representation is a function of the norm of the potential feature representation, intended to improve the over-fitting problem, and is formulated as:

λ1||Z||_F^2    (4-5)
where λ1 is a configurable parameter (the function of the norm shown in equation (4-5) is the square of the norm, and may also be an even power of the norm, such as the 4th power, the 6th power, etc.).
The low-rank constraint of the projection matrix representation is the nuclear norm of the projection matrix representation; it aims to remove detailed redundant information such as image noise as far as possible and helps to learn a more robust feature subspace. The formula is expressed as follows:

β1||P1||_* + β2||P2||_*    (4-6)
where β1 and β2 are configurable parameters, and ||·||_* denotes the nuclear norm (the sum of the singular values of a matrix).
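Since the nuclear norm is the sum of a matrix's singular values, the low-rank penalty of equation (4-6) can be evaluated with an SVD (a minimal sketch):

```python
import numpy as np

def low_rank_term(P1, P2, beta1=1.0, beta2=1.0):
    """beta1*||P1||_* + beta2*||P2||_* from equation (4-6); ||.||_* sums singular values."""
    nuc = lambda P: float(np.linalg.svd(P, compute_uv=False).sum())
    return beta1 * nuc(P1) + beta2 * nuc(P2)

# For a rank-1 matrix u v^T, the nuclear norm is ||u|| * ||v||.
u = np.array([3.0, 4.0])              # norm 5
v = np.array([1.0, 0.0, 0.0])         # norm 1
P = np.outer(u, v)
print(round(low_rank_term(P, P, 1.0, 0.0), 6))  # 5.0
```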
The image classification model is a lasso regression image classification model or a logistic regression image classification model. The lasso regression image classification model includes a first relational representation of the potential feature representation Z to the output label representation Y constructed based on the regression coefficient representation B, and a regularization constraint of the regression coefficient representation B. The formula of the lasso regression image classification model is expressed as follows:

min_(Z,B) ||Y - ZB||^2 + λ2||B||_1    (4-7)
the above formula is intended to seek (Y-ZB)
2+λ
2||B||
1The smallest group (Z, B) when (Y-ZB)
2+λ
2||B||
1At the minimum, Y can represent B to the maximum extent. Regularization constraint for regression coefficient representation B
With the intention of improving the overfitting problem, the regularization constraint of the regression coefficients is for example the 1 norm, λ, of the regression coefficients
2Is a configurable parameter.
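The lasso objective of equation (4-7) can be evaluated directly (a sketch; shapes are illustrative):

```python
import numpy as np

def lasso_loss(Y, Z, B, lam2=0.1):
    """||Y - Z B||^2 + lam2 * ||B||_1 from equation (4-7)."""
    return float(np.sum((Y - Z @ B) ** 2) + lam2 * np.abs(B).sum())

# When ZB reproduces Y exactly, only the l1 penalty on B remains.
Z = np.eye(3)
B = np.diag([1.0, 2.0, 3.0])
Y = Z @ B
print(lasso_loss(Y, Z, B, lam2=0.5))  # 3.0  (0 + 0.5 * (1 + 2 + 3))
```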
The logistic regression image classification model includes a second relational representation of the potential feature representation Z to the output label representation Y constructed based on the regression coefficient representation B, formulated as follows:

min_(Z,B) ||Y - ZB||^2    (4-8)

The above formula is intended to seek the group (Z, B) that minimizes ||Y - ZB||^2; when this term is minimal, ZB approximates the label representation Y to the maximum extent.
As previously described, the objective function used to train the image classification model fuses the feature transformation portion of the original feature representation of the image into the latent feature representation and the image classification model of the latent feature representation used as the input feature representation into the output label representation.
In one case, the overall objective function is a combination of equations (4-1) and (4-7), expressed as follows:

min_(P1,P2,Z,B) ||X1P1 + X2P2 - Z||_F^2 + φ·tr(P1^T Σ P2) + λ1||Z||_F^2 + β1||P1||_* + β2||P2||_* + ||Y - ZB||^2 + λ2||B||_1    (4-9)

Equation (4-9) is intended to seek the group (P1, P2, B) that minimizes the whole expression after min. After the feature conversion part is fused with the image classification model, Z is an intermediate variable that does not need to be output by the objective function.
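A sketch evaluating the fused objective of equations (4-1) and (4-7) term by term (Σ is taken as the cross-covariance of the centered views; all parameter values and shapes are illustrative assumptions):

```python
import numpy as np

def fused_objective(X1, X2, Y, P1, P2, Z, B,
                    phi=1.0, lam1=0.1, beta1=0.1, beta2=0.1, lam2=0.1):
    """Value of the fused objective: feature conversion part + lasso regression part."""
    Xc1, Xc2 = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)
    Sigma = Xc1.T @ Xc2 / (X1.shape[0] - 1)            # cross-covariance of the views
    recon = np.sum((X1 @ P1 + X2 @ P2 - Z) ** 2)       # reconstruction term
    redun = phi * np.trace(P1.T @ Sigma @ P2)          # redundancy constraint
    reg_z = lam1 * np.sum(Z ** 2)                      # regularization of Z
    nuc = lambda P: np.linalg.svd(P, compute_uv=False).sum()
    low_rank = beta1 * nuc(P1) + beta2 * nuc(P2)       # low-rank constraint
    classify = np.sum((Y - Z @ B) ** 2) + lam2 * np.abs(B).sum()  # lasso part
    return float(recon + redun + reg_z + low_rank + classify)

rng = np.random.default_rng(2)
N, d1, d2, D, C = 10, 4, 6, 3, 2
val = fused_objective(rng.normal(size=(N, d1)), rng.normal(size=(N, d2)),
                      rng.normal(size=(N, C)), rng.normal(size=(d1, D)),
                      rng.normal(size=(d2, D)), rng.normal(size=(N, D)),
                      rng.normal(size=(D, C)))
assert np.isfinite(val)
```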
In another case, the overall objective function is obtained by fusing equation (4-1) and equation (4-8), expressed as follows:

min_(P1,P2,Z,B) ||X1P1 + X2P2 - Z||_F^2 + φ·tr(P1^T Σ P2) + λ1||Z||_F^2 + β1||P1||_* + β2||P2||_* + ||Y - ZB||^2    (4-10)

Equation (4-10) is intended to seek the group (P1, P2, B) that minimizes the whole expression after min. After the feature conversion part is fused with the image classification model, Z is an intermediate variable that does not need to be output by the objective function.
In step 12, the original features of each image in the training set and the labeled classification labels are used to train the objective function, so as to simultaneously determine the values of the projection matrix representation (i.e., the projection matrices) from the original feature representation of the image to the potential feature representation in the feature conversion part, and the values of the regression coefficient representation (i.e., the regression coefficients) from the potential feature representation to the output label representation in the image classification model, thereby completing the training of the feature conversion part and the image classification model at the same time.
Step 12 is described below by steps 121-123.
In step 121, the classification labels (e.g., emotion classification labels) corresponding to the images in the training set are labeled.
In step 122, feature extraction is performed on each image in the training set, and the extracted features are original features of the image.
The human visual system has a strong information processing capability, and the relationship between vision and emotion is complicated. Reasonably constructing the relationship between high-level visual emotion semantics and low-level visual features, and understanding the emotional information expressed by a user from a cognitive perspective, are important research topics for perception-oriented visual emotion analysis.
As previously mentioned, to describe an image more completely, the image may be described from multiple perspectives including, but not limited to, color features, texture features, object features (also referred to as foreground features), scene features (also referred to as background features), and the like. For example, an image may be described from two perspectives, foreground and background, with the extracted original features of the image including its foreground features and background features. Foreground features of the image may be extracted using a VGG (e.g., VGG16) neural network model, and background features using an AlexNet model. The VGG16 neural network model and the AlexNet model are obtained by pre-training. Because the VGG16 network can be pre-trained on 1000 classes of image object types, the 4096-dimensional features output by its fully connected layer are mainly object features of the image; these object features mainly capture the main body of the image and serve as its foreground features. The AlexNet network can be pre-trained on 205 classes of scenes; the 2048-dimensional features output by its fully connected layer mainly capture the background content of the image and serve as its background (scene) features. In addition, using the pre-trained parameters as the initial parameters of the feature extraction section can accelerate the convergence of the network.
Describing the image jointly with image features of multiple viewing angles (such as foreground features and background features) yields rich information, so that the emotional content a user expresses in the image can be better understood. However, due to the heterogeneity among different modalities, directly concatenating the features of multiple modalities into one large feature vector easily causes information redundancy. For this reason, it is necessary to find a potential feature space shared by all modalities by exploiting the complementarity among different modalities, so this embodiment uses the low-rank multi-view learning method to obtain the most intrinsic feature representation of an image, as described in detail in the feature processing section.
In step 123, the original features of each image in the training set and the labeled classification labels are used to train the objective function until the whole objective function converges or reaches the expected effect, so as to complete the training of the feature transformation portion and the image classification model at the same time, and finally determine the projection matrix in the feature transformation portion and the regression coefficient in the image classification model at the same time.
The training end conditions are, for example: the difference between two successive objective function values in the iterative process is calculated and the training is stopped if the relative change in the objective function values is below a preset threshold (e.g. 0.001) and/or a preset maximum number of iterations is reached (e.g. 30).
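The stopping rule described above can be sketched as a generic training wrapper; the `step` callback (for example, one round of alternating updates of P1, P2, Z, and B) is an assumed placeholder, since the disclosure does not fix a particular optimizer:

```python
def train_until_converged(step, params, tol=0.001, max_iters=30):
    """Iterate `step`, which returns (new_params, objective_value), until the
    relative change of the objective drops below `tol` or `max_iters` is reached."""
    prev = None
    for it in range(max_iters):
        params, value = step(params)
        if prev is not None and abs(prev - value) / max(abs(prev), 1e-12) < tol:
            break
        prev = value
    return params, it + 1

# Toy objective that geometrically approaches 1: converges well before 30 iterations.
result, iters = train_until_converged(lambda p: (1 + (p - 1) * 0.1,) * 2, 2.0)
assert iters < 30 and abs(result - 1.0) < 1e-3
```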
The method fuses the feature processing part and the image classification part into one objective function, so that the two parts can be trained together. A prediction result is obtained from the original-feature input end to the classification-label output end; comparing the prediction result with the real result yields an error, which is propagated through every part of the objective function, and the representation of each part is adjusted according to the error until the whole objective function converges or reaches the expected effect. All intermediate operations are contained in the objective function, so the overall training effect of the feature processing part and the image classification part is improved.
In addition, describing the image jointly with image features of multiple viewing angles (such as foreground features and background features) yields rich information, and the multi-view subspace learning method maps each sample in a high-dimensional space to a low-dimensional subspace. The learned features of each subspace are thus retained and the complementarity among the multiple viewing angles is fully utilized, which improves the emotion classification effect on images, avoids the impact of dimension explosion in learning, and has good practicability.
Fig. 2 is a schematic flow chart of some embodiments of the disclosed image classification method. This embodiment may, for example, classify the emotion of an image, such as analyzing whether the image is positive or negative, or whether it expresses happiness, anger, awe, contentment, disgust, fear, etc., but is not limited to the examples given.
As shown in fig. 2, the image classification method of this embodiment includes: steps 21-23.
In step 21, for an image to be classified, its original features are extracted, for example, its foreground feature X1 and background feature X2.
The feature extraction method may refer to an extraction method of image features in a training set.
In step 22, the original features of the image to be classified are processed by using the projection matrix obtained by training to obtain the potential features of the image to be classified. For example, the original features of different perspectives of the image to be classified are subjected to feature processing to mine potential shared features (which may be referred to as potential features for short).
The formula is expressed as follows:
Z=X1P1+X2P2
in step 23, the latent features of the image to be classified are input into the trained image classification model, and the classification label corresponding to the image to be classified is output.
In the case of the lasso regression image classification model, the regression coefficients B have been determined by training; the potential features Z are input into min ||Y - ZB||^2 + λ2||B||_1, and the corresponding classification label Y is output. Classification is thus achieved by means of lasso regression.
If the image classification model is a logistic regression image classification model, the regression coefficients B have likewise been determined by training; the potential features Z are input into min ||Y - ZB||^2, and the corresponding classification label Y is output. Classification is thus achieved by means of logistic regression.
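Steps 21-23 can be sketched end to end for the lasso variant; converting the continuous output ZB into a discrete label via argmax is an illustrative assumption, since the disclosure does not specify the decision rule:

```python
import numpy as np

def classify(x1, x2, P1, P2, B):
    """Step 22: project per-view features into the latent space (Z = X1 P1 + X2 P2);
    step 23: apply the trained regression coefficients B and pick the top label."""
    z = x1 @ P1 + x2 @ P2              # potential features of the image to be classified
    scores = z @ B                     # continuous output label representation
    return int(np.argmax(scores))      # index of the highest-scoring emotion label

# Toy "trained" parameters in which view 1 alone determines the label.
P1, P2 = np.eye(2), np.zeros((3, 2))
B = np.array([[1.0, 0.0], [0.0, 1.0]])
print(classify(np.array([0.2, 0.9]), np.zeros(3), P1, P2, B))  # 1
```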
The fused features have strong distinguishing capability, so that the performance of the image classification model can be improved.
Fig. 3 is a schematic structural diagram of some embodiments of a training apparatus for image classification models according to the present disclosure.
As shown in fig. 3, the training apparatus 300 of this embodiment includes: a memory 310 and a processor 320 coupled to the memory 310, the processor 320 configured to perform the training method of any of the foregoing embodiments based on instructions stored in the memory 310.
Memory 310 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs.
Training device 300 may also include input-output interface 330, network interface 340, storage interface 350, and the like. These interfaces 330, 340, 350 and the memory 310 and the processor 320 may be connected, for example, by a bus 360. The input/output interface 330 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 340 provides a connection interface for various networking devices. The storage interface 350 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.