CN111209962A - Combined image classification method based on CNN feature extraction network and combined heat map feature regression - Google Patents

Combined image classification method based on CNN feature extraction network and combined heat map feature regression

Info

Publication number
CN111209962A
CN111209962A (application CN202010008389.XA; granted as CN111209962B)
Authority
CN
China
Prior art keywords
feature
heat map
cnn
regression
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010008389.XA
Other languages
Chinese (zh)
Other versions
CN111209962B (en)
Inventor
陈波 (Chen Bo)
邓媛丹 (Deng Yuandan)
吴思璠 (Wu Sifan)
冯婷婷 (Feng Tingting)
张勇 (Zhang Yong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010008389.XA priority Critical patent/CN111209962B/en
Publication of CN111209962A publication Critical patent/CN111209962A/en
Application granted granted Critical
Publication of CN111209962B publication Critical patent/CN111209962B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a joint image classification method based on a CNN feature extraction network and joint heat map feature regression. The method comprises: obtaining image information and preprocessing it; training a CNN-based feature extraction classification network model with heat map feature distribution to obtain a first prediction result; training a feature heat map regression network model to obtain a second prediction result; constructing a joint regression network model; splicing the first and second prediction results to calculate the final prediction probability; and outputting the classification result. The invention adopts a joint regression network architecture that combines the CNN-based feature extraction classification neural network with heat map feature distribution and the feature heat map regression network, which helps the CNN-based classification network with heat map feature distribution improve its generalization capability and stability, and further improves image classification accuracy.

Description

Combined image classification method based on CNN feature extraction network and combined heat map feature regression
Technical Field
The invention relates to the technical field of image processing, and in particular to a joint image classification method that mainly employs a CNN-based feature extraction network with heat map feature distribution, assisted by a feature heat map regression network.
Background
In general, a typical convolutional neural network mainly comprises convolutional layers, pooling layers, and fully connected layers. Convolutional and pooling layers are combined into several convolution groups that extract features layer by layer; classification is then completed by several fully connected layers, while the pooling layers mainly perform data dimensionality reduction.
MesoNet, whose Chinese translation means "pocket network", is characterized by few network parameters and relatively high accuracy with few training rounds, but weaker stability and generalization capability. After the first three typical traditional convolutional layers are replaced by two Inception layers, training becomes slightly slower, but the results are more stable and reach higher accuracy; the network at this point is called MesoNet_InceptionV4.
The main feature of Inception is better utilization of the computational resources within the network, achieved by a carefully crafted design that allows increasing the depth and width of the network while keeping the computational budget unchanged. To optimize quality, the architectural decisions are based on the Hebbian principle and multi-scale processing. Filters of multiple sizes are run at the same level, making the network wider rather than deeper, so that more detailed features can be preserved.
The existing methods for classifying DeepFakes pictures comprise the following three: the white-box method, the black-box method, and a combination of the two.
1) The white-box approach typically makes explicit the characteristics that distinguish a real image from a DeepFakes image, such as differences in biological signals, blink detection, blockchains and smart contracts, and visual characteristics, and uses these as the decision criteria.
2) The black-box method, as the name implies, gives a decision without exposing its internal structure. A black-box method is generally a binary classification neural network built on a CNN or an RNN.
3) The black and white box combination method is to obtain intermediate results with obvious differences by using the white box method mentioned above, and send the intermediate results into the black box method mentioned above for further distinction.
The white-box methods described above hinge on specific picture features, such as blink detection for original videos that lack blinking, or visual feature detection for news videos. In contrast, black-box methods mostly perform end-to-end classification and rely on trained models, data sets, specific faces, and the like.
Disclosure of Invention
Aiming at problems such as overfitting during training and low image classification accuracy in the feature extraction networks adopted by existing image classification methods, the invention provides an image classification method using a joint image classification network that mainly employs a CNN-based feature extraction network with heat map feature distribution, assisted by a feature heat map regression network.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
a CNN feature extraction network-based combined image classification method based on combined heat map feature regression comprises the following steps:
s1, acquiring true and false images to be classified, and preprocessing the images;
s2, constructing a CNN-based feature extraction classification network model with heat map feature distribution, and training the model by using the training set image preprocessed in the step S1 to obtain a first prediction result;
s3, constructing a feature heat map regression network model, and training the model by using the training set images preprocessed in the step S1 and the feature images obtained in the step S2 to obtain a second prediction result;
s4, constructing a joint regression network model, splicing the first prediction result obtained in the step S2 and the second prediction result obtained in the step S3, and calculating the final prediction probability through the joint regression network model;
and S5, performing distance measurement on the final prediction probability obtained in the step S4 and the real label by adopting a two-class cross entropy loss function, and outputting a classification result.
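The flow of steps S1 to S5 can be sketched as a composition of three stage functions. This is an illustrative pure-Python outline, not the patent's implementation; all names (preprocess, cnn_branch, heatmap_branch, joint_head) are hypothetical placeholders for the trained networks.

```python
def classify_image(image, cnn_branch, heatmap_branch, joint_head):
    """Sketch of the S1-S5 pipeline with injected stage functions."""
    x = preprocess(image)                   # S1: acquire and preprocess
    pred1, feature_map = cnn_branch(x)      # S2: first prediction + feature image
    pred2 = heatmap_branch(x, feature_map)  # S3: heat map feature regression
    prob = joint_head(pred1, pred2)         # S4: splice and jointly regress
    return 1 if prob >= 0.5 else 0          # S5: thresholded class label

def preprocess(image):
    # Placeholder: the real step resizes to 256x256 and normalizes.
    return image

# Dummy stage functions standing in for the trained networks.
label = classify_image(
    [0.2, 0.8],
    cnn_branch=lambda x: (0.7, x),
    heatmap_branch=lambda x, f: 0.6,
    joint_head=lambda p, q: (p + q) / 2,
)
```

The split into injectable stages mirrors the patent's separation of the classification branch, the heat map regression branch, and the joint regression head.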
Further, the step S2 is specifically:
The CNN-based feature extraction classification network model with heat map feature distribution is trained on the training-set images preprocessed in step S1 to learn the image feature information distinguishing true from false images. The model converts the input image into a feature image, which is output to the feature heat map regression network; meanwhile, the feature image passes through the first and second fully connected layers of the classification network with heat map features and then through a softmax layer to obtain the first prediction result.
Further, the CNN-based feature extraction classification network model with heat map feature distribution is provided with a ReSize layer after the image information is input, the ReSize layer being used to scale the picture to a size suitable for the later mapping of neuron feature values.
Further, the first three groups of convolutional layers, ReLU layers, batch normalization layers and max pooling layers in the CNN-based feature extraction classification network model with heat map feature distribution are replaced with two groups of Inception layers, each group comprising seven parallel convolutional layers and one batch normalization layer.
Further, a LeakyReLU activation function is added after the ReLU layer in the CNN-based feature extraction classification network model with heat map feature distribution, and the dropout layer before the second fully connected layer is removed.
Further, the feature heat map regression network model comprises the feature extraction part of the CNN-based feature extraction classification network model with heat map feature distribution, a face recognition and localization library, and a linear regression layer module.
Further, the step S3 is specifically:
First, the feature extraction part of the CNN-based feature extraction classification network model with heat map feature distribution extracts the neuron feature values. Then, the face recognition and localization library performs face localization on the training-set images preprocessed in step S1 to obtain the eye-region coordinates; these are scaled and mapped onto the extracted neuron feature values to obtain the feature value at each eye-region coordinate point. The maximum value is taken to compute the relative size among the feature values, and the second prediction result is output through the linear regression layer module.
Further, the step S4 is specifically:
The first prediction result output by the CNN-based feature extraction classification network model with heat map feature distribution is spliced with the second prediction result output by the feature heat map regression network model; a second linear regression layer produces a true/false score for the image, which is passed through a softmax layer to obtain the final prediction probability.
The invention also discloses a storage medium which stores computer instructions, and the computer instructions execute the steps of the combined image classification method based on the CNN feature extraction network of the combined heat map feature regression when running.
The invention also discloses a terminal, which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of running on the processor, and the processor executes the steps of the joint image classification method based on the joint heat map feature regression and based on the CNN feature extraction network when running the computer instructions.
The invention has the beneficial effects that: the invention adopts a joint regression network architecture combining the CNN-based feature extraction classification neural network with heat map feature distribution and the feature heat map regression network, which helps the CNN-based classification network with heat map feature distribution improve its generalization capability and stability, and further improves image classification accuracy.
Drawings
FIG. 1 is a schematic flow chart of a joint image classification method based on a CNN feature extraction network of the joint heat map feature regression of the present invention;
fig. 2 is a schematic structural diagram of a joint image classification network based on a CNN feature extraction network in joint heat map feature regression in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, an embodiment of the present invention provides a method for classifying joint images based on a CNN feature extraction network by joint heat map feature regression, including the following steps S1 to S5:
s1, acquiring true and false images to be classified, and preprocessing the images;
In this embodiment, the DeepFakes image data set of FaceForensics++ is taken as an example; it collects 175 forged videos from different platforms, with a lowest resolution of 854 × 480 pixels. All videos were compressed with the H.264 codec but at different compression levels to provide realistic analysis conditions. All faces were extracted using the Viola-Jones detector and aligned to facial landmarks using a trained neural network. To balance the distribution of the faces, the number of frames selected per video is proportional to the angle of the target face and the illumination change.
The invention divides the acquired DeepFakes image data set into a training set and a testing set, and preprocesses images of the training set and the testing set, comprising the following steps:
(1) To fit the input structure of the network, the images are resized; the size after conversion is 256 × 256, with three channels retained.
(2) The images undergo array → tensor data-type conversion and min-max normalization.
In addition to these two preprocessing operations, the test-set images also undergo a normalization operation: the three channels, originally in the interval [0, 1], are transformed into the interval [−1, 1].
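The min-max normalization and the [0, 1] → [−1, 1] transform described above can be sketched in pure Python (resizing to 256 × 256 is omitted here, since it would be done by an image library; the function names are illustrative):

```python
def min_max_normalize(values):
    """Scale a flat list of pixel values into [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                     # constant image: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def to_signed_range(values):
    """Map values from [0, 1] to [-1, 1], as applied to the test-set images."""
    return [2.0 * v - 1.0 for v in values]

pixels = [0, 64, 128, 255]
unit = min_max_normalize(pixels)   # values now in [0, 1]
signed = to_signed_range(unit)     # values now in [-1, 1]
```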
S2, constructing a CNN-based feature extraction classification network model with heat map feature distribution, and training the model by using the training set image preprocessed in the step S1 to obtain a first prediction result;
In this embodiment, the CNN-based feature extraction classification network model with heat map feature distribution includes a pre-trained VGG-series model or a MesoNet_InceptionV4 model with heat map features; the neuron outputs before the fully connected layer of the classification network with heat map features show obvious features at the eyes and mouth. The following description takes MesoNet_InceptionV4 as an example.
The CNN-based feature extraction classification network model with heat map feature distribution is provided with a ReSize layer after the image information is input, the ReSize layer being used to scale the picture to a size suitable for the later mapping of neuron feature values.
In the MesoNet network, the first three groups of conventional convolutional layers, each group comprising a convolutional layer, a ReLU layer, a batch normalization layer and a max pooling layer, are replaced with two groups of Inception layers, each group comprising seven parallel convolutional layers and one batch normalization layer.
In the MesoNet network, a second activation function, LeakyReLU, is added after the ReLU layer, and the dropout layer before the second fully connected layer is removed to keep more feature information. The feature extraction part of the CNN-based feature extraction classification network with heat map feature distribution extracts the neuron feature values after the sixteenth convolutional layer.
Based on the improved CNN-based feature extraction classification network model with heat map feature distribution, step S2 specifically includes:
and (4) training by using the CNN-based feature extraction classification network model with heat map feature distribution constructed by the training set image preprocessed in the step (S1), learning image feature information between true and false images, converting the input image into a feature image and outputting the feature image to a feature heat map regression network, and outputting the feature image to obtain a first prediction result through a softmax layer after passing through a first full-connection layer and a second full-connection layer of the classification network with heat map features.
Referring to the CNN-based feature extraction classification network model with heat map feature distribution in fig. 2, which begins with the first and second consecutive Inception layers, an Inception layer can be expressed as:
y_i = (a_i, b_i, c_i, d_i) · x_i
x_{i+1} = f(y_i)
where x_i and y_i denote the input and output of the i-th Inception layer; a_i, b_i, c_i, d_i are the four hyper-parameters (parallel branches) of the Inception layer; and f is the activation function, chosen in the present invention as the ReLU function, f(y_i) = ReLU(y_i) = max(0, y_i).
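As a toy illustration of the formula above, the four branches a_i, b_i, c_i, d_i can be modelled as parallel functions whose outputs are concatenated (superimposed) before the ReLU activation. The scalar "filters" below are placeholders, not real convolutions:

```python
def relu(vec):
    """Elementwise ReLU: max(0, y)."""
    return [max(0.0, v) for v in vec]

def inception_layer(branches, x):
    """y_i = (a_i, b_i, c_i, d_i) x_i: run parallel branches on the same
    input and concatenate their outputs along the channel dimension."""
    y = []
    for branch in branches:
        y.extend(branch(x))
    return y

# Toy scalar 'filters' standing in for the four convolutional branches.
branches = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v - 1.0 for v in x],
    lambda x: [v * v for v in x],
    lambda x: [-v for v in x],
]
y = inception_layer(branches, [1.0, -1.0])
x_next = relu(y)   # x_{i+1} = f(y_i) with f = ReLU
```

The widening effect is visible in the shapes: a 2-element input becomes an 8-element output because the four branch outputs are stacked side by side.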
The idea of this module is to superimpose the outputs of several convolutional layers with different kernel shapes, thereby increasing the function space for model optimization.
To avoid overly high-level semantics, the invention uses 3 × 3 dilated convolutions to replace the 5 × 5 convolutions of the original module, adds a 1 × 1 convolution before the dilated convolution for dimensionality reduction, and adds a 1 × 1 convolution as a skip connection between consecutive modules.
The ReLU layer and the LeakyReLU layer are connected in sequence to introduce nonlinearity and improve generalization, and the second batch normalization layer adjusts the output to prevent the gradient from vanishing.
The features extracted from the sixteenth convolutional layer are stored for mapping with the facial anchor points in step S3.
The dropout layer before the first and second fully connected layers is used for tuning and increasing robustness. The input image is converted into a feature image, which is output to the feature heat map regression network. Meanwhile, the feature image passes through the first and second fully connected layers to obtain a prediction score, denoted batch × (score1_true, score1_false).
The prediction score passes through a softmax layer to obtain the prediction probability, and the index result, namely prediction result 1 in fig. 1, is obtained via PyTorch:
batch × (1 − p, p) = Softmax(batch × (score1_true, score1_false))
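The softmax step that turns the pair of raw scores into the probabilities (1 − p, p) can be written, for a single image, as follows (a standard numerically-stable sketch, not the patent's code):

```python
import math

def softmax(scores):
    """exp(s_i) / sum_j exp(s_j), with the max subtracted for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# (score1_true, score1_false) -> (1 - p, p) for one image; scores are made up.
one_minus_p, p = softmax([1.2, -0.4])
```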
The first and second Inception layers of the MesoNet_InceptionV4 network learn the image feature information between the true and false versions of the input image and superimpose the outputs; after the sixteenth convolutional layer, the input image is converted into a feature image and output to the feature heat map regression model, and prediction result 1, the image true/false score obtained from the second fully connected layer, is saved.
S3, constructing a feature heat map regression network model, and training the model by using the training set images preprocessed in the step S1 and the feature images obtained in the step S2 to obtain a second prediction result;
In this embodiment, the feature heat map regression network model comprises the feature extraction part of the CNN-based feature extraction classification network model with heat map feature distribution, a face recognition and localization library, and a linear regression layer module.
The feature extraction part in the CNN-based feature extraction classification network model with the heat map feature distribution is used to extract neuron feature values after the sixteenth convolution layer.
The face recognition and localization library performs face recognition and face-coordinate localization; taking the Dlib face recognition and localization library as an example, it is used to map the neuron feature values at the eyes and mouth.
The linear regression layer module is used for receiving the eye and mouth neuron characteristic values of the Dlib face recognition positioning library and obtaining a second prediction result through the first linear regression layer.
Based on the feature heat map regression network model, step S3 specifically includes:
First, the feature extraction part of the CNN-based feature extraction classification network model with heat map feature distribution extracts the neuron feature values. Then, the face recognition and localization library performs face localization on the training-set images preprocessed in step S1 to obtain the eye-region coordinates; these are scaled and mapped onto the extracted neuron feature values to obtain the feature value at each eye-region coordinate point. The maximum value is taken to compute the relative size among the feature values, and the second prediction result is output through the linear regression layer module.
Referring to the feature heat map regression network model in fig. 2, the Dlib face recognition and localization library uses ResNet-34 to generate a 128-dimensional vector from the face and performs distance calculations in this space to locate the 68 facial landmark points. The invention obtains the coordinate information of the eyes and mouth from this 68-point localization: the left eye is points 36 to 41 ([36:42]), the right eye is points 42 to 47 ([42:48]), and the mouth is points 48 to 54 ([48:54]). The centroid coordinates of each part are extracted; taking the right eye as an example:
cen_right = centroid(points[42:48])
The three groups of point coordinates are scaled and mapped onto the feature image obtained in step S2 to obtain three feature values, which are input into the linear regression layer, yielding a set of heat-map-predicted scores:
batch × (score2_true, score2_false) = Linear(left_value, right_value, mouth_value)
As in step S2, the score result from the above formula passes through a softmax layer to give the result probability of the heat map regression model:
batch × (1 − q, q) = Softmax(batch × (score2_true, score2_false))
According to the Dlib module characteristics in the feature heat map regression model, six point coordinates are selected from each of the left eye, the right eye and the mouth. The centroid coordinates of the three regions are computed to ensure that the maximum value of each region is captured. The coordinate values are scaled proportionally and mapped onto the feature image output by the sixteenth convolutional layer (received by the feature-value mapping layer) to obtain the corresponding feature values, and the relative maximum value of each of the three regions is taken out. Passing these through the first linear regression layer yields prediction result 2, the true/false score of the image.
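The centroid, scaling/mapping and per-region maximum described above can be sketched for one region as follows; the 4 × 4 feature map and the right-eye landmark coordinates are made-up illustrative values:

```python
def centroid(points):
    """Centroid of a list of (x, y) landmark coordinates."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def map_to_grid(point, image_size, grid_size):
    """Scale an image-space coordinate onto a smaller feature-map grid."""
    scale = grid_size / image_size
    col = min(grid_size - 1, int(point[0] * scale))
    row = min(grid_size - 1, int(point[1] * scale))
    return row, col

def region_peak(feature_map, points, image_size):
    """Maximum feature value over the grid cells hit by a region's landmarks."""
    grid_size = len(feature_map)
    cells = {map_to_grid(p, image_size, grid_size) for p in points}
    return max(feature_map[r][c] for r, c in cells)

# Made-up 4x4 feature map for a 256x256 input image.
fmap = [[0.0, 0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6, 0.7],
        [0.8, 0.9, 1.0, 1.1],
        [1.2, 1.3, 1.4, 1.5]]
right_eye = [(130, 70), (150, 70), (170, 80), (150, 90)]
cen = centroid(right_eye)                       # centroid of the eye landmarks
peak = region_peak(fmap, right_eye, image_size=256)
```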
S4, constructing a joint regression network model, splicing the first prediction result obtained in the step S2 and the second prediction result obtained in the step S3, and calculating the final prediction probability through the joint regression network model;
In this embodiment, the joint regression network model includes a second linear regression layer, which takes as input the spliced first and second prediction results and outputs the final picture prediction result.
Based on the joint regression network model, step S4 specifically includes:
The first prediction result output by the CNN-based feature extraction classification network model with heat map feature distribution is spliced with the second prediction result output by the feature heat map regression network model; a second linear regression layer produces a true/false score for the image, which is passed through a softmax layer to obtain the final prediction probability.
The above true-false graph score is expressed as:
final_pred_val=Linear(batch*(p,q))
The result is passed through a sigmoid layer to obtain the final probability, where sigmoid is the normalization function:
σ(x) = 1 / (1 + e^(−x))
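A minimal sketch of the final joint step: splice the two branch probabilities p and q, pass them through a linear layer, then a sigmoid. The fixed weights and bias here are illustrative placeholders, since in the patent the linear layer's weights are learned:

```python
import math

def sigmoid(x):
    """Normalization function: sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def joint_prediction(p, q, weights=(1.0, 1.0), bias=0.0):
    """Splice the two branch probabilities, apply a linear layer, then sigmoid.
    The weights and bias are illustrative constants, not learned values."""
    score = weights[0] * p + weights[1] * q + bias
    return sigmoid(score)

final_pred_val = joint_prediction(0.8, 0.9)
```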
S5, a binary cross-entropy loss function is used to measure the distance between the final prediction probability obtained in step S4 and the real label, and the classification result is output; the loss is expressed as:
Loss = BCELoss(final_pred_val, label)
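The binary cross-entropy distance used in step S5 can be written, for a single prediction, as follows (a standard formulation; the clamping epsilon is an implementation detail added here to avoid log(0)):

```python
import math

def bce_loss(pred, label, eps=1e-12):
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    pred = min(max(pred, eps), 1.0 - eps)   # clamp to keep log() finite
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))

loss_good = bce_loss(0.99, 1)   # confident and correct: small loss
loss_bad = bce_loss(0.01, 1)    # confident and wrong: large loss
```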
to further illustrate the optimization effect of the method of the present invention, the CC data set and the CIFAR-100 data set were used for image classification and original image reconstruction experiments.
The experiment was trained on a GTX 1060Ti PC, with Adam selected as the optimizer and default parameters β1 = 0.9, β2 = 0.999. The initial learning rate is set to 10^−3 and decreases by 10% every 1000 iterations until it reaches 10^−6. To improve generalization and robustness, the input batch was subjected to several slight stochastic transformations, including scaling, rotation, horizontal flipping, and brightness and hue variation.
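The stated schedule, a 10% decay every 1000 iterations from 10^−3 down to a floor of 10^−6, can be expressed as a small helper (a sketch under the assumption that the decay is applied as a stepwise multiplication):

```python
def learning_rate(iteration, initial=1e-3, decay=0.9, step=1000, floor=1e-6):
    """Stepwise-decayed learning rate: 10% lower every `step` iterations,
    never falling below `floor`."""
    lr = initial * (decay ** (iteration // step))
    return max(lr, floor)
```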
The model is trained for 15 to 30 epochs; each epoch takes about 30 minutes, and the Top-1 test accuracy reaches 92.37%.
At test time, the same data set was evaluated on the original Meso_Inception4 network for comparison. The results show that the original Meso_Inception4 network does not perform as well as the model with the heat map regression layer added.
On the DeepFakes data set of FaceForensics++, the accuracy of the tested models is shown in Table 1.
TABLE 1: Test accuracy comparison on the FaceForensics++ DeepFakes data set

Model                             Accuracy
Original Meso_Inception network   87.30%
The invention                     91.37%
The method not only achieves an obvious performance improvement on the DeepFakes data set in FaceForensics++, but also analyses the current Meso_Inception network and discovers and exploits the heat map regularity of Meso_Inception. By combining a heat map regression neural network, the problem of unstable training results in existing neural networks is alleviated, and a novel network architecture is proposed: a joint image classification network that mainly uses a CNN-based feature extraction network with heat map feature distribution, assisted by a feature heat map regression network, thereby improving image classification accuracy.
In an alternative embodiment of the present invention, based on the same inventive concept of the above-mentioned embodiment, the present invention further includes a storage medium having stored thereon computer instructions that, when executed, perform the steps of the CNN feature extraction network-based joint image classification method of joint heat map feature regression described above.
In an optional embodiment of the present invention, based on the same inventive concept of the above-mentioned embodiment, the present invention further includes a terminal, including a memory and a processor, where the memory stores thereon computer instructions executable on the processor, and the processor executes the computer instructions to perform the steps of the above-mentioned CNN feature extraction network-based joint image classification method of joint heat map feature regression.
Based on such understanding, the technical solution of the present embodiment, or parts of it, may essentially be implemented in the form of a software product stored on a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (10)

1. A CNN feature extraction network-based combined image classification method based on combined heat map feature regression is characterized by comprising the following steps:
s1, acquiring true and false images to be classified, and preprocessing the images;
s2, constructing a CNN-based feature extraction classification network model with heat map feature distribution, and training the model by using the training set image preprocessed in the step S1 to obtain a first prediction result;
s3, constructing a feature heat map regression network model, and training the model by using the training set images preprocessed in the step S1 and the feature images obtained in the step S2 to obtain a second prediction result;
s4, constructing a joint regression network model, splicing the first prediction result obtained in the step S2 and the second prediction result obtained in the step S3, and calculating the final prediction probability through the joint regression network model;
and S5, performing distance measurement on the final prediction probability obtained in the step S4 and the real label by adopting a two-class cross entropy loss function, and outputting a classification result.
2. The joint image classification method based on CNN feature extraction network of joint heat map feature regression of claim 1, wherein the step S2 is specifically:
The CNN-based feature extraction classification network model with heat map feature distribution is trained on the training-set images preprocessed in step S1 to learn the image feature information distinguishing true from false images. The model converts the input image into a feature image, which is output to the feature heat map regression network; meanwhile, the feature image passes through the first and second fully connected layers of the classification network with heat map features and then through a softmax layer to obtain the first prediction result.
3. The joint image classification method based on a CNN feature extraction network and joint heat map feature regression of claim 2, wherein the CNN-based feature extraction and classification network model with heat map feature distribution is provided with a ReSize layer after the image input, the ReSize layer scaling the images to a size suitable for the subsequent mapping of neuron feature values.
4. The joint image classification method based on a CNN feature extraction network and joint heat map feature regression of claim 3, wherein the first three groups of convolutional layers, ReLU layers, batch normalization layers and max pooling layers in the CNN-based feature extraction and classification network model with heat map feature distribution are replaced with two groups of Inception layers, each group of Inception layers comprising seven parallel convolutional layers and one batch normalization layer.
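One group of Inception layers as described in claim 4 (seven convolutional layers in parallel plus one batch normalization layer) might look like the following PyTorch sketch. The kernel sizes and channel counts are assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Sketch of one Inception group from claim 4: seven parallel
    convolutional branches whose outputs are concatenated, then one
    batch normalization layer. All hyperparameters are illustrative."""
    def __init__(self, in_ch, branch_ch=8):
        super().__init__()
        kernels = [1, 3, 5, 7, 3, 5, 1]          # assumed kernel sizes
        # padding=k//2 keeps the spatial size unchanged for odd kernels
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernels
        )
        self.bn = nn.BatchNorm2d(branch_ch * len(kernels))

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.bn(out)

block = InceptionBlock(3)
y = block(torch.randn(1, 3, 32, 32))   # -> 7 * 8 = 56 output channels
```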
5. The joint image classification method based on a CNN feature extraction network and joint heat map feature regression of claim 4, wherein a LeakyReLU activation function is added after the ReLU layer in the CNN-based feature extraction and classification network model with heat map feature distribution, and the dropout layer before the second fully connected layer is removed.
6. The joint image classification method based on a CNN feature extraction network and joint heat map feature regression of claim 5, wherein the feature heat map regression network model comprises the feature extraction section of the CNN-based feature extraction and classification network model with heat map feature distribution, a face recognition and localization library, and a linear regression layer module.
7. The joint image classification method based on a CNN feature extraction network and joint heat map feature regression of claim 6, wherein step S3 is specifically:
first extracting neuron feature values with the feature extraction section of the CNN-based feature extraction and classification network model with heat map feature distribution; then locating the face in the training set images preprocessed in step S1 with the face recognition and localization library to obtain the neuron feature values of the eye region; scaling and mapping these onto the extracted neuron feature values to obtain the feature value of each coordinate point of the eye region; taking the maximum value and computing its relative size among the feature values; and outputting the second prediction result through the linear regression layer module.
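The eye-region mapping in claim 7 can be illustrated with a NumPy sketch: a bounding box from a face-localization library (assumed to be external, e.g. a landmark detector) is scaled from image coordinates onto the smaller feature map, and the region's maximum activation is expressed relative to the map-wide maximum. The box coordinates and sizes below are invented for the example:

```python
import numpy as np

def eye_region_score(feature_map, eye_box, img_size):
    """Map an eye bounding box found in the input image onto the
    feature map, then return the region's maximum activation relative
    to the overall maximum (the 'relative size' of claim 7)."""
    fh, fw = feature_map.shape
    ih, iw = img_size
    x0, y0, x1, y1 = eye_box                       # eye box in image pixels
    # Scale image coordinates down to feature-map coordinates.
    c0, r0 = int(x0 * fw / iw), int(y0 * fh / ih)
    c1 = max(int(x1 * fw / iw), c0 + 1)            # keep region non-empty
    r1 = max(int(y1 * fh / ih), r0 + 1)
    region = feature_map[r0:r1, c0:c1]
    return region.max() / (feature_map.max() + 1e-12)

fmap = np.zeros((16, 16))
fmap[4, 5] = 2.0      # activation inside the hypothetical eye region
fmap[12, 12] = 4.0    # stronger activation elsewhere in the map
score = eye_region_score(fmap, eye_box=(70, 55, 110, 80), img_size=(256, 256))
```

In the patented pipeline this scalar would feed the linear regression layer module that produces the second prediction result; here it is computed in isolation.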
8. The joint image classification method based on a CNN feature extraction network and joint heat map feature regression of claim 7, wherein step S4 is specifically:
splicing the first prediction result output by the CNN-based feature extraction and classification network model with heat map feature distribution with the second prediction result output by the feature heat map regression network model, obtaining a real/fake score for the image through a second linear regression layer, and passing that score through a softmax layer to obtain the final prediction probability.
9. A storage medium having stored thereon computer instructions which, when executed, perform the steps of the joint image classification method based on a CNN feature extraction network and joint heat map feature regression according to any one of claims 1 to 8.
10. A terminal comprising a memory and a processor, the memory storing computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the joint image classification method based on a CNN feature extraction network and joint heat map feature regression according to any one of claims 1 to 8.
CN202010008389.XA 2020-01-06 2020-01-06 Combined image classification method based on CNN feature extraction network and combined heat map feature regression Active CN111209962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010008389.XA CN111209962B (en) 2020-01-06 2020-01-06 Combined image classification method based on CNN feature extraction network and combined heat map feature regression

Publications (2)

Publication Number Publication Date
CN111209962A true CN111209962A (en) 2020-05-29
CN111209962B CN111209962B (en) 2023-02-03

Family

ID=70786680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010008389.XA Active CN111209962B (en) Combined image classification method based on CNN feature extraction network and combined heat map feature regression

Country Status (1)

Country Link
CN (1) CN111209962B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220635A (en) * 2017-06-21 2017-09-29 北京市威富安防科技有限公司 Human face in-vivo detection method based on many fraud modes
CN108596274A (en) * 2018-05-09 2018-09-28 国网浙江省电力有限公司 Image classification method based on convolutional neural networks
US20180373985A1 (en) * 2017-06-23 2018-12-27 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
CN109241982A (en) * 2018-09-06 2019-01-18 广西师范大学 Object detection method based on depth layer convolutional neural networks
CN109325549A (en) * 2018-10-25 2019-02-12 电子科技大学 A kind of facial image fusion method
CN109670573A (en) * 2017-10-13 2019-04-23 斯特拉德视觉公司 Utilize the learning method and learning device of the parameter of loss increase adjustment CNN and the test method and test device that use them
CN110348482A (en) * 2019-06-05 2019-10-18 华东理工大学 A kind of speech emotion recognition system based on depth model integrated architecture
CN110800062A (en) * 2017-10-16 2020-02-14 因美纳有限公司 Deep convolutional neural network for variant classification
CN111222457A (en) * 2020-01-06 2020-06-02 电子科技大学 Detection method for identifying video authenticity based on depth separable convolution
US20210241096A1 (en) * 2018-04-22 2021-08-05 Technion Research & Development Foundation Limited System and method for emulating quantization noise for a neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ALEX KRIZHEVSKY; ILYA SUTSKEVER; GEOFFREY E. HINTON: "ImageNet classification with deep convolutional neural networks", 《COMMUNICATIONS OF THE ACM》 *
BABENKO BORIS,YANG MING-HSUAN,BELONGIE SERGE: "Robust Object Tracking with Online Multiple Instance Learning", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
L. MINH DANG; SYED IBRAHIM HASSAN; SUHYEON IM; HYEONJOON MOON: "Face image manipulation detection based on a convolutional neural network", 《EXPERT SYSTEMS WITH APPLICATIONS》 *
李婧婷: "Research on Robust Object Tracking Methods Based on Deep Networks", 《China Master's Theses Full-text Database》 *
沉迷学习的糕糕: "A Complete Guide to Convolutional Neural Networks, Ultimate Edition (II)", Zhihu column, https://zhuanlan.zhihu.com/p/28173972, 28 July 2017 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832463A (en) * 2020-07-07 2020-10-27 哈尔滨理工大学 Deep learning-based traffic sign detection method
CN111967527A (en) * 2020-08-21 2020-11-20 菏泽学院 Peony variety identification method and system based on artificial intelligence
CN111967424A (en) * 2020-08-27 2020-11-20 西南大学 Buckwheat disease identification method based on convolutional neural network
CN112102942A (en) * 2020-09-24 2020-12-18 石家庄喜高科技有限责任公司 Skeletal development grade detection method and terminal equipment
CN112102942B (en) * 2020-09-24 2024-04-05 石家庄喜高科技有限责任公司 Skeletal development grade detection method and terminal equipment
CN114186632A (en) * 2021-12-10 2022-03-15 北京百度网讯科技有限公司 Method, device, equipment and storage medium for training key point detection model
CN115188474A (en) * 2022-07-13 2022-10-14 广东食品药品职业学院 Neural network-based blood glucose concentration prediction and hyperglycemia and hypoglycemia early warning method
CN115331220A (en) * 2022-07-29 2022-11-11 江苏迪赛特医疗科技有限公司 Cervical candida infection screening method based on target detection and heat map regression
CN115331220B (en) * 2022-07-29 2024-04-02 江苏迪赛特医疗科技有限公司 Candida cervicales infection screening method based on target detection and heat map regression
CN116340807A (en) * 2023-01-10 2023-06-27 中国人民解放军国防科技大学 Broadband spectrum signal detection and classification network
CN116340807B (en) * 2023-01-10 2024-02-13 中国人民解放军国防科技大学 Broadband Spectrum Signal Detection and Classification Network

Also Published As

Publication number Publication date
CN111209962B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
CN111209962B (en) Combined image classification method based on CNN feature extraction network and combined heat map feature regression
US11195051B2 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
Shao et al. Feature learning for image classification via multiobjective genetic programming
US20220100793A1 (en) Method for retrieving footprint images
Deng et al. Saliency detection via a multiple self-weighted graph-based manifold ranking
Feng et al. Triplet distillation for deep face recognition
CN108921047B (en) Multi-model voting mean value action identification method based on cross-layer fusion
Zhang et al. Dual mutual learning for cross-modality person re-identification
Yan et al. Multi-scale deep relational reasoning for facial kinship verification
CN113989890A (en) Face expression recognition method based on multi-channel fusion and lightweight neural network
Wu et al. GoDP: Globally Optimized Dual Pathway deep network architecture for facial landmark localization in-the-wild
CN111666852A (en) Micro-expression double-flow network identification method based on convolutional neural network
Wang et al. Research on face recognition technology based on PCA and SVM
Wang et al. BS-SiamRPN: Hyperspectral video tracking based on band selection and the Siamese region proposal network
CN116110089A (en) Facial expression recognition method based on depth self-adaptive metric learning
Zhong et al. Exploring features and attributes in deep face recognition using visualization techniques
CN116311472A (en) Micro-expression recognition method and device based on multi-level graph convolution network
Wu et al. Salient object detection based on global to local visual search guidance
CN113963421B (en) Dynamic sequence unconstrained expression recognition method based on hybrid feature enhanced network
Bhattacharya et al. Simplified face quality assessment (sfqa)
Dsouza et al. Real Time Facial Emotion Recognition Using CNN
CN107016675A (en) A kind of unsupervised methods of video segmentation learnt based on non local space-time characteristic
Tomar et al. HHFER: A Hybrid Framework for Human Facial Expression Recognition
Li et al. Recognition algorithm of athletes' partially occluded face based on a deep learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant