CN112308089A - Attention mechanism-based capsule network multi-feature extraction method - Google Patents
- Publication number
- CN112308089A (application number CN201910689204.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- layer
- capsule
- image
- proofreading
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/40 — Extraction of image or video features (G—Physics; G06—Computing; G06V—Image or video recognition or understanding)
- G06N3/044 — Recurrent networks, e.g. Hopfield networks (G06N—Computing arrangements based on specific computational models; G06N3/04—Architecture)
- G06N3/045 — Combinations of networks (G06N3/04—Architecture)
- G06N3/084 — Backpropagation, e.g. using gradient descent (G06N3/08—Learning methods)
Abstract
The invention discloses an attention mechanism-based capsule network method for multi-feature recognition and extraction, comprising the following steps: (1) design an NCap network and use it to construct an attention-based capsule network framework; (2) input an image training set to the attention-mechanism capsule network, which, after training and learning, completes the recognition and extraction of image features and generates a corresponding optimal training model; (3) input an image to be recognized to the attention-mechanism capsule network, which loads the optimal network model and recognizes the image features; (4) the attention-mechanism capsule network outputs the recognition result for the image to be recognized. The invention proposes the idea of fusing a convolutional network mechanism with a capsule network structure under an attention mechanism, recording the relative position and orientation of the image while reducing the number of parameters during training, thereby effectively improving recognition efficiency and accuracy.
Description
Technical Field
The invention belongs to the technical field of image processing, and in particular relates to a multi-feature image recognition and classification method that combines an attention mechanism with a capsule network in the field of image recognition. The method can be used to extract and identify key feature information from images.
Background
In recent years, the fields of target recognition and feature extraction have moved from single-attribute toward multi-attribute recognition, and the growing maturity of these techniques has greatly promoted rapid innovation in re-identification technology. However, accurate multi-attribute recognition still faces difficulties such as excessively high pixel dimensionality, low resolution, and noise interference. At present, target recognition and feature extraction methods almost all rely on convolutional neural networks for training and learning. Convolutional layers are the central component of a convolutional neural network, and the size of the convolution kernel determines the richness of the feature map. To reduce the model's computational cost, a convolutional neural network shrinks the feature map through pooling operations, but pooling layers easily cause the loss of key information such as position and orientation. For example, if the relative positions of the eyes and mouth in a face image are rearranged, a convolutional neural network will still predict "face" with far higher confidence than "non-face", even though the image is no longer a valid face.
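To illustrate the pooling problem described above, the following toy NumPy sketch (illustrative only, not code from the invention) shows two feature maps whose strong activations sit at different positions, yet produce identical max-pooled outputs — the positional information is discarded:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling with stride 2 on a 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps: the same activation values, but placed at
# *different* positions inside each 2x2 pooling window.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 7],
              [0, 5, 0, 0],
              [0, 0, 3, 0]], dtype=float)
b = np.array([[0, 0, 7, 0],
              [0, 9, 0, 0],
              [0, 0, 0, 3],
              [5, 0, 0, 0]], dtype=float)

# Both pool to the identical 2x2 map: position inside each window is lost.
pooled_a = max_pool2x2(a)
pooled_b = max_pool2x2(b)
```

This is exactly why a pooled network cannot distinguish a correctly arranged face from a rearranged one.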
To address the loss of key information such as position and orientation during convolutional neural network recognition, Geoffrey Hinton, one of the founding figures of neural networks, proposed the capsule network as a solution. A capsule network takes vectors — encoding, for example, the distance and direction from a viewpoint to the image — as input and output, and then iteratively updates its parameters with a dynamic routing mechanism, although the computational cost of this is considerable. Compared with a convolutional neural network, a capsule network uses vector-valued data as input and output, which prevents the loss of key information such as the relative position and orientation of the image during training and improves recognition accuracy.
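The capsule non-linearity underlying this vector representation is the "squash" function from the original capsule-network formulation (Sabour et al., 2017). A minimal NumPy sketch of that published function — not code from this patent — is:

```python
import numpy as np

def squash(v, eps=1e-9):
    """Capsule 'squash' non-linearity: preserves the vector's direction
    (the pose information) while mapping its length into (0, 1) so the
    length can be read as the probability that the entity is present."""
    norm2 = np.sum(v * v, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * v / np.sqrt(norm2 + eps)

v = np.array([3.0, 4.0])   # length 5
s = squash(v)              # length 25/26 ~= 0.96, direction unchanged
```

Because the direction survives the non-linearity, the capsule can carry pose information (position, orientation) that a scalar activation cannot.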
In the patent document "A capsule network image classification and identification method with an improved reconstruction network" (application number CN201810509412.6, publication number CN108985316A), Southwest University proposed a capsule network image classification and recognition method that improves the reconstruction network. The method proceeds as follows. First, it builds a capsule network from a working network and a proofreading network, selects the vector of maximum magnitude output by the working network as the image classification result, and passes it to a margin loss to compute the deviation between the recognized and true results. The reconstruction network then restores the recognition result to an image and compares it with the input image to obtain a variance term. Finally, the deviation and the variance are summed and fed back to the working network, which continues training and learning toward accurate image recognition. By constructing a working network and a proofreading network, obtaining the deviation and variance, and feeding them back to the working network for dynamic tuning, the method improves recognition accuracy. However, it still has shortcomings: feeding the deviation and variance computed by the working and proofreading networks back to the working network inevitably affects computation speed and energy consumption, and the method is only suitable for recognizing images with a single feature attribute.
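The margin loss mentioned above is the standard capsule-network classification loss (Sabour et al., 2017, with the usual constants m+ = 0.9, m- = 0.1, lambda = 0.5). A NumPy sketch of that published formula — not of the cited patent's code — is:

```python
import numpy as np

def margin_loss(lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Capsule-network margin loss.
    lengths: per-class capsule output lengths ||v_k|| in [0, 1];
    targets: one-hot indicator T_k for the true class."""
    present = targets * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_neg) ** 2
    return float(np.sum(present + absent))

# A confident, correct prediction incurs zero loss:
good = margin_loss(np.array([0.95, 0.05]), np.array([1.0, 0.0]))
# A confident, wrong prediction is penalized:
bad = margin_loss(np.array([0.05, 0.95]), np.array([1.0, 0.0]))
```

The "deviation" fed back in the cited method is the value this loss produces for each training image.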
Xihui Liu et al. proposed an attention-based deep network, HydraPlus-Net, in their paper "HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis" (ICCV 2017). The method maps the image to be recognized to different feature layers through a multi-directional, multi-scale attention mechanism, captures both local and global information, and aggregates features according to the semantics of different layers. The network consists mainly of a main network and an attention network, both built on convolutional neural network structures with shared convolutional blocks; their outputs are concatenated and finally fused by global average pooling and fully connected layers, mapping to feature attributes for multi-attribute recognition or to feature vectors for re-identification. The method uses an attention network to recognize multi-feature image attributes and achieves good recognition results. However, it still has a shortcoming: both the main network and the attention network in its overall framework use mainstream convolutional neural network structures, whose fatal weakness is that pooling layers lose important information such as position and orientation, so the accuracy of multi-feature attribute recognition remains limited.
Disclosure of Invention
To address the low accuracy of multi-attribute feature recognition and the loss of important information such as position and orientation during convolutional neural network training — given that fields such as target recognition and feature extraction currently rely mainly on convolutional neural networks — the invention provides an attention mechanism-based capsule network method for multi-feature extraction and recognition. During training, the recognition region is dynamically adjusted, capsule network technology is fused with convolutional neural network technology, and computation errors between the capsule network and the convolutional network are compared and fed back, preserving important information such as the relative position and orientation of the image from the very source of training. This solves the problem of reduced recognition accuracy caused by losing important information while training on data.
To achieve this purpose, the invention starts from the "attention" region of the training data and places an adjustable convolutional layer between the input layer of the network architecture and the main and attention networks; this layer can dynamically adjust the deep, shallow, and local semantics of the recognized image so that specific regions are recognized dynamically. Within the main network and the attention network, an NCap network fusing a capsule network with a convolutional network is used. The attention mechanism-based capsule network multi-feature recognition and extraction method is realized by comparing and feeding back error parameters, dynamically adjusting recognition network weights, and computing the corresponding losses. The specific steps are as follows:
step 1: construct the attention-mechanism capsule network. The network comprises an input layer, an adjustable convolutional layer, a main network, an attention network, a global average weight, a fully connected structure, and an output layer. The input layer receives training and recognition data; the adjustable convolutional layer adjusts the semantic range of the recognized image; the main network extracts the overall semantics of a person image at different scales; the attention network extracts the shallow and local semantics of the person image at different scales; after converging, the main network and the attention network connect in turn to the global average weight, the fully connected structure, and the output layer. The main network comprises 3 serially connected NCap networks, whose input is connected to the data image input and whose output is connected to the global weight calculation;
the attention network comprises 3 branches, each of 3 cascaded NCap networks, where the input of each branch is connected to the image input and its output is connected to the global weight calculation;
the Ncap network comprises a working network and a proofreading network, wherein the working network is used for inputting an image and outputting an identification result of the image, and the proofreading network is used for comparing and feeding back training adjustment parameters to the working network;
the working network comprises a convolutional structure and a fully connected structure. The output of each convolutional layer in the convolutional structure is connected both to the corresponding pooling layer in the working network and to the capsule layer of the proofreading capsule in the proofreading network; the output of each pooling layer is connected both to the input of the next convolutional layer in the working network and to the proofreading layer of the proofreading capsule in the proofreading network; and the Nth pooling layer of the convolutional structure is connected to the weight calculation of the fully connected structure in the working network. The fully connected structure consists, in order, of a weight calculation layer and a fully connected layer;
the proofreading network comprises a proofreading capsule, a loss layer, and an optimization algorithm layer. The proofreading capsule comprises a capsule, a proofreading layer, and an image error loss. The capsule's input is connected to a convolutional layer of the working network; the proofreading layer's inputs are connected to the capsule's output and to a pooling layer of the working network; the proofreading layer's output is connected to the image error loss; the image error loss's outputs are connected to the next convolutional layer of the convolutional structure in the working network and to the loss layer in the proofreading network; and the input and output of the optimization algorithm layer are connected to the loss layer and the working network, respectively;
step 2: inputting an image training set to the attention mechanism capsule network, completing feature extraction of images after training and learning by the attention mechanism capsule network, and outputting an optimal network training model;
step 3: an image to be recognized is input to the attention-mechanism capsule network and the optimal network training model is loaded; the output of the working network is the obtained recognition feature;
step 4: the capsule network outputs the feature result of the image to be recognized.
Existing feature recognition network structures are convolutional neural networks, which easily lose important information such as the relative position and orientation of the image, while existing capsule networks incur heavy computation when performing vector transformations. The present design exploits the capsule network's ability to record important information such as relative position and orientation together with the convolutional neural network's low computational cost, ensuring both high accuracy during training and low energy consumption on the hardware running the network.
Further, the specific process of training the attention-based capsule network framework in step 2 is as follows:
s2.1, inputting the images in the image training set into an adjustable convolution layer, and obtaining multi-scale image information data D after convolution operation after adjusting the parameter t of the layer;
s2.2, the image information data D with a large-scale, shallow semantic range is passed to the main network composed of NCaps, which computes the global feature information I1 of the image;
S2.3, the image information data D with a small-scale, deep semantic range is passed to the NCaps forming the attention network, which computes the local feature information I2 of the image;
S2.4, the computed global feature information I1 and local feature information I2 are merged, input to the global weight calculation, passed through the fully connected operation, and the result is output as a classification.
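The data flow of steps S2.2-S2.4 can be sketched as follows. All names, feature sizes, and weights here are illustrative stand-ins (random values, 26 output classes matching the PA-100K attributes used in the embodiment), not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two branch outputs: I1 from the main network
# (global features), I2 from the attention network (local features).
I1 = rng.normal(size=256)
I2 = rng.normal(size=256)

# S2.4: merge the two branches, apply a (here random) fully connected
# layer, and emit one sigmoid score per attribute class.
merged = np.concatenate([I1, I2])                 # (512,)
W_fc = rng.normal(size=(26, merged.size)) * 0.01  # 26 attribute classes
logits = W_fc @ merged
probs = 1.0 / (1.0 + np.exp(-logits))             # multi-label output
```

Each of the 26 sigmoid outputs corresponds to one attribute, which is why the merged branches can be read out directly as a multi-attribute classification.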
The specific steps by which the NCap networks of the main network and the attention network train on the image information data are as follows:
q1, the image information data D is passed to the NCap working network; before the convolutional structure of the working network performs its convolution, the capsule in the proofreading capsule of the proofreading network records the position and orientation information of the image;
Q2, after the convolution and pooling operations of the working network's convolutional structure, the pooled image information data is passed to the proofreading layer in the proofreading capsule of the proofreading network;
Q3, the image information recorded in the capsule is compared with the pooled image information stored in the proofreading layer to obtain an image error loss;
q4, the operations of Q2 and Q3 are repeated, and the n resulting losses are cascaded to obtain the final loss;
q5, the final loss is optimized by an optimization algorithm and then fed back to the working network;
Q6, the working network adjusts the parameters of each layer in reverse order from back to front until its recognition accuracy stabilizes, completing the training and learning of the NCap network.
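Steps Q1-Q4 can be sketched as a loop. This is a toy, deterministic NumPy stand-in: the "convolution" is a shape-preserving blur, the capsule snapshot is simply the pre-convolution map, and the per-stage comparison rule is an assumption made for illustration, not the patent's actual proofreading computation:

```python
import numpy as np

rng = np.random.default_rng(1)

def toy_conv(x):
    """Shape-preserving stand-in for a convolutional layer (a small blur)."""
    return (x + np.roll(x, 1, axis=0) + np.roll(x, 1, axis=1)) / 3.0

def avg_pool2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

x = rng.normal(size=(16, 16))
losses = []
for stage in range(3):                      # three cascaded NCap stages
    snapshot = x.copy()                     # Q1: capsule records the pre-conv map
    y = toy_conv(x)                         # working network: convolution
    p = avg_pool2x2(y)                      # working network: pooling
    # Q3: the proofreading layer compares the (downsampled) snapshot with
    # the pooled result to produce this stage's image error loss.
    losses.append(float(np.mean(np.abs(avg_pool2x2(snapshot) - p))))
    x = p                                   # feed the next stage
final_loss = float(np.sum(losses))          # Q4: cascade the n losses
```

In the patent, the final loss then passes through the optimization algorithm layer (Q5) and drives back-to-front parameter updates in the working network (Q6).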
The invention has the following beneficial effects:
(1) The invention avoids losing important information such as the relative position and orientation of the image at the source of data training, and uses the attention mechanism to recognize specific regions, so richer and more complete feature information can be extracted, improving the accuracy of image classification and recognition.
(2) The network structure used by the invention, formed by fusing a capsule network with a convolutional network, is well suited to extracting and recognizing multiple person attributes and multiple features in current feature extraction and target recognition applications, and is particularly appropriate for fine-grained feature extraction.
(3) The method combines the advantages of the capsule network and the convolutional network: it alleviates the heavy computation of the capsule network as well as the convolutional network's loss of high-level feature information during deep learning, and it has general applicability.
Drawings
FIG. 1 is a flowchart of an overall attention-based capsule network multi-feature extraction network according to an embodiment of the present invention;
FIG. 2 is a principal network NCap network architecture of an embodiment of the present invention;
FIG. 3 is an overall framework of an attention-directed capsule network of an embodiment of the present invention;
FIG. 4 is an illustration of the effect of the embodiment of the invention on the overall network framework;
fig. 5 is an illustration of the effect of the NCap capsule network structure according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
To address the low accuracy of multi-attribute feature recognition and the loss of important information such as position and orientation during convolutional neural network training, the technical scheme adopted by the invention is as follows: first, a capsule network NCap is designed by combining the advantages of capsule and convolutional networks, and an attention mechanism-based capsule network framework is built from NCap; then image data is obtained from a public dataset and trained in the capsule network framework, after which the attention-mechanism capsule network completes the extraction and recognition of image features and generates an optimal training model; during recognition, the image to be recognized is input to the attention-mechanism capsule network and the optimal training model file is loaded, so the recognition result can be obtained and output. The flowchart is shown in fig. 1.
The implementation of the overall network framework of the embodiment comprises the following specific steps:
designing an NCap network, and constructing an attention capsule network framework by using the NCap network;
the NCap network is provided with a working network and a proofreading network as shown in FIG. 2, wherein the working network is used for inputting an image and outputting an identification result of the image, and the proofreading network is used for comparing and feeding back training adjustment parameters to the working network;
the working network comprises a convolutional structure and a fully connected structure. The output of each convolutional layer in the convolutional structure is connected both to the corresponding pooling layer in the working network and to the capsule layer of the proofreading capsule in the proofreading network; the output of each pooling layer is connected both to the input of the next convolutional layer in the working network and to the proofreading layer of the proofreading capsule in the proofreading network; and the Nth pooling layer of the convolutional structure is connected to the weight calculation of the fully connected structure in the working network. The fully connected structure consists, in order, of a weight calculation layer and a fully connected layer;
the proofreading network comprises a proofreading capsule, a loss layer, and an optimization algorithm layer. The proofreading capsule comprises a capsule, a proofreading layer, and an image error loss. The capsule's input is connected to a convolutional layer of the working network; the proofreading layer's inputs are connected to the capsule's output and to a pooling layer of the working network; the proofreading layer's output is connected to the image error loss; the image error loss's outputs are connected to the next convolutional layer of the convolutional structure in the working network and to the loss layer in the proofreading network; and the input and output of the optimization algorithm layer are connected to the loss layer and the working network, respectively;
the attention-mechanism capsule network framework is shown in fig. 3. The network comprises an input layer, an adjustable convolutional layer, a main network, an attention network, a global average weight, a fully connected structure, and an output layer. The input layer receives training and recognition data; the adjustable convolutional layer adjusts the semantic range of the recognized image; the main network extracts the overall semantics of a person image at different scales; the attention network extracts the shallow and local semantics of the person image at different scales; after converging, the main network and the attention network connect in turn to the global average weight, the fully connected structure, and the output layer;
the main network comprises 3 serially connected Ncap networks, the input end of the Ncap network is connected with the input end of the data image, and the output end of the Ncap network is connected with the global weight calculation;
the attention network comprises 3 layers and 3 cascaded Ncap networks, wherein the input end of the Ncap network is connected with the input end of an image, and the output end of the Ncap network is connected with the global weight calculation;
An image training set is input to the attention-mechanism capsule network; after training, the network completes the recognition and extraction of multi-feature person attributes in the images, and the optimal trained network model is saved;
the training identification image used in the present embodiment is a pedestrian re-identification data set PA-100K, which includes hundreds of thousands of images corresponding to 26 human identification attributes. 80% of the data set was used as the training set, 10% of the data set was used as the validation set, and 10% of the data set was used as the test set.
An example of the effect of each NCap network of the overall attention mechanism capsule network framework is shown in fig. 4, in which the specific steps of training the re-recognition data set are as follows:
s2.1, the image dataset PA-100K to be trained is input to the Conv-layers of the attention-mechanism capsule network framework; the Conv-layers set the granularity and range of recognition by setting the size of the corresponding convolution kernel;
S2.2, the shallow-semantic (large-granularity) image information I1 generated in step S2.1 is passed to the main network for training;
S2.3, the deep-semantic (small-granularity) image information I2-1, I2-2 and I2-3 generated in step S2.1 is passed to the attention network for training;
S2.4, the person attribute feature information output by the main network and the attention network undergoes global weight calculation and full connection, and the result is output as a classification.
In the above steps, the image data enters the NCap network in the main network and the attention network for training and learning, and the network effect is shown in fig. 5, which includes the following specific steps:
q1, the image information data I is passed to the NCap working network; before convolutional layer N of the working network's convolutional structure performs its convolution, the capsule in the proofreading capsule of the proofreading network records the position and orientation information D1 of the image;
Q2, after the convolution and pooling operations of the working network's convolutional structure, the pooled image information data is passed to the proofreading layer in the proofreading capsule of the proofreading network, yielding image information D2;
Q3, the image information data D1 and D2 produced in steps Q1 and Q2 are compared to obtain an image error LOSS, which is passed to convolutional layer N+1 in the working network and to the LOSS layer in the proofreading network;
q4, repeating the operations of Q2 and Q3, and cascading the obtained n losses to obtain the final loss;
q5, the final loss is optimized by an optimization algorithm and then fed back to the working network;
Q6, the working network adjusts the parameters of each layer in reverse order from back to front until its recognition accuracy stabilizes, completing the training and learning of the NCap network.
An image to be recognized is input to the attention-mechanism capsule network, which loads the optimal trained network model; the output result is the recognition feature result;
Step four, the capsule network outputs the person attribute feature recognition result for the image to be recognized.
The foregoing is only illustrative of the present invention. A convolutional neural network has the advantage of a small amount of computation but loses important information such as orientation and position, while a capsule network records such important information but requires a large amount of computation. The method therefore combines the two, together with an attention mechanism, to extract and recognize person feature information at multiple scales and ranges, achieving advantages such as high accuracy and multi-attribute recognition.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this does not limit the scope of the invention; those skilled in the art should understand that various modifications and changes may be made to the technical solutions of the invention without inventive effort.
Claims (4)
1. A capsule network multi-feature extraction method based on an attention mechanism is characterized by comprising the following steps:
step 1: construct the attention-mechanism capsule network. The network comprises an input layer, an adjustable convolutional layer, a main network, an attention network, a global average weight, a fully connected structure, and an output layer. The input layer receives training and recognition data; the adjustable convolutional layer adjusts the semantic range of the recognized image; the main network extracts the overall semantics of a person image at different scales; the attention network extracts the shallow and local semantics of the person image at different scales; after converging, the main network and the attention network connect in turn to the global average weight, the fully connected structure, and the output layer;
the main network comprises 3 serially connected Ncap networks, the input end of the Ncap network is connected with the input end of the data image, and the output end of the Ncap network is connected with the global weight calculation;
the attention network comprises 3 layers and 3 cascaded Ncap networks, wherein the input end of each layer of Ncap network is connected with the input end of an image, and the output end of each layer of Ncap network is connected with a global weight calculation;
the Ncap network comprises a working network and a proofreading network, wherein the working network is used for inputting an image and outputting an identification result of the image, and the proofreading network is used for comparing and feeding back training adjustment parameters to the working network;
the working network comprises a convolutional structure and a fully connected structure. The output of each convolutional layer in the convolutional structure is connected both to the corresponding pooling layer in the working network and to the capsule layer of the proofreading capsule in the proofreading network; the output of each pooling layer is connected both to the input of the next convolutional layer in the working network and to the proofreading layer of the proofreading capsule in the proofreading network; and the Nth pooling layer of the convolutional structure is connected to the weight calculation of the fully connected structure in the working network. The fully connected structure consists, in order, of a weight calculation layer and a fully connected layer;
the proofreading network comprises a proofreading capsule, a loss layer and an optimization algorithm layer, wherein the proofreading capsule comprises a capsule, a proofreading layer and a proofreading image error loss, the input end of the capsule is connected with a convolution layer of the working network, the input end of the proofreading layer is respectively connected with the output end of the capsule and a pooling layer of the working network, the output end of the proofreading layer is connected with the image error loss, the output end of the image error loss layer is respectively a convolution layer next to a convolution structure in the working network and the loss layer in the proofreading network, and the input end and the output end of the optimization algorithm layer are respectively connected with the loss layer and the working network;
step 2: inputting an image training set to the attention mechanism capsule network, completing feature extraction of images after training and learning by the attention mechanism capsule network, and outputting an optimal network training model;
and step 3: inputting an image to be recognized to the attention mechanism capsule network and loading an optimal network training model, wherein the output of the working network is the obtained recognition characteristic;
and 4, step 4: and the capsule network outputs the characteristic result of the image to be identified.
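For orientation only, the two-branch flow of claim 1 can be sketched in plain Python. All names here (`tunable_conv`, `squash`, `branch`, `forward`) and the toy moving-average "convolution" are illustrative stand-ins, not the patent's implementation; the capsule squashing nonlinearity of Sabour et al. is borrowed as a toy NCap stage:

```python
import math

def tunable_conv(x, t):
    """Toy adjustable convolution: a moving average whose window size t
    plays the role of the claim's semantic-range parameter."""
    return [sum(x[i:i + t]) / t for i in range(len(x) - t + 1)]

def squash(v):
    """Capsule squashing nonlinearity, borrowed as a stand-in for one
    NCap stage; it keeps the vector direction and bounds its norm below 1."""
    norm2 = sum(c * c for c in v)
    scale = norm2 / (1.0 + norm2) / math.sqrt(norm2 + 1e-9)
    return [scale * c for c in v]

def branch(d, stages=3):
    """Three serially connected NCap stages, as in the main and attention branches."""
    for _ in range(stages):
        d = squash(d)
    return d

def forward(image, t_main=3, t_attn=2):
    i1 = branch(tunable_conv(image, t_main))   # main branch: global semantics
    i2 = branch(tunable_conv(image, t_attn))   # attention branch: local semantics
    merged = i1 + i2                           # converge the two streams
    return sum(merged) / len(merged)           # toy global average weighting
```

The only point the sketch carries over from the claim is the topology: one tunable front-end feeding two NCap branches whose outputs converge before a global weighting.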
2. The attention-based capsule network multi-feature extraction method of claim 1, wherein: the training and learning of the capsule network in step 2 proceeds as follows:
S2.1, inputting the images of the image training set into the adjustable convolution layer and, after adjusting the layer's parameter t, performing the convolution operation to obtain multi-scale image information data D;
S2.2, transmitting the image information data D in the large-scale, shallow semantic range to the main network composed of NCaps, and computing the global feature information I1 of the image;
S2.3, transmitting the image information data D in the small-scale, deep semantic range to the attention network composed of NCaps, and computing the local feature information I2 of the image;
S2.4, merging the computed global feature information I1 and local feature information I2, inputting the merged features into the global weight calculation, performing the full connection operation, and outputting the classification result.
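A minimal sketch of the merge-and-classify step S2.4, assuming I1 and I2 can be flattened to feature lists and that `weights` holds one hypothetical row of full-connection weights per class; softmax stands in for the classification output (the patent does not fix the output nonlinearity):

```python
import math

def softmax(scores):
    """Numerically stable softmax over the class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(i1, i2, weights):
    """S2.4: merge global features I1 and local features I2, apply a
    per-class weighting (toy full connection), and output class probabilities."""
    merged = i1 + i2
    scores = [sum(w * f for w, f in zip(row, merged)) for row in weights]
    return softmax(scores)
```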
3. The attention-based capsule network multi-feature extraction method of claim 1 or 2, wherein: the NCap network trains on the image information data as follows:
Q1, the image information data D is transmitted to the working network of the NCap; before the convolution operation of the working network's convolution structure is performed, the capsule in the proofreading capsule of the proofreading network records the position and orientation information of the data;
Q2, after the convolution and pooling operations of the working network's convolution structure, the pooled image information data is transferred to the proofreading layer in the proofreading capsule of the proofreading network;
Q3, the image information recorded in the capsule is compared with the pooled image information held in the proofreading layer to obtain an image error loss;
Q4, the operations of Q2 and Q3 are repeated, and the n resulting losses are cascaded to obtain the final loss;
Q5, the final loss is processed by the optimization algorithm and then fed back to the working network;
Q6, the working network adjusts the parameters of each layer in reverse order, from back to front, until its recognition accuracy stabilizes, completing the training and learning of the NCap network.
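Assuming the capsule's stored information and the pooled output can both be flattened to equal-length lists, the Q3–Q4 loss cascade can be sketched as follows; mean-squared error and a plain sum are illustrative choices, since the patent does not fix the loss form:

```python
def capsule_loss(pre_conv, post_pool):
    """Q3: one proofreading capsule compares the pre-convolution information
    recorded by the capsule with the pooled output in the proofreading layer."""
    return sum((a - b) ** 2 for a, b in zip(pre_conv, post_pool)) / len(pre_conv)

def cascaded_loss(snapshots):
    """Q4: the n per-capsule image error losses are cascaded (summed here)
    into the final loss passed to the optimization algorithm (Q5)."""
    return sum(capsule_loss(pre, post) for pre, post in snapshots)
```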
4. The attention-based capsule network multi-feature extraction method of claim 1, 2 or 3, wherein: the NCap network structure comprises a working network and a proofreading network; the working network comprises a convolution structure and a full connection structure; the proofreading network comprises n proofreading capsules, a LOSS layer and an optimization algorithm layer; the operating mechanism of the NCap network uses the image information before convolution and the image information after convolution to generate a LOSS error, which is fed back to the next convolution layer of the working network's convolution structure; the n losses generated by the n proofreading capsules are cascaded into the LOSS layer and, after the optimization algorithm, fed back to the working network, whose parameters are adjusted in reverse order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910689204.3A CN112308089A (en) | 2019-07-29 | 2019-07-29 | Attention mechanism-based capsule network multi-feature extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112308089A true CN112308089A (en) | 2021-02-02 |
Family
ID=74329486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910689204.3A Pending CN112308089A (en) | 2019-07-29 | 2019-07-29 | Attention mechanism-based capsule network multi-feature extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112308089A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190034800A1 (en) * | 2016-04-04 | 2019-01-31 | Olympus Corporation | Learning method, image recognition device, and computer-readable storage medium |
CN108596870A (en) * | 2018-03-06 | 2018-09-28 | 重庆金山医疗器械有限公司 | Capsule endoscope image based on deep learning screens out method, apparatus and equipment |
CN108549646A (en) * | 2018-04-24 | 2018-09-18 | 中译语通科技股份有限公司 | A kind of neural network machine translation system based on capsule, information data processing terminal |
CN108898577A (en) * | 2018-05-24 | 2018-11-27 | 西南大学 | Based on the good malign lung nodules identification device and method for improving capsule network |
CN108985316A (en) * | 2018-05-24 | 2018-12-11 | 西南大学 | A kind of capsule network image classification recognition methods improving reconstructed network |
CN109241283A (en) * | 2018-08-08 | 2019-01-18 | 广东工业大学 | A kind of file classification method based on multi-angle capsule network |
CN109710769A (en) * | 2019-01-23 | 2019-05-03 | 福州大学 | A kind of waterborne troops's comment detection system and method based on capsule network |
Non-Patent Citations (6)
Title |
---|
ABDULLAH M. ALGAMDI等: "Learning Temporal Information from Spatial Information Using CapsNets for Human Action Recognition", 《ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
ASSAF HOOGI等: "Self-attention capsule networks for image classification", 《ARXIV》 * |
XIHUI LIU等: "HydraPlus-Net-Attentive Deep Features for Pedestrian Analysis", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 * |
付家慧等: "基于仿射变换的胶囊网络特征研究", 《信号处理》 * |
林少丹等: "结合胶囊网络和卷积神经网络的目标识别模型", 《电讯技术》 * |
王金甲等: "基于注意力胶囊网络的家庭活动识别", 《自动化学报》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298037A (en) * | 2021-06-18 | 2021-08-24 | 重庆交通大学 | Vehicle weight recognition method based on capsule network |
CN113298037B (en) * | 2021-06-18 | 2022-06-03 | 重庆交通大学 | Vehicle weight recognition method based on capsule network |
CN113591556A (en) * | 2021-06-22 | 2021-11-02 | 长春理工大学 | Three-dimensional point cloud semantic analysis method based on neural network three-body model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104866810B (en) | A kind of face identification method of depth convolutional neural networks | |
CN110516536B (en) | Weak supervision video behavior detection method based on time sequence class activation graph complementation | |
CN109190581A (en) | Image sequence target detection recognition methods | |
CN111460980B (en) | Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion | |
CN111860587B (en) | Detection method for small targets of pictures | |
CN113628244B (en) | Target tracking method, system, terminal and medium based on label-free video training | |
CN111680739B (en) | Multi-task parallel method and system for target detection and semantic segmentation | |
CN112464004A (en) | Multi-view depth generation image clustering method | |
CN110263855B (en) | Method for classifying images by utilizing common-basis capsule projection | |
Jiang et al. | Learning to transfer focus of graph neural network for scene graph parsing | |
KR20210100592A (en) | Face recognition technology based on heuristic Gaussian cloud transformation | |
WO2024032010A1 (en) | Transfer learning strategy-based real-time few-shot object detection method | |
CN112308089A (en) | Attention mechanism-based capsule network multi-feature extraction method | |
Zhai et al. | An improved faster R-CNN pedestrian detection algorithm based on feature fusion and context analysis | |
CN112883931A (en) | Real-time true and false motion judgment method based on long and short term memory network | |
Buenaposada et al. | Improving multi-class Boosting-based object detection | |
CN114818963A (en) | Small sample detection algorithm based on cross-image feature fusion | |
CN112541566B (en) | Image translation method based on reconstruction loss | |
CN117853955A (en) | Unmanned aerial vehicle small target detection method based on improved YOLOv5 | |
CN111898560A (en) | Classification regression feature decoupling method in target detection | |
CN110717068A (en) | Video retrieval method based on deep learning | |
CN115329821A (en) | Ship noise identification method based on pairing coding network and comparison learning | |
Chen et al. | Collaborative Learning-Based Network for Weakly Supervised Remote Sensing Object Detection | |
Wang et al. | Real-time and accurate face detection networks based on deep learning | |
CN111428674A (en) | Multi-loss joint training method for keeping multi-metric space consistency |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210202 |