CN114386569A

CN114386569A - Novel image description generation algorithm using capsule network

Info

Publication number: CN114386569A
Application number: CN202111572920.7A
Authority: CN
Inventors: 于红; 刘晗; 刘元秋; 刘雨欣
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2022-04-22
Anticipated expiration: 2041-12-21
Also published as: CN114386569B

Abstract

A novel image description generation algorithm using a capsule network comprises the steps of firstly processing region level image characteristics by using a multi-channel bilinear pooling attention module, and processing the region level characteristics through a bilinear pooling attention mechanism and an extrusion-reward operation to obtain multi-channel attention visual characteristics; then, inputting the multichannel characteristics into a capsule network, taking each dimension of the region-level characteristics as an activity vector in a bottom-layer capsule, and aggregating the region-level characteristics into global-level image characteristics through dynamic routing calculation; and finally, decoding and using the hidden layer vector of the LSTM, the image characteristics and the word and word vector generated at the previous moment as the input of the next moment, and updating the characteristics and the hidden layer state by using a bilinear pooling algorithm so as to generate a corresponding word. Through multiple layers of LSTM, the resulting words make up the corresponding description. The invention realizes that the capsule network is used for capturing the relative position relation in the image description generation process and generating better image description.

Description

Novel image description generation algorithm using capsule network

Technical Field

The invention belongs to the field of artificial intelligence, and mainly relates to a novel image description generation algorithm using a capsule network.

Background

The image description generation task connects two major directions in the field of artificial intelligence, computer vision and natural language processing. In real life, people can automatically establish the relation among visual characteristic information such as scenes, objects and the like in an image and perceive high-level semantic information in the image, but a computer cannot understand and sort the information like the human brain, and the image description generation task aims to convert image characteristics into semantic information so as to provide help for the computer to better understand the content of the image. In order to realize the conversion from image to text, early related work mainly started from two aspects of template and retrieval, or filling the detected object name into a language template to realize the generation of description, or retrieving a similar picture and modifying the description of the similar picture to generate picture description. However, both of these approaches have certain drawbacks: the description generated by the template-based method is limited to a fixed length and the format is not variable; retrieval-based methods rely on data sets, cannot adapt to new pictures, and are difficult to generate high-quality image descriptions.

The framework of the current classical method for generating various image descriptions is an encoder-decoder structure, and research is mainly focused on image feature processing and attention mechanism application. The image feature processing is mainly focused on an encoder part, and features of different areas and different levels are extracted from a picture and then processed, so that the image description quality is improved. For example, the SCA-CNN method analyzes the characteristics of spatiality, multiple layers and multiple channels of the convolutional neural network, and achieves better effect after combining channel attention and space attention. And the Bottom-up method selectively extracts the picture region characteristics through target detection and entity identification, thereby generating more accurate and complete picture description. The X-Linear method uses bilinear pooling of space and channels to obtain second-order interaction of image features, and enhances the expression capability of the model.

The main goal of the attention mechanism work is to enhance the correlation between image regions and words to obtain more semantic details. The visual sentinel method allows an attention mechanism to decide to pay attention to an image or a sentence by itself, thereby generating a corresponding entity word or preposition. The backtracking and prediction method draws attention to the extension to the range of two words, so that the description is more coherent and more accords with the habit of human language. The method of the scene graph enables the algorithm to pay more attention to the entities, attributes and relationships among the entities in the picture, and the accuracy of description is improved. After the Transformer model is applied, the attention mechanism is improved more deeply, and the obtained effect is better.

At present, the method of processing image features to obtain deeper information is a general direction, and can be used as a front method of an attention mechanism, and then fusing two parts can generate image feature representation with higher quality. The existing visual attention mechanism can focus on different positions of pictures in the process of generating a text sequence so as to select corresponding words, but the attention transfer cannot focus on the relative position relation of objects in the pictures in space. The present invention uses a capsule network to improve the attention mechanism, making it possible to take full advantage of the spatial information conveyed in the image to generate a more accurate and detailed description.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a novel image description generation algorithm using a capsule network, and the spatial relative position relation is captured through a transformation matrix, so that the problem that the traditional visual attention mechanism cannot fully capture the spatial relation is solved.

The technical scheme of the invention is as follows:

a novel image description generation algorithm using a capsule network, characterized by the steps of:

(1) processing region-level image features using a bilinear pooling attention module with multiple channels;

taking a region level image feature matrix F and embedding a feature vector Q_E、K_E、V_E，K_EAnd V_EAre all initialized to F, Q_EInitialization as an average pooling of all region-level image features

Wherein f is_iThe ith dimension of F, N the number of region-level image features, Q_E、K_E、V_EThe query vector, the relevance vector and the queried vector in the attention mechanism;

first, calculate Q with low rank bilinear pooling algorithm_EAnd K_EGet an intermediate representation of the product of the ith dimension ki

To Q_EAnd V_EThe ith dimension vi is obtained by bilinear pooling calculation

Wherein,

are each k_i、Q_E、v_i、Q_EThe embedded matrix is an element multiplication operation between the matrixes, and sigma is a nonlinear activation function;

for intermediate representation

Global average pooling by squeeze awards

And capture its channel dependence alpha^c；

Wherein,

W_Bare respectively as

N is

σ is a nonlinear activation function;

according to α c and

obtaining a visual representation of multiple channels

Wherein,

is composed of

The ith dimension of (a);

(2) extracting an image-level visual representation using a capsule network;

visual representation of multiple channels

Inputting into capsule network for 2-4 times of dynamic routing operationUpdating parameters of the capsule network to obtain a final image-level visual representation

The capsule network operation formula is as follows:

wherein, W_i ^fIs mu_iThe transformation matrix of (a) is,

for corresponding mu in capsule network_iThe coupling coefficient of (a);

the spatial coupling coefficient updating formula of the capsule network is as follows:

wherein, b_i、b_jFor dimension i, j, b of the routing matrix in the capsule network_iBy accumulating mu_iAnd

self-updating the product of (a);

(3) visual representation of image level using LSTM and bilinear pooling modules

Decoding to obtain image description;

the decoder comprises a layer of LSTM, a word is generated through the LSTM layer and the bilinear pooling module at each time step, T time steps are circulated, and a description sentence with the length of T is finally obtained, wherein the longest length of T is 17; at time t, the average of the region level image feature matrix is pooled

And image-level visual representation

By a joint representation of

Context vector c calculated at time t-1_t-1And a word vector s generated at time t-1_t-1Splicing is x_tInputting LSTM to obtain a hidden layer vector h_tAnd output to bilinear pooling module and GLU module to obtain context vector c_tGenerating a word st after softmax operation;

the LSTM input x_tThe calculation formula of (a) is as follows:

wherein, W_F、W_xIs an embedded matrix;

the hidden layer vector h_tThe calculation formula of (a) is as follows:

h_t＝LSTM(x_t，h_t-1)

wherein h is_t-1Is a hidden layer state matrix of the LSTM at the time t-1;

said context vector c_tThe calculation formula of (a) is as follows:

c_t＝GLU(F_X-Linear(K_D，V_D，Q_D))

wherein, F_X-LinearA calculation function, K, for a bilinear pooling block_D、V_D、Q_DFor relevance vectors, queried vectors, query vectors, K in a bilinear pooling module_DHidden layer state h initialized to LSTM_t，V_D、Q_DInitialization to a visual joint representation

The formula for generating words is as follows:

s_t＝softmax(W_cc_t)

wherein s is_tFor words generated at time t, W_cIs c_tThe embedded matrix of (2).

The nonlinear activation function is a CELU activation function.

The invention has the beneficial effects that: the invention provides a novel image description generation algorithm using a capsule network. Second-order interaction of image features on the channels is achieved through bilinear pooling attention calculation with multiple channels, and relative position relation of regional features on the space is captured through a capsule network, so that a decoder is guided to notice the position relation between entities in a sentence, words reflecting the position relation are accurately generated, and image description with higher quality is generated.

Drawings

FIG. 1 is a flow chart of an image description generation method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the capsule network effect according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an algorithm framework according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.

Fig. 1 is a flowchart of an image description generation method according to an embodiment of the present invention. As shown in fig. 1, an embodiment of the present invention provides such an improved image description generation method using a capsule network, including:

101, processing region level image characteristics by using a multi-channel bilinear pooling attention module;

step 102, extracting an image-level visual representation using a capsule network;

step 103, visual representation of image level using LSTM and bilinear pooling modules

And decoding is carried out, and image description is obtained after a plurality of time steps are cycled.

In the embodiment of the present invention, the region-level features of the picture correspond to N vectors with length of 2048, which respectively represent corresponding features of N sub-regions in the picture, where the N regions include information of main objects in the picture. Based on region-level image features, we use it to initialize the input Q of a bilinear pooling attention Module with multiple channels_E、K_E、V_E，Q_EAverage pooling matrix set as region level feature, K_E、V_ESet as a region level feature and then input into a bilinear pooling attention module with multiple channels for computation.

And obtaining a regular image feature matrix through low-rank bilinear pooling, wherein the dimension of the regular image feature matrix is Nx 1024, and the regular image feature matrix reflects the image features of the region level. The attention capsule network takes each dimension of the region-level features as an activity vector in a bottom-layer capsule, and through dynamic routing calculation, the spatial relationship information between the salient region and the whole image is retained in a transformation matrix of the dynamic routing, so that the region-level features are aggregated into global-level image features. FIG. 2 is a diagram illustrating the effect of the capsule network according to an embodiment of the present invention on a vector

Different posture transformation is realized through different transformation matrixes, and the posture transformation comprises translation, rotation, scaling and the like. Similarly, each dimension of the region feature represents a significant region, and when the capsule network calculation is carried out, the relative position relation among the regions is kept in the continuously updated transformation matrix, so that the unique global image feature is updated.

The decoding stage, step 103, uses the image-level visual representation generated in step 102, and the hidden layer state in LSTM and the word-word vector generated at the previous time to update the hidden layer state and obtain the output vector to generate the word output at the current time. The word vectors generated at the last moment are fused into the input, directing the decoder to generate sentences that better conform to the human language specification. In the decoding stage, a bilinear pooling attention module is still needed, the input of the bilinear pooling attention module is a hidden layer state and a region level characteristic, the hidden layer state reflects the direction of attention transfer, and the result of bilinear pooling calculation enables the hidden layer state to highlight attention through the fusion of the region level characteristic, so that the generation of words at the next moment is promoted. And finally generating a description sentence corresponding to the picture by the decoder through the LSTM model with multiple time steps.

Fig. 3 is an overall framework diagram of the algorithm, which completely reflects the whole process from one picture to the generation of the description. After an image is input, firstly processing the region level image characteristics by using a multi-channel bilinear pooling attention block, and processing the region level characteristics through bilinear pooling and extrusion reward operation to obtain multi-channel visual characteristics; then, inputting the multi-channel characteristics into an attention capsule network, taking the capsule network of each dimension as a low-level capsule, and aggregating the region-level characteristics into image characteristics of a global level through dynamic routing calculation; and finally, decoding and using the hidden layer vector, the image characteristic and the word and word vector generated at the previous moment of the LSTM as the input of the next moment, and updating the characteristic and the hidden layer state by using bilinear pooling so as to generate a corresponding word. Through multiple layers of LSTM, the resulting words make up the corresponding description. In summary, the following steps: the invention provides an improved image description generation method using a capsule network. Second-order interaction of captured image features is calculated through multi-channel low-rank bilinear pooling, and relative position relation among regional features is extracted through a capsule network, so that a decoder is guided to notice position relation and space information among entities in sentences and generate words reflecting the space position relation, and image description with higher quality is obtained.

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A novel image description generation algorithm using a capsule network, characterized by the steps of:

first, Q is calculated by using low-rank bilinear pooling_EAnd K_EGet an intermediate representation of the product of the ith dimension ki

To Q_EAnd V_EIs subjected to bilinear pooling to obtain

Wherein,

for intermediate representation

Global average pooling by squeeze awards

And capture its channel dependence alpha^c；

Wherein,

W_Bare respectively as

N is

σ is a nonlinear activation function;

according to alpha^cAnd

obtaining a visual representation of multiple channels

Wherein,

is composed of

The ith dimension of (a);

(2) extracting an image-level visual representation using a capsule network;

visual representation of multiple channels

Inputting into capsule network for 2-4 times of dynamic routing operation, updating capsule network parameter after each operation to obtain final image level visual representation

The capsule network operation formula is as follows:

wherein, W_i ^fIs mu_iThe transformation matrix of (a) is,

for corresponding mu in capsule network_iThe coupling coefficient of (a);

self-updating the product of (a);

Decoding to obtain image description;

the decoder comprises a layer of LSTMs in each of whichGenerating words through an LSTM layer and a bilinear pooling module at the time step, and circulating T time steps to finally obtain a description sentence with the length of T, wherein the longest length of T is 17; at time t, the average of the region level image feature matrix is pooled

And image-level visual representation

By a joint representation of

Context vector c calculated at time t-1_t-1And a word vector s generated at time t-1_t-1Splicing is x_tInputting LSTM to obtain a hidden layer vector h_tAnd output to bilinear pooling module and GLU module to obtain context vector c_tGenerating words s after softmax operation_t；

The LSTM input x_tThe calculation formula of (a) is as follows:

wherein, W_F、W_xIs an embedded matrix;

the hidden layer vector h_tThe calculation formula of (a) is as follows:

h_t＝LSTM(x_t，h_t-1)

wherein h is_t-1Is a hidden layer state matrix of the LSTM at the time t-1;

said context vector c_tThe calculation formula of (a) is as follows:

c_t＝GLU(F_X-Linear(K_D，V_D，Q_D))

The formula for generating words is as follows:

s_t＝softmax(W_cc_t)

2. The novel image description generation algorithm using capsule network according to claim 1, characterized in that the nonlinear activation function is a CELU activation function.