CN114782848A - Picture subtitle generating method applying characteristic pyramid - Google Patents

Picture subtitle generating method applying characteristic pyramid

Info

Publication number
CN114782848A
CN114782848A (application CN202210233662.8A)
Authority
CN
China
Prior art keywords
picture
layer
dimension
batch
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210233662.8A
Other languages
Chinese (zh)
Other versions
CN114782848B (en)
Inventor
徐萍
毕东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Yayi Network Technology Co ltd
Original Assignee
Shenyang Yayi Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co ltd filed Critical Shenyang Yayi Network Technology Co ltd
Priority to CN202210233662.8A priority Critical patent/CN114782848B/en
Publication of CN114782848A publication Critical patent/CN114782848A/en
Application granted granted Critical
Publication of CN114782848B publication Critical patent/CN114782848B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for generating a picture subtitle by applying a feature pyramid, which comprises the following steps: inputting the preprocessed picture into a feature pyramid module and extracting feature maps as picture feature information, which together with the preprocessed picture are called picture features of three different scales (low dimension, high dimension and original dimension); sending the original-dimension picture features into an embedding layer to be converted into a vector representation; sending the picture features of the three different scales into the first layer of an encoder and carrying out dimension scaling; sending the hidden layer information of consistent dimension to the stacked higher layers of the encoder to obtain three encoder features, splicing them to obtain fused picture features, and sending the fused picture features to the decoder of the model for decoding; and performing gradient updating through a cross entropy loss function and optimizing the weights of the model to obtain the picture subtitle generating method. The invention enhances the semantic expression capability of the picture from different angles and different visual-field distances, and effectively reduces the calculation cost of the self-attention mechanism and the feedforward neural network in the encoder.

Description

Picture subtitle generating method applying characteristic pyramid
Technical Field
The invention relates to image and language processing technology, and in particular to a picture subtitle generating method based on a feature pyramid.
Background
Image captioning (Image Caption) can be regarded as a global object detection task: a sentence describing the content of a picture is generated from the whole picture. Early image caption generation methods were based on traditional machine learning; they extracted image features with hand-crafted image processing operators, classified them with a support vector machine or similar to obtain the targets in the image, and then used the obtained targets and their attributes as the basis for generating sentences, for example using a CRF or a set of established rules to produce descriptions of images. This practice is not ideal in real applications and depends heavily on 1) the extraction of image features and 2) the rules required when generating sentences.
Deep learning has driven the rapid development of computer vision, and image coding and feature extraction have greatly benefited from the development of CNNs. With the emergence of deep CNN encoders such as VGG (the Visual Geometry Group network), the accuracy of tasks such as image recognition improved rapidly. Owing to the strong image feature extraction capability of CNNs, it became mainstream practice to use a deep CNN as the image feature encoder in the image caption task. Google proposed the Neural Image Caption model in 2014 as the pioneering work of this approach, and subsequent models with a large impact on image caption development, such as NeuralTalk, almost all follow this basic framework.
As the Transformer model has become more popular and more widely advocated in the field of natural language processing, much work in the image field now tries to use Transformers to extract more powerful image features. The Vision Transformer, developed from the Transformer, has achieved good results on the major image tasks. As shown in fig. 1, the Vision Transformer still adopts a codec (encoder-decoder) structure and uses an attention mechanism to encode and decode the picture features and the position information of the sub-pictures. In the attention calculation, a multi-head splitting scheme is adopted so that different heads attend to information in different semantic subspaces of the picture. The attention mechanisms include self-attention and codec attention. They differ in that in self-attention the query, key and value vectors are all intermediate vectors of the same layer, whereas in codec attention the query vector is an intermediate vector at the decoding end and the key and value vectors are the encoding vectors output by the encoding end.
As shown in fig. 2, the feature pyramid model was originally proposed in the image domain. Image tasks generally employ convolutional neural networks, which naturally take the shape of a pyramid owing to the pooling layers. In the target detection task, targets come in different sizes, and the granularity of the features extracted at different layers of the convolutional network also differs, so the feature maps of different layers can be regarded as picture feature information at different scales.
However, because of the feature extraction in the feature pyramid, the dimensions of the image features change and become inconsistent, so they cannot be fused; as a result, the feature pyramid structure cannot be used directly in practical applications to enhance the image information and obtain a stronger image representation.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a picture subtitle generating method applying a feature pyramid. Before the picture is sent to a Vision Transformer model for coding and decoding picture features, the feature pyramid model is beneficial to more fully extracting the picture features.
In order to solve the technical problems, the invention adopts the technical scheme that:
a method for generating a picture subtitle by applying a feature pyramid comprises the following steps:
1) inputting the preprocessed picture into a feature pyramid module, and extracting features of the picture through the multilayer convolutional neural network in the feature pyramid module; the feature maps of a low-level convolutional layer and of the top-level convolutional layer are extracted as picture feature information at two scales, low-dimensional and high-dimensional, and together with the preprocessed picture they are called picture features of three different scales: low dimension, high dimension and original dimension;
2) sending the original dimensional picture characteristics into an embedding layer to be converted into vector representation;
3) sending the three picture characteristics with different dimensions into a first layer of an encoder, and carrying out dimension scaling, namely scaling the picture characteristics with different dimensions into hidden layer information with the same dimension through a self-attention mechanism and a feedforward neural network;
4) sending hidden layer information with consistent dimension size to a stacked high layer in an encoder to obtain three encoder characteristics, and obtaining fused picture characteristics through splicing operation;
5) and sending the fused picture characteristics into a decoder of the model for decoding, decoding the picture characteristics into picture subtitles by the decoder through stacked decoder layers, performing gradient updating through a cross entropy loss function, and optimizing the weight of the model to obtain the picture subtitle generating method.
In the step 1), preprocessing picture data, inputting the picture into a feature pyramid module, and extracting features through a multilayer convolutional neural network, wherein the convolutional neural network is calculated in the following way:
out(x, i) = b_i + Σ_{k=1..C_in} weight(i, k) ⋆ input(x, k)
weight(i, j) = w[:, :, i, j]
input(x, k) = x[:, :, k]
where x ∈ R^(H×W×C) is the tensor representation of the picture in the computer, H is the height of the picture, W its width and C its number of channels; ⋆ denotes the 2D cross-correlation operation; w ∈ R^(K_h×K_w×C_out×C_in) is the convolution kernel, where C_out is the number of channels of the output features, C_in the number of channels of the input features, and K_h×K_w the kernel size; w[:, :, i, j] takes the i-th and j-th slices of w along its third and fourth dimensions, x[:, :, k] takes the k-th slice of x along its third dimension, and b denotes a bias constant; weight(i, j) represents the j-th convolution kernel of channel i in the convolutional neural network, and input(x, k) represents the tensor of the k-th channel of the input x;
taking the output of the first layer of convolutional neural network as a low-dimensional picture characteristic, and taking the output of the last layer of convolutional neural network as a high-dimensional picture characteristic; the original picture is used as the original dimension picture characteristic.
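For illustration only, the following is a minimal PyTorch sketch of such a feature pyramid module; the number of convolutional layers, channel widths, kernel sizes and strides are assumptions for demonstration, not values fixed by the invention. The output of the first convolutional layer serves as the low-dimensional picture feature, the output of the last layer as the high-dimensional picture feature, and the input picture itself as the original-dimension picture feature (channels-first shapes):

import torch
import torch.nn as nn

class FeaturePyramid(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        # first convolutional layer: its output is taken as the low-dimensional feature
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU())
        # last convolutional layer: its output is taken as the high-dimensional feature
        self.conv3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        low = self.conv1(x)     # e.g. (16, 16, 192, 192) for a (16, 3, 384, 384) input
        mid = self.conv2(low)   # channel count grows and spatial size shrinks per layer
        high = self.conv3(mid)  # e.g. (16, 64, 48, 48)
        return x, low, high     # original, low-dimensional and high-dimensional features

original, low_dim, high_dim = FeaturePyramid()(torch.randn(16, 3, 384, 384))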
Step 2) sending the original dimension picture characteristics to an embedding layer to be converted into vector representation, namely, adjusting the height and width of the original dimension picture characteristics to a specified size, dividing the original dimension picture characteristics into sub-pictures with fixed sizes, wherein each sub-picture is called a patch, and then sending the sub-pictures to the embedding layer to obtain the code of each patch, namely, a picture embedding vector, and specifically, the method comprises the following steps:
201) the picture (batch, c, h, w) is divided into sub-pictures with resolution p1×p2: each original-dimension picture is first cut into (h/p1)*(w/p2) small blocks, i.e. from (batch, c, p1*(h/p1), p2*(w/p2)) to (batch, c, (h/p1)*(w/p2), p1*p2), and then converted into (batch, (h/p1)*(w/p2), p1*p2*c), which is equivalent to dividing it into (h/p1)*(w/p2) patches, each of dimension p1*p2*c; this process is implemented by the following calculation:
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p1, p2=p2)
where the rearrange function is an operator of the einops library, p1 and p2 are the patch size, c is the number of channels, b is the batch size, and h and w are the height and width of the image respectively.
202) After the sub-pictures are divided, the embedding vector of the original-dimension picture feature is obtained, and its dimension is adjusted to the required size through one fully connected layer.
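A minimal sketch of steps 201) and 202), assuming an illustrative patch size of 16×16 and a target embedding dimension of 512 (neither value is prescribed by the invention); the rearrange call and the single fully connected layer correspond to the operations described above:

import torch
import torch.nn as nn
from einops import rearrange

batch, c, h, w = 16, 3, 384, 384              # illustrative picture tensor shape
p1 = p2 = 16                                  # assumed patch size
d_model = 512                                 # assumed embedding dimension

img = torch.randn(batch, c, h, w)
# (batch, c, h, w) -> (batch, (h/p1)*(w/p2), p1*p2*c): one row per patch
patches = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p1, p2=p2)
# patches.shape == (16, 576, 768)

embedding = nn.Linear(p1 * p2 * c, d_model)   # one fully connected layer
x = embedding(patches)                        # picture embedding vectors, (16, 576, 512)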
In step 3), the picture features with different dimensions are sent into the first layer of the encoder, which is composed of three coding layers of different dimensions corresponding respectively to the three picture features; each coding layer is composed of a self-attention mechanism and a feedforward neural network, and the multi-head self-attention mechanism is calculated as follows:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where Q, K, V are the input vectors of the model, head_i is the vector of the i-th head, the W matrices are model parameters, Attention(·) is the attention mechanism function, and Concat(·) is the vector concatenation function;
the feedforward neural network layer FFN is calculated as follows:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where x is the hidden layer vector and W_1, W_2, b_1, b_2 are model parameters obtained automatically through training;
301) (batch, l_0, d_0) is the original picture feature, (batch, h_1, w_1, d_1) the low-dimensional picture feature and (batch, h_2, w_2, d_2) the high-dimensional picture feature; the low-dimensional and high-dimensional picture features are reshaped to (batch, h_1*w_1, d_1) and (batch, h_2*w_2, d_2); sending the three picture features into the multi-head self-attention mechanism yields picture features (batch, l_0, d), (batch, l_1, d) and (batch, l_2, d) with a consistent last dimension d;
302) the three picture features are then sent into the feedforward neural network, which again yields hidden layer information with the same third dimension.
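The dimension scaling in the first encoder layer can be sketched as follows; this is an illustrative implementation under assumptions (hidden size 512, 8 heads, feed-forward size 2048, small input shapes), in which the Q/K/V projections of the multi-head self-attention map each scale-specific input dimension to the shared hidden size and the feedforward neural network keeps that size:

import torch
import torch.nn as nn

class ScalingEncoderLayer(nn.Module):
    # One coding layer of the first encoder layer: the Q/K/V projections map the
    # scale-specific dimension d_in to the shared hidden size d_model.
    def __init__(self, d_in, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.h, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_in, d_model)
        self.w_k = nn.Linear(d_in, d_model)
        self.w_v = nn.Linear(d_in, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(              # FFN(x) = max(0, x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                      # x: (batch, length, d_in)
        b, l, _ = x.shape
        split = lambda t: t.view(b, l, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(b, l, -1)
        return self.ffn(self.w_o(out))         # (batch, length, d_model)

# Original, low-dimensional and high-dimensional picture features with different
# lengths and last dimensions (small illustrative sizes, not real feature maps).
feats = [torch.randn(2, 576, 768), torch.randn(2, 144, 16), torch.randn(2, 36, 64)]
branches = [ScalingEncoderLayer(d_in=f.shape[-1]) for f in feats]
hidden = [layer(f) for layer, f in zip(branches, feats)]
# every element of hidden now ends with dimension 512 and can be fused later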
The invention has the following beneficial effects and advantages:
1. The invention uses a feature pyramid to extract richer picture feature information at different scales, so that the encoder captures picture semantic information of different scales;
2. By arranging multi-head self-attention with dimension normalization in the first layer, the encoder effectively converts the picture features of different scales to a consistent dimensionality, which facilitates feature fusion;
3. The invention adds only one feature pyramid structure and two additional sub-layer structures in the first layer of the encoder, and improves the quality of the generated picture captions while hardly increasing the model parameters.
Drawings
FIG. 1 is a diagram of a Vision Transformer prototype model according to the present invention;
FIG. 2 is a diagram of feature pyramid extracted picture features in the present invention;
FIG. 3 is a diagram of a feature pyramid based Vision Transformer refinement model in accordance with the present invention.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention provides a method for generating a picture subtitle by applying a feature pyramid. The pyramid structure changes the dimensions of the features, so that the features cannot be fused directly. On the basis of Vision Transformer, the invention scales the picture features to a consistent size by arranging extra coding layers in the first encoder layer, so as to facilitate feature fusion.
As shown in fig. 3, the present invention provides a method for generating a subtitle using a feature pyramid, including the following steps:
1) inputting the preprocessed picture into the feature pyramid module and extracting features of the picture through a multilayer convolutional neural network; the feature maps of a low-level convolutional layer and of the top-level convolutional layer are extracted as picture feature information at two scales, and together with the preprocessed picture they are called picture features of three different scales: low dimension, high dimension and original dimension;
2) converting the original-level picture features into vector representation through an embedding layer
3) Sending the image characteristics of three different scales into a first layer of an encoder, and carrying out dimension scaling, namely scaling into hidden layer information with the same dimension size through a self-attention mechanism and a feedforward neural network;
4) sending hidden layer information with consistent dimension size to a stacked high layer in an encoder to obtain three encoder characteristics, and obtaining fused picture characteristics by splicing the three encoder characteristics
5) And sending the fused picture characteristics into a decoder of the model for decoding, decoding the picture characteristics into picture subtitles by the decoder through stacked decoder layers, performing gradient updating through a cross entropy loss function, and optimizing the weight of the model.
In the step 1), preprocessing picture data, inputting pictures into a feature pyramid module, and extracting features through a multilayer convolutional neural network. The calculation mode of the convolutional neural network is as follows:
out(x, i) = b_i + Σ_{k=1..C_in} weight(i, k) ⋆ input(x, k)
weight(i, j) = w[:, :, i, j]
input(x, k) = x[:, :, k]
where x ∈ R^(H×W×C) is the tensor representation of the picture in the computer, H is the height of the picture, W its width and C its number of channels; ⋆ denotes the 2D cross-correlation operation; w ∈ R^(K_h×K_w×C_out×C_in) is the convolution kernel, where C_out is the number of channels of the output features, C_in the number of channels of the input features, and K_h×K_w the kernel size; w[:, :, i, j] takes the i-th and j-th slices of w along its third and fourth dimensions, x[:, :, k] takes the k-th slice of x along its third dimension, and b denotes a bias constant; weight(i, j) represents the j-th convolution kernel of channel i in the convolutional neural network, and input(x, k) represents the tensor of the k-th channel of the input x.
Taking an input (batch, h, w, C_in) = (16, 384, 384, 3) as an example, the pyramid structure contains multiple layers of convolutional neural networks. Each convolution calculation produces a new picture feature map (batch, h', w', C_out), e.g. (16, 192, 192, 16); C_out of each convolutional layer becomes larger, while h' and w' become smaller as the number of layers increases. The output of the first convolutional layer is taken as the low-dimensional picture feature, and the output of the last convolutional layer as the high-dimensional picture feature. The original picture is referred to as the original-level picture feature.
In step 2), the height and width of the original-level picture features are adjusted to a specified size and then divided into sub-pictures with fixed sizes, each sub-picture is called a patch, and then the sub-pictures are sent to an embedding layer to obtain a code for each patch, namely a picture embedding vector, specifically:
301) Taking an input (batch, c, h, w) with a patch resolution of p×p as an example, the division into sub-pictures proceeds as follows: each original-level picture is first cut into (h/p)*(w/p) small blocks, i.e. from (batch, c, p*(h/p), p*(w/p)) to (batch, c, (h/p)*(w/p), p*p), and then converted into (batch, (h/p)*(w/p), p*p*c), which is equivalent to dividing it into (h/p)*(w/p) patches, each of dimension p*p*c; this process can be implemented by the following calculation:
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
where the rearrange function is an operator of the einops library, p1 and p2 are the patch size, c is the number of channels, b is the batch size, and h and w are the height and width of the image respectively.
302) And after the sub-pictures are divided, obtaining an embedded vector of the original-level picture characteristics, and adjusting the dimensionality of the embedded vector, namely adjusting the dimensionality to a required size through a full-connection layer.
In step 3), the picture features with different dimensions are sent to the first layer of the encoder, and the first layer of the encoder is composed of three coding layers with different dimensions and respectively corresponds to the three picture features. Wherein the coding layer is composed of a self-attention mechanism and a feedforward neural network. The calculation method of the multi-head self-attention mechanism is as follows:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where Q, K, V are the input vectors of the model, head_i is the vector of the i-th head, the W matrices are model parameters, Attention(·) is the attention mechanism function, and Concat(·) is the vector concatenation function;
the feedforward neural network layer FFN is calculated as follows:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where x is the hidden layer vector and W_1, W_2, b_1, b_2 are model parameters obtained automatically through training;
401) Take the original picture features (batch, l_0, d_0), the low-dimensional picture features (batch, h_1, w_1, d_1) and the high-dimensional picture features (batch, h_2, w_2, d_2) as an example. The low-dimensional and high-dimensional picture features are reshaped to (batch, h_1*w_1, d_1) and (batch, h_2*w_2, d_2). Sending the three picture features into the multi-head self-attention mechanism yields picture features (batch, l_0, d), (batch, l_1, d) and (batch, l_2, d) with consistent dimensions.
402) The three picture features are then sent into the feedforward neural network, which again yields hidden layer information with the same third dimension.
In step 4), the hidden layer information with consistent dimensions is sent to the stacked higher layers of the encoder to obtain three encoder features. Finally, the three encoder features are spliced to obtain a fused picture feature of greater length.
And 5) sending the fused picture characteristics into a decoder, and decoding the picture characteristics into picture subtitles through stacked decoder layers, wherein the decoder layers are composed of a multi-head self-attention mechanism and a feedforward neural network. And the whole neural network carries out gradient updating through a cross entropy loss function, and the weight of the model is optimized.
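Steps 4) and 5) can be sketched as follows; the use of torch.nn.TransformerDecoder as the stacked decoder layers, the vocabulary size and all tensor shapes are assumptions for illustration, not the exact configuration of the invention:

import torch
import torch.nn as nn

d_model, vocab = 512, 10000                      # assumed hidden size and vocabulary size
enc_orig = torch.randn(2, 576, d_model)          # encoder features, original scale
enc_low = torch.randn(2, 144, d_model)           # encoder features, low-dimensional scale
enc_high = torch.randn(2, 36, d_model)           # encoder features, high-dimensional scale

# step 4: splice the three encoder features along the length axis
memory = torch.cat([enc_orig, enc_low, enc_high], dim=1)    # fused picture features

# step 5: stacked decoder layers (multi-head attention + feedforward network)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
to_vocab = nn.Linear(d_model, vocab)

caption_emb = torch.randn(2, 20, d_model)                   # embedded caption prefix
target_ids = torch.randint(0, vocab, (2, 20))               # gold caption token ids

logits = to_vocab(decoder(caption_emb, memory))             # (2, 20, vocab)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab), target_ids.reshape(-1))
loss.backward()                                             # gradient update of the model weights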
In order to verify the effectiveness of the method, the picture subtitle generating method applying the feature pyramid was applied to the picture caption generation task of MSCOCO2017. The MSCOCO2017 data set contains about 120,000 pictures, each with 5 corresponding caption annotations. As can be seen from Table 1, compared with the original Vision Transformer model, the BLEU value of the captions generated by the model is clearly improved, which shows that the method provided by the invention can effectively improve the quality of picture caption generation.
Table 1
Model                                   MSCOCO2017 (BLEU)
Original Vision Transformer             37.9
Feature pyramid Vision Transformer      39.6
According to the invention, before the picture features are sent to the encoder, information at different scales is extracted through the feature pyramid, and the semantic expression capability of the picture is enhanced from different angles and different visual-field distances, so that the information contained in the picture is extracted more effectively. Meanwhile, the invention adopts an encoder that shares parameters, which effectively reduces the calculation cost of the self-attention mechanism and the feedforward neural network in the encoder.

Claims (4)

1. A method for generating a picture subtitle by applying a feature pyramid, characterized by comprising the following steps:
1) inputting the preprocessed picture into a feature pyramid module, and extracting features of the picture through the multilayer convolutional neural network in the feature pyramid module; the feature maps of a low-level convolutional layer and of the top-level convolutional layer are extracted as picture feature information at two scales, low-dimensional and high-dimensional, and together with the preprocessed picture they are called picture features of three different scales: low dimension, high dimension and original dimension;
2) sending the original dimensional picture characteristics into an embedding layer to be converted into vector representation;
3) sending the three picture characteristics with different dimensions into a first layer of an encoder, and carrying out dimension scaling, namely scaling the picture characteristics with different dimensions into hidden layer information with the same dimension through a self-attention mechanism and a feedforward neural network;
4) hidden layer information with consistent dimensions is sent to a high layer stacked in an encoder to obtain three encoder characteristics, and fused picture characteristics are obtained through splicing operation;
5) and sending the fused picture characteristics into a decoder of the model for decoding, decoding the picture characteristics into picture subtitles by the decoder through stacked decoder layers, performing gradient updating through a cross entropy loss function, and optimizing the weight of the model to obtain the picture subtitle generating method.
2. The method for generating a picture subtitle using a feature pyramid as claimed in claim 1, wherein: in the step 1), preprocessing image data, inputting the image into a characteristic pyramid module, and performing characteristic extraction through a multilayer convolutional neural network, wherein the convolutional neural network is calculated in the following way:
out(x, i) = b_i + Σ_{k=1..C_in} weight(i, k) ⋆ input(x, k)
weight(i, j) = w[:, :, i, j]
input(x, k) = x[:, :, k]
where x ∈ R^(H×W×C) is the tensor representation of the picture in the computer, H is the height of the picture, W its width and C its number of channels; ⋆ denotes the 2D cross-correlation operation; w ∈ R^(K_h×K_w×C_out×C_in) is the convolution kernel, where C_out is the number of channels of the output features, C_in the number of channels of the input features, and K_h×K_w the kernel size; w[:, :, i, j] takes the i-th and j-th slices of w along its third and fourth dimensions, x[:, :, k] takes the k-th slice of x along its third dimension, and b denotes a bias constant; weight(i, j) represents the j-th convolution kernel of channel i in the convolutional neural network, and input(x, k) represents the tensor of the k-th channel of the input x,
taking the output of the first layer of convolutional neural network as a low-dimensional picture characteristic, and taking the output of the last layer of convolutional neural network as a high-dimensional picture characteristic; the original picture is used as the original dimension picture characteristic.
3. The method for generating a picture subtitle using a feature pyramid as claimed in claim 1, wherein: step 2) sending the original dimension picture characteristics to an embedding layer to be converted into vector representation, namely, adjusting the height and width of the original dimension picture characteristics to a specified size, dividing the original dimension picture characteristics into sub-pictures with fixed sizes, wherein each sub-picture is called a patch, and then sending the sub-pictures to the embedding layer to obtain the code of each patch, namely, a picture embedding vector, and specifically, the method comprises the following steps:
201) the picture (batch, c, h, w) is divided into sub-pictures with resolution p1×p2: each original-dimension picture is first cut into (h/p1)*(w/p2) small blocks, i.e. from (batch, c, p1*(h/p1), p2*(w/p2)) to (batch, c, (h/p1)*(w/p2), p1*p2), and then converted into (batch, (h/p1)*(w/p2), p1*p2*c), which is equivalent to dividing it into (h/p1)*(w/p2) patches, each of dimension p1*p2*c; this process is implemented by the following calculation:
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p1, p2=p2)
where the rearrange function is an operator of the einops library, p1 and p2 are the patch size, c is the number of channels, b is the batch size, and h and w are the height and width of the image respectively;
202) after the sub-pictures are divided, the embedding vector of the original-dimension picture feature is obtained, and its dimension is adjusted to the required size through one fully connected layer.
4. The method for generating a picture subtitle using a feature pyramid as claimed in claim 1, wherein: in step 3), the picture features with different dimensions are sent into the first layer of the encoder, which is composed of three coding layers of different dimensions corresponding respectively to the three picture features; each coding layer is composed of a self-attention mechanism and a feedforward neural network, and the multi-head self-attention mechanism is calculated as follows:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where Q, K, V are the input vectors of the model, head_i is the vector of the i-th head, the W matrices are model parameters, Attention(·) is the attention mechanism function, and Concat(·) is the vector concatenation function;
the feedforward neural network layer FFN is calculated as follows:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where x is the hidden layer vector and W_1, W_2, b_1, b_2 are model parameters obtained automatically through training;
301) (batch, l_0, d_0) is the original picture feature, (batch, h_1, w_1, d_1) the low-dimensional picture feature and (batch, h_2, w_2, d_2) the high-dimensional picture feature; the low-dimensional and high-dimensional picture features are reshaped to (batch, h_1*w_1, d_1) and (batch, h_2*w_2, d_2); sending the three picture features into the multi-head self-attention mechanism yields picture features (batch, l_0, d), (batch, l_1, d) and (batch, l_2, d) with a consistent last dimension d;
302) the three picture features are then sent into the feedforward neural network, which again yields hidden layer information with the same third dimension.
CN202210233662.8A 2022-03-10 2022-03-10 Picture subtitle generation method applying feature pyramid Active CN114782848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210233662.8A CN114782848B (en) 2022-03-10 2022-03-10 Picture subtitle generation method applying feature pyramid


Publications (2)

Publication Number Publication Date
CN114782848A true CN114782848A (en) 2022-07-22
CN114782848B CN114782848B (en) 2024-03-26

Family

ID=82424138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210233662.8A Active CN114782848B (en) 2022-03-10 2022-03-10 Picture subtitle generation method applying feature pyramid

Country Status (1)

Country Link
CN (1) CN114782848B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
US20210089807A1 (en) * 2019-09-25 2021-03-25 Samsung Electronics Co., Ltd. System and method for boundary aware semantic segmentation
CN113159034A (en) * 2021-04-23 2021-07-23 杭州电子科技大学 Method and system for automatically generating subtitles by using short video
CN113378973A (en) * 2021-06-29 2021-09-10 沈阳雅译网络技术有限公司 Image classification method based on self-attention mechanism
CN113657124A (en) * 2021-07-14 2021-11-16 内蒙古工业大学 Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
严娟; 方志军; 高永彬: "3D object detection combining mixed-domain attention and dilated convolution" (结合混合域注意力与空洞卷积的3维目标检测), Journal of Image and Graphics (中国图象图形学报), no. 06, 16 June 2020 (2020-06-16) *
杜海骏; 刘学亮: "Image caption generation method incorporating constraint learning" (融合约束学习的图像字幕生成方法), Journal of Image and Graphics (中国图象图形学报), no. 02, 16 February 2020 (2020-02-16) *

Also Published As

Publication number Publication date
CN114782848B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN113378973B (en) Image classification method based on self-attention mechanism
CN113934890B (en) Method and system for automatically generating scene video by characters
CN114170174B (en) CLANet steel rail surface defect detection system and method based on RGB-D image
CN113159034A (en) Method and system for automatically generating subtitles by using short video
CN112819013A (en) Image description method based on intra-layer and inter-layer joint global representation
CN111861945A (en) Text-guided image restoration method and system
Kang et al. Ddcolor: Towards photo-realistic image colorization via dual decoders
Li et al. Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis
CN113344036A (en) Image description method of multi-mode Transformer based on dynamic word embedding
Ma et al. Latte: Latent diffusion transformer for video generation
Sarhan et al. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks
CN115601562A (en) Fancy carp detection and identification method using multi-scale feature extraction
CN114780775A (en) Image description text generation method based on content selection and guide mechanism
Fang et al. Sketch assisted face image coding for human and machine vision: a joint training approach
CN110633706A (en) Semantic segmentation method based on pyramid network
CN114782848B (en) Picture subtitle generation method applying feature pyramid
CN116862949A (en) Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
CN116823600A (en) Scene text image reconstruction method integrating semantic priori and weighting loss
CN116244464A (en) Hand-drawing image real-time retrieval method based on multi-mode data fusion
CN113869154B (en) Video actor segmentation method according to language description
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN112446372B (en) Text detection method based on channel grouping attention mechanism
Bae et al. IPSILON: incremental parsing for semantic indexing of latent concepts
Wang et al. Self-prior guided pixel adversarial networks for blind image inpainting
CN117710986B (en) Method and system for identifying interactive enhanced image text based on mask

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant