CN113535999B - Diversified image description sentence generation technology based on deep learning - Google Patents

Diversified image description sentence generation technology based on deep learning

Info

Publication number
CN113535999B
CN113535999B CN202110758735.0A CN202110758735A
Authority
CN
China
Prior art keywords
image
style
model
image description
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110758735.0A
Other languages
Chinese (zh)
Other versions
CN113535999A (en)
Inventor
任磊
孟子豪
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202110758735.0A priority Critical patent/CN113535999B/en
Publication of CN113535999A publication Critical patent/CN113535999A/en
Application granted granted Critical
Publication of CN113535999B publication Critical patent/CN113535999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a diversified image description sentence generation technology based on deep learning, belonging to the technical field of image description sentence generation. The technology addresses the problems of conventional image description sentence generation, in which the generated sentences are monotonous and image details are broadly ignored, and is suitable for generating diversified description sentences for an image.

Description

Diversified image description sentence generation technology based on deep learning
Technical field:
The invention relates to a diversified image description sentence generation technology based on deep learning and belongs to the technical field of image description sentence generation.
Background art:
Currently, image description sentence generation techniques based on deep learning are generally implemented by constructing an "encoder-decoder" model. The encoder converts the digital matrix of an image into a high-dimensional feature code rich in semantic information and is usually realized with a residual model based on a convolutional neural network. The decoder decodes these high-dimensional features and feeds the semantic information into a text generation model to obtain descriptive sentences; decoders for text generation generally adopt one of two structures, based either on a long short-term memory network or on the self-attention mechanism of a Transformer. The model loss is typically computed against multiple reference sentences using cross entropy as the loss function. However, a model trained in this way tends to generate overly broad sentence descriptions at inference time: for two similar scenes whose images differ in detail, the model tends to produce partially summarizing, broad sentences, so the details in the images are ignored and the requirement of generating diverse description sentences cannot be met.
Accordingly, based on the above-mentioned problems, the present invention has devised a diversified image description sentence generation technique based on deep learning.
Summary of the invention:
The invention aims to provide a diversified image description sentence generation technology based on deep learning, which addresses the problems of conventional image description sentence generation techniques, namely that the generated sentences are monotonous and image details are broadly ignored, and which is suitable for generating diversified description sentences for an image.
Convolutional neural networks are widely used in various computer vision tasks. A deep convolutional neural network performs feature extraction on an image and is typically built from convolution layers (Convolution Layer) and pooling layers (Pooling Layer).
Pictures are typically stored in a computer as digital matrices in which each element represents the content information of the corresponding position of the picture. The number of matrices and the relationship between values at the same position in different matrices depend on the color type of the picture. For such a digital matrix, the convolutional neural network sets several convolution kernels (filters) in each convolution layer to detect specific features, mapping the original input picture to a high-dimensional feature space and forming an output matrix.
The output matrix of the convolution layer is passed to the pooling layer for a "subsampling" operation, which reduces the matrix size. The pooling layer divides the convolution layer's output matrix into regions, applies a nonlinear pooling function to extract the relative positions of different features in the convolution output matrix, and splices the outputs together to form the pooling layer output.
In general, the convolution layer and the pooling layer are stacked and reused, continually shrinking the originally large numerical matrix of the input picture and thereby extracting the specific features. However, as the network grows deeper, problems such as vanishing and exploding gradients limit the training performance of the network. To address this, the deep residual network was proposed. The deep residual network also stacks multiple convolution-pooling layers, but a skip connection (Skip Connection) structure is arranged at each subunit to alleviate the vanishing-gradient problem caused by increased network depth.
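The skip-connection idea can be illustrated with a minimal residual block sketch in Python (PyTorch); the class name, channel count and layer sizes below are illustrative assumptions, not the exact ResNet-101 sub-units used by the invention:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal convolution + skip-connection unit (illustrative only)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: add the input back so gradients can bypass the convolutions.
        return self.relu(out + x)

# Example: a 64-channel feature map passes through the block with its shape unchanged.
x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```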
In the invention, a ResNet-101 model, a standard configuration of the deep residual network, is used to extract the features of the image. As shown in fig. 1, the image file is preprocessed and then passed through the deep convolutional neural network to form the high-dimensional semantic feature code of the image.
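As a hedged sketch of this feature-extraction step, a pretrained ResNet-101 from torchvision can be truncated before its classification head so that it outputs a grid of high-dimensional features; the resize to 224x224 and the ImageNet mean/std constants are assumptions, since the patent only specifies [0, 1] scaling and standardization, and the input file name is hypothetical:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Preprocessing with standard ImageNet statistics (assumed values).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                      # maps pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
# Drop the average-pooling and fully-connected layers to keep the spatial feature map.
encoder_cnn = nn.Sequential(*list(resnet.children())[:-2]).eval()

image = Image.open("example.jpg").convert("RGB")   # hypothetical input file
with torch.no_grad():
    features = encoder_cnn(preprocess(image).unsqueeze(0))
print(features.shape)  # e.g. torch.Size([1, 2048, 7, 7]) -> 49 regions of 2048-dim features
```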
The image description sentence generation model based on the self-attention mechanism likewise consists of an encoder and a decoder, and its overall structure is shown in fig. 4. The model is trained by supervised learning and can be divided into two parts according to their inputs: the encoder, which takes as input the high-dimensional semantic features of the image extracted by the deep convolutional neural network, and the decoder, which takes as input the embedded sequence of a manually written image description sentence.
On the one hand, the high-dimensional semantic features of the image are directly input into the first coding block in the encoder, and the structure of the coding block is shown in fig. 2. The output of each coding block is the input of the next layer until the sixth coding block gives the output.
On the other hand, the manually written image description sentence is input to the decoder after word embedding and position embedding. The detailed design and key point of the invention are as follows: compared with a traditional decoder, the designed decoder not only adds an encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) sub-layer with respect to the encoder, but also adds a style parameter matrix for image description sentence inputs of different styles.
The technical scheme adopted by the invention is as follows: a diversified image description sentence generation technology based on deep learning, characterized in that it comprises a decoding block consisting of multi-style decoder multi-head self-attention, multi-style encoder-decoder multi-head attention, and a feed-forward neural network. Compared with the encoder, the decoding block adds a multi-style encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) sub-layer and a style parameter matrix for image description sentence inputs of different styles. As shown in fig. 3, the input sequence of each decoding block first enters the decoder multi-head self-attention, where the self-attention mechanism and the "addition and normalization layer" are computed. The resulting output is then fed, together with the output of the top-layer encoder, into the newly added multi-style encoder-decoder multi-head attention;
MultiHead(D, E, E) = Concat(head_1, ..., head_h)  (1)
head_i = Attention(D·W_i^D, E·W_i^E, E·W_i^E)  (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V·W_S  (3)
where E is the output of the top encoder and D is the output of the first "addition and normalization layer" of the decoding block. W_S is a style parameter matrix; as shown in fig. 3, it may be the realistic style matrix W_T, the romantic style matrix W_R, or the humorous style matrix W_H. In each decoding block, three style parameter matrices (realistic, romantic and humorous) are set. The three matrices are randomly initialized, and the overall model is trained with a multi-style image description dataset. Image descriptions of different style types are trained with their respective style parameter matrices, while the other parameters model the general factual descriptions in the text data and are shared across styles. To preserve the distinctiveness of the model's multi-style outputs, the Euclidean distances between the style parameter matrices are maximized during training, so that the image descriptions produced for different styles remain distinct.
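One way to read the distance-maximization requirement is as an auxiliary penalty that pushes the three style parameter matrices apart during training. The Python sketch below is an assumption about how such a regularizer could be written, not the patent's exact objective; the matrix size and the weighting of the term are illustrative:

```python
import itertools
import torch

def style_separation_penalty(style_matrices: list[torch.Tensor]) -> torch.Tensor:
    """Negative mean pairwise Euclidean distance between style matrices.

    Minimizing this penalty maximizes the distances, keeping the realistic,
    romantic and humorous matrices distinct from one another.
    """
    distances = [torch.norm(a - b) for a, b in itertools.combinations(style_matrices, 2)]
    return -torch.stack(distances).mean()

# Illustrative: three randomly initialized style matrices W_T, W_R, W_H.
d_model = 512
W_T, W_R, W_H = (torch.randn(d_model, d_model, requires_grad=True) for _ in range(3))
# In training, this term would be added (with some weight) to the captioning loss.
penalty = style_separation_penalty([W_T, W_R, W_H])
penalty.backward()
print(penalty.item())
```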
The formulas of fig. 2 are as follows:
In each encoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices of the i-th head. LayerNormalization denotes layer normalization. Y_1 is the output of the first addition and normalization layer, Y_2 is the output of the feed-forward neural network, and Y_3 is the output of the second addition and normalization layer. W_1, b_1, W_2 and b_2 are model parameters.
MultiHead(X, X, X) = Concat(head_1, ..., head_h)·W^O  (1)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)  (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V  (3)
Y_1 = LayerNormalization(MultiHead(X, X, X) + X)  (4)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2  (5)
Y_3 = LayerNormalization(Y_2 + Y_1)  (6)
Encoding block first part: equations 1, 2 and 3 are the encoder multi-head self-attention.
Encoding block second part: equation 4 is the first addition and normalization.
Encoding block third part: equation 5 is the feed-forward neural network.
Encoding block fourth part: equation 6 is the second addition and normalization.
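As a minimal sketch, one such encoding block can be written in PyTorch as follows; the library's built-in multi-head attention stands in for the explicit W_i^Q, W_i^K, W_i^V and W^O matrices of equations 1 to 3, and the model width, head count and feed-forward size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoder block: self-attention (eqs. 1-3), add & norm (eq. 4),
    feed-forward network (eq. 5), add & norm (eq. 6)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)          # eqs. 1-3
        y1 = self.norm1(attn_out + x)                  # eq. 4
        y2 = self.ffn(y1)                              # eq. 5
        return self.norm2(y2 + y1)                     # eq. 6

# 49 image regions, each encoded as a 512-dimensional feature vector.
features = torch.randn(1, 49, 512)
print(EncodingBlock()(features).shape)  # torch.Size([1, 49, 512])
```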
The formulas of fig. 3 are as follows:
In each decoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices of the i-th head. E is the output of the sixth-layer encoding block. W_S is a style parameter matrix, one of the true style matrix W_T, the romantic style matrix W_R and the humorous style matrix W_H. D is the output of the first addition and normalization layer, Y_1 is the output of the second addition and normalization layer, Y_2 is the output of the feed-forward neural network, and Y_3 is the output of the third addition and normalization layer. W_1, b_1, W_2 and b_2 are model parameters.
MultiHead(X, X, X) = Concat(head_1, ..., head_h)·W^O  (7)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)  (8)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V·W_S  (9)
D = LayerNormalization(MultiHead(X, X, X) + X)  (10)
MultiHead(D, E, E) = Concat(head_1, ..., head_h)·W^O  (11)
head_i = Attention(D·W_i^Q, E·W_i^K, E·W_i^V)  (12)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V·W_S  (13)
Y_1 = LayerNormalization(MultiHead(D, E, E) + D)  (14)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2  (15)
Y_3 = LayerNormalization(Y_2 + Y_1)  (16)
Decoding block first part: equations 7, 8 and 9 are the multi-style decoder multi-head self-attention.
Decoding block second part: equation 10 is the first addition and normalization.
Decoding block third part: equations 11, 12 and 13 are the multi-style encoder-decoder multi-head attention.
Decoding block fourth part: equation 14 is the second addition and normalization.
Decoding block fifth part: equation 15 is the feed-forward neural network.
Decoding block sixth part: equation 16 is the third addition and normalization.
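The decoding block can be sketched in the same way. In the sketch below, the style parameter matrix W_S is applied as an extra per-style linear map after each attention sub-layer, which is one plausible reading of equations 7 to 16 rather than the patent's exact placement of W_S; dimensions and style names are illustrative assumptions, and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

STYLES = ("realistic", "romantic", "humorous")

class MultiStyleDecodingBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One style parameter matrix per style (W_T, W_R, W_H); all other layers are shared.
        self.style_matrices = nn.ParameterDict({
            s: nn.Parameter(torch.eye(d_model) + 0.01 * torch.randn(d_model, d_model))
            for s in STYLES
        })
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, enc_out: torch.Tensor, style: str) -> torch.Tensor:
        w_s = self.style_matrices[style]
        sa, _ = self.self_attn(x, x, x)                # eqs. 7-9: multi-style self-attention
        d = self.norm1(sa @ w_s + x)                   # eq. 10: first add & norm
        ca, _ = self.cross_attn(d, enc_out, enc_out)   # eqs. 11-13: encoder-decoder attention
        y1 = self.norm2(ca @ w_s + d)                  # eq. 14: second add & norm
        y2 = self.ffn(y1)                              # eq. 15: feed-forward network
        return self.norm3(y2 + y1)                     # eq. 16: third add & norm

tokens = torch.randn(1, 12, 512)      # embedded caption prefix
enc_out = torch.randn(1, 49, 512)     # output of the top encoding block
print(MultiStyleDecodingBlock()(tokens, enc_out, "romantic").shape)
```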
The invention has the beneficial effects that: whereas conventional deep-learning-based image description generation tends to produce broadly generalized descriptions, the present technology generates three styles of sentence descriptions for an image: realistic, romantic and humorous.
At the model level: compared with traditional image description, the realistic, romantic and humorous sentence descriptions are generated by a single unified model, and the three styles of descriptions share part of the model's network parameters. This prevents the model from overfitting to a single style of description, makes the learned model more general, and effectively improves the utilization of the model's network parameters.
At the description generation effect level: the generated description sentences follow natural-language grammar and are highly readable, the generated image descriptions accurately capture the details of the image, and the image is described in multiple styles, giving the descriptions richness.
Description of the drawings:
in order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a general flow chart of image description statement generation.
Fig. 2 is an internal structural diagram of a single encoder block of fig. 4.
Fig. 3 is an internal structural diagram of the multi-style decoding block of fig. 4.
Fig. 4 is a diagram of a model structure generated based on image description statements of a self-attention mechanism.
In the drawings:
the specific embodiment is as follows:
In order to make the technical problems to be solved, the technical solutions and the beneficial effects clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1:
FIG. 1 is a general flow chart of image description statement generation. The specific steps are as follows:
1) A real-world image file is acquired.
2) The image file is first matrixed. In the matrix, each element represents the content information of the corresponding position of the picture; the number of matrices and the relationship between values at the same position in different matrices depend on the color type of the picture.
3) To accelerate the convergence of the image description generation model, the matrixed image file is mapped to [0, 1] and standardized (see the preprocessing sketch after this list).
4) The standardized image matrix is input into the deep convolutional neural network.
5) High-dimensional semantic features of the image are obtained through the multi-level feature extraction of the deep convolutional neural network.
6) The high-dimensional semantic features of the image are input into the encoder, and more abstract depth image semantic features are obtained through multi-level encoding.
7) The depth image semantic features are input to the multi-style decoder; the Euclidean distances between the designed multi-style parameter matrices are maximized to ensure the distinctiveness of the model's multi-style outputs, while the remaining parameters are shared so as to model the general factual descriptions in the text data.
8) The effect of generating multi-style image descriptions is achieved by switching the parameter matrix of the multi-style decoder.
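Steps 2) to 4) can be illustrated with a small preprocessing sketch in Python; the per-channel standardization shown here is one plausible reading of the description, and the file name is hypothetical:

```python
import numpy as np
from PIL import Image

def preprocess_image(path: str) -> np.ndarray:
    """Matrixize an image file, map it to [0, 1] and standardize per channel."""
    matrix = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)  # step 2: H x W x 3 matrix
    matrix /= 255.0                                                          # step 3: map to [0, 1]
    mean = matrix.mean(axis=(0, 1), keepdims=True)
    std = matrix.std(axis=(0, 1), keepdims=True) + 1e-6
    return (matrix - mean) / std                                             # step 3: standardization

# matrix = preprocess_image("street_scene.jpg")  # hypothetical file from step 1
```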
Fig. 1 is divided into 8 parts in total. The encoder of the sixth part in fig. 1 consists of the encoding blocks 1-6 in fig. 4, and the multi-style decoder of the seventh part in fig. 1 consists of the multi-style decoding blocks 1-6 in fig. 4.
The internal structure of each encoding block, shown in fig. 2, consists of four parts: encoder multi-head self-attention, addition and normalization, feed-forward neural network, and addition and normalization.
For the internal structure of each decoding block, the method is composed of six parts including multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder decoder multi-head self-attention, addition and normalization, feedforward neural network and addition and normalization, as shown in fig. 3. Wherein the multi-headed self-attention portion of the decoded block contains parameter matrices of three different styles.
Typically, 6 or 12 coding blocks can be taken, and stacked deep coding blocks and decoding blocks facilitate deep semantic features of the model extraction of richer images and text.
Fig. 4: fig. 4 is an internal structural view of the 5th to 7th parts of fig. 1.
It comprises the following: 1) The high-dimensional semantic features of the image pass through encoding blocks 1 to 6 to obtain more abstract depth image semantic features. These depth semantic features are input to the "multi-style encoder-decoder multi-head attention" portion of the multi-style decoding blocks. 2) In the training of fig. 4, since multi-style image description sentences cannot be input into the model directly, they are first converted into dense feature vectors through word embedding. Because the model cannot directly learn the order of the input multi-style image description sentence, a position embedding module is added. The word embedding and the position embedding of the multi-style image description sentence are added together and then input to the multi-style decoder. 3) Finally, the predicted multi-style image description sentence is obtained through the stacked multi-style decoding blocks, a linear layer mapping, and a softmax layer.
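The caption-side input path and the output head described above can be sketched as follows; the vocabulary size, maximum sequence length and the use of a learned position embedding are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 10000, 50, 512

word_embed = nn.Embedding(vocab_size, d_model)       # dense vectors for caption tokens
pos_embed = nn.Embedding(max_len, d_model)           # learned position embedding
output_head = nn.Linear(d_model, vocab_size)         # maps decoder output back to the vocabulary

token_ids = torch.randint(0, vocab_size, (1, 12))    # a manually written, tokenized caption
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
decoder_input = word_embed(token_ids) + pos_embed(positions)   # word + position embedding, added

# ... stacked multi-style decoding blocks would transform decoder_input here ...
logits = output_head(decoder_input)
probs = torch.softmax(logits, dim=-1)                # softmax layer: next-word distribution
print(probs.shape)                                   # torch.Size([1, 12, 10000])
```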
The foregoing description is merely illustrative of specific embodiments of this invention, and is not intended to limit the spirit of the invention, since modifications and variations of the specific embodiments described above will become apparent to those skilled in the art in light of the disclosure herein, without departing from the spirit and scope of the invention.

Claims (2)

1. A diversified image description sentence generation method based on deep learning is characterized in that: 1) Acquiring an image file of a real world; 2) For an image file, firstly, matrixing is carried out, and each element in the matrix represents content information of a corresponding position of a picture; the number of the matrixes and the relation of the numerical values of the same positions of different matrixes depend on the color type of the picture; 3) In order to accelerate the convergence speed of the image description generation model, mapping data between [0-1] and standardization are carried out on the matrixed image description file; 4) The standardized image matrix is input into a deep convolutional neural network; 5) The high-dimensional semantic features of the image are obtained through multi-level feature extraction of the deep convolutional neural network; 6) Inputting the high-dimensional semantic features of the image into an encoder, and obtaining more abstract depth image semantic features through multi-level encoding; 7) The depth image semantic features are input to a multi-style decoder, the Euclidean distance between parameter matrixes is maximized through the designed multi-style parameter matrixes to ensure the difference of multi-style output of the model, and other parameters except the parameter matrixes are shared so as to model general fact description in text data; 8) The effect of generating the multi-style image description is achieved by changing the parameter matrix of the multi-style decoder.
2. The deep learning-based diversified image description sentence generation method as claimed in claim 1, characterized by comprising the steps of: 6 or 12 coding blocks can be taken, and the stacked deep coding blocks and decoding blocks are beneficial to the model to extract deep semantic features of richer images and texts; the decoding block comprises six parts of multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder decoder multi-head self-attention, addition and normalization, feedforward neural network and addition and normalization; wherein the multi-headed self-attention portion of the decoded block contains parameter matrices of three different styles.
CN202110758735.0A 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning Active CN113535999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Publications (2)

Publication Number Publication Date
CN113535999A CN113535999A (en) 2021-10-22
CN113535999B true CN113535999B (en) 2023-05-26

Family

ID=78126779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110758735.0A Active CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Country Status (1)

Country Link
CN (1) CN113535999B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988274B (en) * 2021-11-11 2023-05-12 电子科技大学 Text intelligent generation method based on deep learning
CN114511860B (en) * 2022-04-19 2022-06-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10360466B2 (en) * 2016-12-27 2019-07-23 Facebook, Inc. Systems and methods for image description generation
US11250252B2 (en) * 2019-12-03 2022-02-15 Adobe Inc. Simulated handwriting image generator

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image description model based on adaptive correction of attention features; Wei Renyu; Meng Zuqiang; Computer Applications (S1); full text *

Also Published As

Publication number Publication date
CN113535999A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN110209801B (en) Text abstract automatic generation method based on self-attention network
CN113535999B (en) Diversified image description sentence generation technology based on deep learning
CN109522403B (en) Abstract text generation method based on fusion coding
CN107480206A (en) A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN112084841B (en) Cross-mode image multi-style subtitle generating method and system
CN112233012B (en) Face generation system and method
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN112070114A (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN110175248A (en) A kind of Research on face image retrieval and device encoded based on deep learning and Hash
CN114549850A (en) Multi-modal image aesthetic quality evaluation method for solving modal loss problem
CN111985470A (en) Ship board correction and identification method in natural scene
CN115082693A (en) Multi-granularity multi-mode fused artwork image description generation method
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
CN113140023A (en) Text-to-image generation method and system based on space attention
CN112380843B (en) Random disturbance network-based open answer generation method
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN111522923B (en) Multi-round task type dialogue state tracking method
KR102562386B1 (en) Learning method for image synthesis system
CN117093692A (en) Multi-granularity image-text matching method and system based on depth fusion
CN115860054B (en) Sparse codebook multiple access coding and decoding system based on generation countermeasure network
CN115331073A (en) Image self-supervision learning method based on TransUnnet architecture
CN114116960A (en) Federated learning-based joint extraction model construction method and device
CN113377908B (en) Method for extracting aspect-level emotion triple based on learnable multi-word pair scorer
Grassucci et al. Enhancing Semantic Communication with Deep Generative Models--An ICASSP Special Session Overview

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant