CN113535999A - Diversified image description sentence generation technology based on deep learning - Google Patents

Diversified image description sentence generation technology based on deep learning

Info

Publication number
CN113535999A
CN113535999A CN202110758735.0A CN202110758735A CN113535999A
Authority
CN
China
Prior art keywords
image
model
decoder
image description
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110758735.0A
Other languages
Chinese (zh)
Other versions
CN113535999B (en)
Inventor
任磊
孟子豪
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202110758735.0A priority Critical patent/CN113535999B/en
Publication of CN113535999A publication Critical patent/CN113535999A/en
Application granted granted Critical
Publication of CN113535999B publication Critical patent/CN113535999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a diversified image description sentence generation technology based on deep learning, belonging to the technical field of image description sentence generation. The technology addresses two shortcomings of traditional image description sentence generation, namely that the generated sentences are monotonous and that image details are widely ignored, and is suitable for generating diversified descriptive sentences from an image.

Description

Diversified image description sentence generation technology based on deep learning
Technical field:
The invention relates to a diversified image description sentence generation technology based on deep learning, and belongs to the technical field of image description sentence generation.
Background art:
Currently, image description sentence generation based on deep learning is generally realized by constructing an "encoder-decoder" model. The encoder converts the numerical matrix of an image into a high-dimensional feature code rich in semantic information, and is typically implemented with a residual model based on a convolutional neural network. The decoder decodes these high-dimensional features and feeds their semantic information into a text generation model to obtain descriptive sentences; decoders commonly adopt one of two structures for text generation, a long short-term memory (LSTM) network or a Transformer-based self-attention mechanism. The model loss is usually computed against multiple reference sentences, with cross entropy as the loss function. However, a model trained this way tends to produce broad, generic sentences at inference time: for two images with similar scenes but different details, it generates the same general description, which ignores the details in the images and cannot satisfy the requirement of generating diversified description sentences.
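For concreteness, the conventional training objective described above can be sketched in PyTorch as follows; the tensor names, shapes and vocabulary size are illustrative assumptions and are not taken from this disclosure.

import torch
import torch.nn.functional as F

# Illustrative shapes only: a batch of 2 captions, 7 tokens each, a vocabulary of 1000 words.
logits = torch.randn(2, 7, 1000, requires_grad=True)   # decoder outputs, one distribution per token position
reference = torch.randint(0, 1000, (2, 7))              # token ids of a human-written reference caption

# Teacher-forced objective: cross entropy averaged over all token positions.
loss = F.cross_entropy(logits.reshape(-1, 1000), reference.reshape(-1))
loss.backward()   # in a real system this gradient would update the encoder-decoder parameters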
In view of the above problems, the present invention provides a diversified image description sentence generation technique based on deep learning.
Summary of the invention:
The invention aims to provide a diversified image description sentence generation technology based on deep learning that solves the problems of traditional techniques, namely that the generated sentences are monotonous and image details are widely ignored, and that is suitable for generating diversified descriptive sentences from an image.
Convolutional neural networks are widely used in computer vision tasks. A deep convolutional neural network extracts image features and is generally built from convolution layers (Convolution Layer) and pooling layers (Pooling Layer).
Pictures are usually stored in a computer as numerical matrices, in which each element represents the content information at the corresponding position of the picture. The number of matrices, and the relationship between the values at the same position in different matrices, depend on the color type of the picture. For this numerical matrix, each convolution layer of the convolutional neural network applies a set of convolution kernels (filters) that detect specific features and map the original input picture into a high-dimensional feature space to form an output matrix.
The output matrix of the convolution layer is handed to the pooling layer for a "downsampling" operation that reduces the matrix size. The pooling layer divides the convolution output matrix into regions, applies a nonlinear pooling function to extract the relative positions of the different features in the convolution output, and concatenates the results to form the pooling layer output.
Generally, convolution and pooling layers are stacked and reused, continuously shrinking the originally large numerical matrix of the input picture so as to extract specific features. However, as the network depth increases, problems such as vanishing and exploding gradients limit training performance. To address this, the deep residual network was proposed. A deep residual network also stacks multiple convolution-pooling layers, but each sub-unit contains a skip connection (Skip Connection) structure to relieve the vanishing-gradient problem caused by the increased network depth.
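A minimal sketch of such a residual sub-unit with a skip connection, assuming PyTorch and illustrative channel and image sizes (not taken from this disclosure), is:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual sub-unit: two convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(y + x)   # the skip connection lets gradients bypass the convolutions

x = torch.randn(1, 64, 56, 56)        # illustrative feature map
print(ResidualBlock(64)(x).shape)     # torch.Size([1, 64, 56, 56])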
In the invention, a ResNet-101 model, a standard configuration of the residual network, is used to extract image features. As shown in FIG. 1, the image file is preprocessed and then passed through the deep convolutional neural network to form the high-dimensional semantic feature code of the image.
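A hypothetical sketch of using a ResNet-101 backbone as the feature extractor, with the classification head removed so that the spatial feature map is kept, might look as follows (PyTorch/torchvision; the image size and the exact layers kept are assumptions, not the patented implementation):

import torch
import torchvision

# Build a ResNet-101 backbone and drop its final average-pooling and classification layers,
# keeping the spatial feature map that serves as the high-dimensional semantic feature code.
resnet = torchvision.models.resnet101()        # pretrained ImageNet weights would normally be loaded here
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

image = torch.randn(1, 3, 224, 224)            # a preprocessed image matrix (illustrative size)
with torch.no_grad():
    features = backbone(image)
print(features.shape)                          # torch.Size([1, 2048, 7, 7])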
The image description sentence generation model based on the self-attention mechanism likewise follows the encoder-decoder design; the overall structure is shown in FIG. 4. The whole model is trained with supervised learning and can be divided into two parts according to their inputs: an encoder that takes as input the high-dimensional semantic features of the image extracted by the deep convolutional neural network, and a decoder that takes as input the embedded sequence of manually written image description sentences.
On the one hand, the high-dimensional semantic features of the image are fed directly into the first encoding block of the encoder; the structure of an encoding block is shown in fig. 2. The output of each encoding block is the input of the next, and the final output is produced by the sixth encoding block.
On the other hand, a manually written image description sentence is fed into the decoder after word embedding and position embedding. The detailed design and key points of the invention are as follows: compared with a conventional decoder, the decoder designed here not only adds an encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) relative to the encoder, but also adds a style parameter matrix for the input of image description sentences of different styles.
The technical scheme adopted by the invention is as follows. A diversified image description sentence generation technology based on deep learning is characterized by a decoding block comprising a multi-style decoder multi-head attention, a multi-style encoder-decoder multi-head attention and a feedforward neural network; compared with the encoder, the decoding block adds an encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) and a style parameter matrix for the input of image description sentences of different styles. The input sequence of each decoding-block layer first enters the decoding-block multi-head attention shown in fig. 3 and performs the self-attention mechanism and the computation of the "add and normalize" layer. The result is then fed, together with the output of the top-level encoder, into the newly added multi-style encoder-decoder multi-head attention:
Multi_Head(D, E, E) = concat(head_1, ..., head_h)    (1)
head_i = Attention(D·W_i^D, E·W_i^E, E·W_i^E)    (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (3)
Here E is the output of the top-level encoder, and D is the output of the addition and normalization layer applied to the self-attention of this decoding unit. W^S is a style parameter matrix, which may be W^T, the factual style matrix, W^R, the romantic style matrix, or W^H, the humorous style matrix, as shown in fig. 3. In each decoding block, three style parameter matrices (factual, romantic and humorous) can be set. The three matrices are initialized randomly, and the overall model is trained on a multi-style image description dataset. Image descriptions of different style types are trained with different style parameter matrices, while the remaining parameters, which are designed to model the general factual content of the text data, are shared across style types. During training, in order to preserve the difference between the model's multi-style outputs, the Euclidean distances between the style parameter matrices must be maximized, so that the image description outputs remain distinct and unique.
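The exact form of the distance-maximizing term is not given above; one possible formulation, stated here purely as an assumption, is to add the negative sum of pairwise Euclidean distances between the three style matrices to the training loss, as in the following PyTorch sketch (matrix shape is illustrative):

import torch
import torch.nn as nn

def style_diversity_penalty(w_t, w_r, w_h):
    """Negative sum of pairwise Euclidean distances between the three style matrices.
    Adding this term to the training loss pushes the matrices apart, i.e. it maximizes
    the distances, so the factual, romantic and humorous outputs stay distinct."""
    pairs = [(w_t, w_r), (w_t, w_h), (w_r, w_h)]
    return -sum(torch.dist(a, b) for a, b in pairs)

# Three randomly initialised style parameter matrices (the shape is an assumption).
w_true, w_romantic, w_humorous = (nn.Parameter(torch.randn(512, 512)) for _ in range(3))
penalty = style_diversity_penalty(w_true, w_romantic, w_humorous)   # added to the overall loss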
The formulas for fig. 2 are as follows:
In each layer of the encoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices. LayerNormalization denotes layer normalization. Y_1 is the output of the first addition and normalization layer, Y_2 is the output of the feedforward neural network, and Y_3 is the output of the second addition and normalization layer. W_1 and b_1 are model parameters.
MultiHead(X) = concat(head_1, ..., head_h)·W^O    (1)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)    (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (3)
Y_1 = LayerNormalization(X + MultiHead(X))    (4)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2    (5)
Y_3 = LayerNormalization(Y_2 + Y_1)    (6)
Encoding block, first part: equations 1, 2 and 3 are the encoder multi-head self-attention.
Encoding block, second part: equation 4 is the first addition and normalization.
Encoding block, third part: equation 5 is the feedforward neural network.
Encoding block, fourth part: equation 6 is the second addition and normalization.
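A compact sketch of one encoding block implementing equations (1)-(6) is given below; it assumes PyTorch and illustrative dimensions (d_model = 512, 8 heads, d_ff = 2048) and is an illustration rather than the exact implementation of the disclosure.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoding block: multi-head self-attention, add & normalize,
    feedforward network, add & normalize (equations (1)-(6))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # equations (1)-(3)
        y1 = self.norm1(x + attn_out)           # equation (4)
        y2 = self.ff(y1)                        # equation (5)
        return self.norm2(y1 + y2)              # equation (6)

features = torch.randn(1, 49, 512)      # e.g. a 7x7 image feature map flattened to 49 positions
print(EncoderBlock()(features).shape)   # torch.Size([1, 49, 512])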
The formulas for FIG. 3 are as follows:
In each layer of the decoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices. Let E be the output of the sixth encoding block. W^S is a style parameter matrix, which may be W^T, the factual style matrix, W^R, the romantic style matrix, or W^H, the humorous style matrix. D is the output of the first addition and normalization layer, Y_1 is the output of the second addition and normalization layer, Y_2 is the output of the feedforward neural network, and Y_3 is the output of the third addition and normalization layer. W_1 and b_1 are model parameters.
MultiHead(X) = concat(head_1, ..., head_h)·W^O    (7)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)    (8)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (9)
D = LayerNormalization(X + MultiHead(X))    (10)
MultiHead(D, E, E) = concat(head_1, ..., head_h)    (11)
head_i = Attention(D·W_i^Q, E·W_i^K, E·W_i^V)    (12)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (13)
Y_1 = LayerNormalization(D + MultiHead(D, E, E))    (14)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2    (15)
Y_3 = LayerNormalization(Y_2 + Y_1)    (16)
Decoding block, first part: equations 7, 8 and 9 are the multi-style decoder multi-head self-attention.
Decoding block, second part: equation 10 is the first addition and normalization.
Decoding block, third part: equations 11, 12 and 13 are the multi-style encoder-decoder multi-head attention.
Decoding block, fourth part: equation 14 is the second addition and normalization.
Decoding block, fifth part: equation 15 is the feedforward neural network.
Decoding block, sixth part: equation 16 is the third addition and normalization.
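The following sketch assembles one multi-style decoding block along the lines of equations (7)-(16). It assumes PyTorch with illustrative dimensions; in particular, the point at which the style parameter matrix W^S is applied (here, after the encoder-decoder attention) and the omission of a causal mask are assumptions, since the disclosure specifies the role of W^S but not a closed formula.

import torch
import torch.nn as nn

class MultiStyleDecoderBlock(nn.Module):
    """One multi-style decoding block along the lines of equations (7)-(16)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One randomly initialised style parameter matrix per style, as described above.
        self.style = nn.ParameterDict({
            name: nn.Parameter(torch.randn(d_model, d_model) * 0.02)
            for name in ("factual", "romantic", "humorous")
        })
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, style="factual"):
        # A causal mask over future tokens is omitted here for brevity.
        sa, _ = self.self_attn(x, x, x)                  # equations (7)-(9)
        d = self.norm1(x + sa)                           # equation (10)
        ca, _ = self.cross_attn(d, enc_out, enc_out)     # equations (11)-(13)
        ca = ca @ self.style[style]                      # style parameter matrix W^S (assumed placement)
        y1 = self.norm2(d + ca)                          # equation (14)
        y2 = self.ff(y1)                                 # equation (15)
        return self.norm3(y1 + y2)                       # equation (16)

tokens = torch.randn(1, 12, 512)   # embedded caption tokens
enc = torch.randn(1, 49, 512)      # output of the top encoding block
print(MultiStyleDecoderBlock()(tokens, enc, style="romantic").shape)   # torch.Size([1, 12, 512])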
The invention has the following beneficial effects. Whereas the traditional deep-learning-based image description generation technology produces broad and general descriptions, with the present invention the description of an image becomes three sentence descriptions: factual, romantic and humorous.
At the model level: compared with traditional image description, the generation of the three sentence descriptions (factual, romantic and humorous) is unified in one model. The three different sentence descriptions share the network parameters of the common part of the model, which prevents the model from over-fitting to the sentence description of a single style, gives the learned model generality, and effectively improves the utilization of the model's network parameters.
At the level of the generated descriptions: the generated description sentences follow natural-language grammar with good readability and grammatical correctness, and the generated image descriptions accurately capture the details of the image, describe the image in multiple styles, and are rich in description.
Description of the drawings:
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the embodiments or the prior-art description are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a general flow diagram of image description sentence generation.
Fig. 2 is an internal structural diagram of a single encoder block in fig. 4.
Fig. 3 is an internal structural diagram of the multi-style decoding block of fig. 4.
Fig. 4 is a diagram of a model structure generated based on an image description statement of the self-attention mechanism.
In the drawings:
the specific implementation mode is as follows:
in order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1:
FIG. 1 is a general flow diagram of image description sentence generation. The specific steps are as follows:
1) An image file of the real world is acquired. 2) The image file is first matrixed; in the matrix, each element represents the content information at the corresponding position of the picture. The number of matrices and the relationship between values at the same position in different matrices depend on the color type of the picture. 3) To accelerate the convergence of the image description generation model, the matrixed image file is mapped to data in [0, 1] and standardized. 4) The standardized image matrix is input to the deep convolutional neural network. 5) The multi-level features of the deep convolutional neural network are extracted to obtain the high-dimensional semantic features of the image. 6) The high-dimensional semantic features of the image are input to the encoder, and more abstract depth image semantic features are obtained through multi-level encoding. 7) The depth image semantic features are input to the multi-style decoder; through the designed multi-style parameter matrices, the Euclidean distance between the parameter matrices is maximized to ensure the difference between the model's multi-style outputs, while the other parameters, which model the general factual descriptions in the text data, are shared. 8) Multi-style image descriptions are generated by switching the parameter matrix of the multi-style decoder.
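Steps 2) to 4) can be sketched, purely as an assumption of one possible preprocessing pipeline (PyTorch/torchvision; the file name, image size and ImageNet statistics are illustrative), as:

from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),                                    # matrixing + mapping of values into [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],          # standardisation (ImageNet statistics assumed)
                std=[0.229, 0.224, 0.225]),
])
image_matrix = preprocess(Image.open("example.jpg").convert("RGB"))   # "example.jpg" is a placeholder path
print(image_matrix.shape)   # torch.Size([3, 224, 224])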
Fig. 1 is divided into 8 parts in total. The encoder in the sixth part of fig. 1 consists of the encoding blocks 1-6 of fig. 4, while the multi-style decoder in the seventh part of fig. 1 consists of the multi-style decoding blocks 1-6 of fig. 4.
The internal structure of each encoding block consists of the four parts shown in figure 2: encoder multi-head self-attention, addition and normalization, feedforward neural network, and addition and normalization.
The internal structure of each decoding block consists of the six parts shown in figure 3: multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder-decoder multi-head attention, addition and normalization, feedforward neural network, and addition and normalization. Among these, the multi-head attention of the decoding block contains three parameter matrices of different styles.
Generally speaking, 6 or 12 encoding blocks can be used; stacking deep encoding and decoding blocks helps the model extract richer deep semantic features of images and texts.
FIG. 2 is an internal structural view of parts 5 to 7 in fig. 1.
The method comprises the following steps: 1) The high-dimensional semantic features of the image are encoded by encoding blocks 1-6 to obtain more abstract depth image semantic features. These depth semantic features are the input of the "multi-style encoder-decoder multi-head attention" part of each multi-style decoding block. 2) During the training of fig. 4, the multi-style image description sentences cannot be input into the model directly, so each is first converted into dense feature vectors by word embedding. Because the model cannot directly learn the order of the input multi-style image description sentences, a position embedding module is added. The word embedding and the position embedding of the multi-style image description sentence are summed and input to the multi-style decoder. 3) Finally, the predicted multi-style image description sentence is obtained through the stacked multi-style decoding blocks, a linear layer mapping and a softmax layer.
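Steps 2) and 3) can be illustrated with the following sketch of word embedding, position embedding and the final linear plus softmax mapping; the vocabulary size, sequence length, model width and the use of a learned position embedding are assumptions for illustration only.

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 10000, 50, 512
word_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)       # learned position embedding (one assumed variant)
output_head = nn.Linear(d_model, vocab_size)

token_ids = torch.randint(0, vocab_size, (1, 12))            # one caption as 12 token ids
positions = torch.arange(12).unsqueeze(0)                    # position indices 0..11
decoder_input = word_emb(token_ids) + pos_emb(positions)     # fed to the multi-style decoder stack

decoder_output = torch.randn(1, 12, d_model)                 # stand-in for the decoding blocks' output
probs = torch.softmax(output_head(decoder_output), dim=-1)   # linear layer + softmax ("soft maximum") layer
print(probs.shape)   # torch.Size([1, 12, 10000])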
The foregoing describes embodiments of the present invention and is not intended to limit the scope of the invention, which is defined by the claims appended hereto.

Claims (2)

1. A diversified image description sentence generation method based on deep learning, characterized in that: 1) an image file of the real world is acquired; 2) the image file is first matrixed, each element of the matrix representing the content information at the corresponding position of the picture, and the number of matrices and the relationship between values at the same position in different matrices being determined by the color type of the picture; 3) to accelerate the convergence of the image description generation model, the matrixed image file is mapped to data in [0, 1] and standardized; 4) the standardized image matrix is input to a deep convolutional neural network; 5) multi-level features of the deep convolutional neural network are extracted to obtain high-dimensional semantic features of the image; 6) the high-dimensional semantic features of the image are input to an encoder, and more abstract depth image semantic features are obtained through multi-level encoding; 7) the depth image semantic features are input to a multi-style decoder, where, through the designed multi-style parameter matrices, the Euclidean distance between the parameter matrices is maximized to ensure the difference between the model's multi-style outputs, while the other parameters are shared to model general factual descriptions in the text data; 8) multi-style image descriptions are generated by changing the parameter matrix of the multi-style decoder.
2. The method for generating diversified image description sentences based on deep learning according to claim 1, characterized in that: 6 or 12 encoding blocks can be selected, and stacked deep encoding and decoding blocks help the model extract richer deep semantic features of images and texts; the decoding block comprises six parts, namely multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder-decoder multi-head attention, addition and normalization, feedforward neural network, and addition and normalization; and the multi-head attention of the decoding block contains three parameter matrices of different styles.
CN202110758735.0A 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning Active CN113535999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Publications (2)

Publication Number Publication Date
CN113535999A true CN113535999A (en) 2021-10-22
CN113535999B CN113535999B (en) 2023-05-26

Family

ID=78126779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110758735.0A Active CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Country Status (1)

Country Link
CN (1) CN113535999B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988274A (en) * 2021-11-11 2022-01-28 电子科技大学 Text intelligent generation method based on deep learning
CN114511860A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181832A1 (en) * 2016-12-27 2018-06-28 Facebook, Inc. Systems and methods for image description generation
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification
US20210166013A1 (en) * 2019-12-03 2021-06-03 Adobe Inc. Simulated handwriting image generator

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181832A1 (en) * 2016-12-27 2018-06-28 Facebook, Inc. Systems and methods for image description generation
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification
US20210166013A1 (en) * 2019-12-03 2021-06-03 Adobe Inc. Simulated handwriting image generator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韦人予; 蒙祖强: "Image description model based on adaptive correction of attention features", 计算机应用 (Journal of Computer Applications) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988274A (en) * 2021-11-11 2022-01-28 电子科技大学 Text intelligent generation method based on deep learning
CN113988274B (en) * 2021-11-11 2023-05-12 电子科技大学 Text intelligent generation method based on deep learning
CN114511860A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN113535999B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN110795556B (en) Abstract generation method based on fine-grained plug-in decoding
CN109522403B (en) Abstract text generation method based on fusion coding
CN113535999B (en) Diversified image description sentence generation technology based on deep learning
CN107480206A (en) A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN112100404A (en) Knowledge graph pre-training method based on structured context information
CN110196928B (en) Fully parallelized end-to-end multi-turn dialogue system with domain expansibility and method
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN115424059B (en) Remote sensing land utilization classification method based on pixel level contrast learning
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN112395842B (en) Long text story generation method and system for improving content consistency
CN111985470A (en) Ship board correction and identification method in natural scene
CN114420107A (en) Speech recognition method based on non-autoregressive model and related equipment
CN113140023A (en) Text-to-image generation method and system based on space attention
CN114969298A (en) Video question-answering method based on cross-modal heterogeneous graph neural network
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
CN112257464B (en) Machine translation decoding acceleration method based on small intelligent mobile equipment
CN111522923B (en) Multi-round task type dialogue state tracking method
CN112380843B (en) Random disturbance network-based open answer generation method
CN113377907B (en) End-to-end task type dialogue system based on memory mask self-attention network
CN114333069B (en) Object posture processing method, device, equipment and storage medium
CN114399646B (en) Image description method and device based on transform structure
CN115331073A (en) Image self-supervision learning method based on TransUnnet architecture
CN115273110A (en) Text recognition model deployment method, device, equipment and storage medium based on TensorRT
CN112100157B (en) Cross-platform multidimensional database architecture design method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant