CN116665012B - Automatic generation method and device for image captions and storage medium - Google Patents
Automatic generation method and device for image captions and storage medium
- Publication number
- CN116665012B (application CN202310680080.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- vector
- processing
- result
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of natural language processing, and in particular discloses an automatic image caption generation method, an automatic image caption generation device, and a computer storage medium. The automatic image caption generation method comprises the following steps: acquiring an image whose caption is to be generated, and processing the image to obtain vector image features; inputting the vector image features to an encoder for prior knowledge construction, and obtaining effective image features; inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text, and obtaining an interaction result; generating a text sequence according to the interaction result; and converting the text sequence to obtain an image caption, and outputting the image caption. The automatic image caption generation method can reduce the deviation between the image caption and the content actually expressed by the image.
Description
Technical Field
The present invention relates to the field of natural language processing, and in particular, to an automatic image caption generation method, an automatic image caption generation device, and a computer storage medium.
Background
Image captioning, simply understood, means describing the content of an image in a sentence: it is the task of describing the visual content of an image in natural language. The prior art typically accomplishes this with recurrent neural networks, using an RNN or LSTM to model the order of language.
However, models built in this manner are prone to noise interference when extracting image content and generating text, so the generated caption deviates considerably from what the image actually expresses. In addition, the language models of the prior art are poorly suited to caption generation, which makes this deviation even more pronounced.
Therefore, how to reduce the deviation between the caption and the content actually expressed by the image is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides an automatic image caption generation method, an automatic image caption generation device, and a computer storage medium, which solve the problem in the related art of large deviation between an image caption and the content actually expressed by the image.
As a first aspect of the present invention, there is provided an automatic image caption generation method, comprising:
acquiring an image whose caption is to be generated, and processing the image to obtain vector image features;
inputting the vector image features to an encoder for prior knowledge construction, and obtaining effective image features;
inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text, and obtaining an interaction result;
generating a text sequence according to the interaction result;
and converting the text sequence to obtain an image caption, and outputting the image caption.
Further, processing the image whose caption is to be generated to obtain vector image features includes:
vector processing and preliminary feature extraction are carried out on the image to obtain preliminary image features;
and carrying out target feature detection processing on the preliminary image features to obtain vector image features.
Further, inputting the vector image features to an encoder for prior knowledge construction and obtaining effective image features includes:
performing first feedforward full-connection layer processing on the vector image characteristics to obtain a first processing result;
carrying out priori knowledge construction according to the first processing result to obtain a priori knowledge construction result;
constructing a multi-head attention mechanism according to the prior knowledge construction result to obtain a multi-head attention mechanism construction result;
and carrying out second feedforward full-connection layer processing on the construction result of the multi-head attention mechanism to obtain effective image characteristics.
Further, performing prior knowledge construction according to the first processing result to obtain a prior knowledge construction result, including:
dividing the first processing result into a retrieval feature, a value feature and a key feature, which are identical to one another, and performing standardization processing on the divided features to obtain a standardized processing result;
generating first priori knowledge, and carrying out cross processing on the search features and key features after the standardized processing and the first priori knowledge to obtain attention features;
and generating second priori knowledge, and carrying out cross processing on the normalized value characteristics, the normalized attention characteristics and the second priori knowledge to obtain a priori knowledge construction result.
Further, constructing a multi-head attention mechanism according to the prior knowledge construction result to obtain a multi-head attention mechanism construction result, including:
and carrying out multidimensional vector processing on the priori knowledge construction result to highlight effective features of the image, and obtaining a multi-head attention mechanism construction result.
Further, inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text to obtain an interaction result includes:
acquiring an image description text;
vectorizing the image description text to obtain an image description text vector;
inputting the image description text vector to a decoder for multi-head attention mechanism processing to obtain an input sequence vector;
and carrying out cross connection processing on the input sequence vector and the effective image characteristics input to the decoder to obtain an interaction result.
Further, inputting the image description text vector to a decoder for multi-head attention mechanism processing to obtain an input sequence vector, including:
dividing the image description text vector into a query vector, a key vector and a value vector, wherein the query vector, the key vector and the value vector are identical;
performing cross processing on the query vector and the key vector to obtain a cross processing result;
and performing cross processing according to the value vector and the weighted result to obtain an input sequence vector.
Further, generating a text sequence according to the interaction result comprises the following steps:
performing gating unit processing on the interaction result to obtain a gating unit processing result;
and carrying out feedforward full-connection layer processing on the processing result of the gating unit to obtain a text sequence.
As another aspect of the present invention, there is provided an automatic image caption generation device for implementing the automatic image caption generation method described above, wherein the automatic image caption generation device includes:
the acquisition module is used for acquiring an image whose caption is to be generated and processing the image to obtain vector image features;
the construction module is used for inputting the vector image features to the encoder for prior knowledge construction and obtaining effective image features;
the interaction module is used for inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text, obtaining an interaction result;
the generation module is used for generating a text sequence according to the interaction result;
and the output module is used for converting the text sequence to obtain an image caption and outputting the image caption.
As another aspect of the present invention, there is provided a computer storage medium storing at least one program instruction that is loaded and executed by a processor to implement the aforementioned automatic image caption generation method.
According to the automatic image caption generation method provided by the invention, the image whose caption is to be generated is processed and converted into vector image features; prior knowledge construction is carried out in the encoder to obtain effective image features; the effective image features then undergo multimodal interactive training with the image description text in the decoder, and the image caption is obtained according to the final interaction result. Because prior knowledge construction is performed on the vector image features, the content the image is intended to express can be deeply associated, so that more accurate effective image features are obtained; the subsequent interactive training of these effective image features with the image description text in the decoder can effectively improve the accuracy and diversity of caption generation. Therefore, the automatic image caption generation method provided by the embodiment of the invention can deeply understand the image content and extract rich semantic information, so as to generate more accurate and richer caption descriptions.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and constitute a part of this specification; they illustrate the invention and, together with the description, serve to explain it without limiting it.
Fig. 1 is a flowchart of the automatic image caption generation method provided by the present invention.
Fig. 2 is an overall framework diagram of automatic image caption generation provided by the present invention.
Fig. 3 is a flowchart for obtaining effective features of an image according to the present invention.
FIG. 4 is a prior knowledge building flow chart provided by the present invention.
Fig. 5 is a prior knowledge model diagram provided by the present invention.
FIG. 6 is a diagram of images, text, and cross-attention models between images and text provided by the present invention.
Fig. 7 is a flowchart of obtaining an interaction result provided by the present invention.
Fig. 8 is a comparison between captions generated by the automatic caption generation method of the present invention and captions generated by a prior-art method.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this embodiment, an automatic image caption generation method is provided. Fig. 1 is a flowchart of the automatic image caption generation method provided according to an embodiment of the present invention; as shown in fig. 1, the method includes:
s100, acquiring a subtitle image to be generated, and processing the subtitle image to be generated to obtain vector image characteristics;
in the embodiment of the invention, vectorization processing is carried out on the subtitle image to be generated so as to obtain a group of vector image characteristics with specific dimensions.
S200, inputting the vector image features to an encoder for priori knowledge construction, and obtaining image effective features;
Specifically, the embodiment of the invention takes a three-layer encoder as an example: three image encoder layers are constructed, and each layer performs the same processing, namely prior knowledge construction, so as to obtain the effective image features of the target area.
As shown in fig. 2, which is a schematic diagram of the three-layer encoder according to an embodiment of the present invention, each encoder layer performs prior knowledge construction and outputs effective image features.
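For ease of understanding only, the stacking of identical encoder layers may be sketched in Python/PyTorch as follows; this is an illustrative sketch, not the claimed implementation, and the class name, layer factory interface and layer count are assumptions:

```python
import torch.nn as nn

class EncoderStack(nn.Module):
    """Illustrative sketch: N identical encoder layers, each performing the
    same prior-knowledge construction; every layer's output is retained so
    that the decoder can later cross-attend to all levels."""
    def __init__(self, make_layer, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([make_layer() for _ in range(n_layers)])

    def forward(self, x):
        outputs = []
        for layer in self.layers:
            x = layer(x)         # each layer outputs effective image features
            outputs.append(x)
        return outputs           # one feature tensor per encoder layer
```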
S300, inputting the image effective features to a decoder so as to enable multi-mode interaction between the image effective features and an image description text to be carried out, and obtaining an interaction result;
In the embodiment of the invention, a three-layer decoder is taken as an example: a three-layer decoder is constructed, and each decoder layer performs the same processing, namely multimodal interaction between the effective image features and the image description text, to obtain an interaction result. Each decoder layer outputs a text sequence to the next decoder layer; after the last decoder layer outputs its text sequence, that sequence is converted to obtain the image caption.
Fig. 2 also shows a schematic diagram of the three-layer decoder according to an embodiment of the invention; each decoder layer performs the interactive processing of the image description text and the effective image features.
S400, generating a text sequence according to the interaction result;
In the embodiment of the invention, a text sequence is generated according to the result of the interactive training between the image description text and the effective image features. The text sequence is still in vector form.
S500, converting the text sequence to obtain an image caption, and outputting the image caption.
The text sequence in vector form is converted to obtain the image caption, which is then output.
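As a minimal illustrative sketch of this conversion step (the `logits`, `id2word`, `pad_id` and `eos_id` names are assumptions, since the embodiment does not fix a vocabulary interface):

```python
def ids_to_caption(logits, id2word, pad_id=0, eos_id=2):
    """Hypothetical sketch: convert the decoder's vector-form text sequence
    (logits of shape (seq_len, vocab_size)) into a caption string."""
    token_ids = logits.argmax(dim=-1).tolist()   # likeliest word id per step
    return " ".join(id2word[i] for i in token_ids
                    if i not in (pad_id, eos_id))  # drop padding/end markers
```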
According to the automatic image caption generation method provided by the embodiment of the invention, the image whose caption is to be generated is processed and converted into vector image features, prior knowledge construction is carried out in the encoder to obtain effective image features, the effective image features then undergo multimodal interactive training with the image description text in the decoder, and the image caption is obtained according to the final interaction result. Therefore, the method can deeply understand the image content and extract rich semantic information, so as to generate more accurate and richer caption descriptions.
Specifically, processing the image whose caption is to be generated to obtain vector image features includes:
vector processing and preliminary feature extraction are carried out on the subtitle image to be generated, and preliminary image features are obtained;
and carrying out target feature detection processing on the preliminary image features to obtain vector image features.
For example, after the image whose caption is to be generated is processed by a Faster R-CNN network, 2048-dimensional preliminary image features are extracted, and feature-region detection and extraction are performed on them to obtain a set of region features $X=\{x_1, x_2, \dots, x_n\}$, where each $x_i$ is a 2048-dimensional vector. Here $X$ is the vector image feature, and $X$ is next input to the encoder for prior knowledge construction.
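For illustration, the projection of the detector's 2048-dimensional region features into the encoder's working dimension might be sketched as follows; a random tensor stands in for the Faster R-CNN output, and the dimension 512 and region count 36 are assumptions rather than values fixed by the embodiment:

```python
import torch
import torch.nn as nn

d_model = 512
proj = nn.Sequential(
    nn.Linear(2048, d_model),  # map each 2048-dim region feature to d_model
    nn.ReLU(),
    nn.Dropout(0.1),
)

roi_features = torch.randn(36, 2048)  # X = {x_1, ..., x_n}, n = 36 regions
X = proj(roi_features)                # vector image features, shape (36, 512)
```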
In the embodiment of the present invention, in order to reduce the deviation between the caption and the content actually expressed by the image, as shown in fig. 3, inputting the vector image features to the encoder for prior knowledge construction and obtaining effective image features includes:
s210, performing first feedforward full-connection layer processing on the vector image characteristics to obtain a first processing result;
In the embodiment of the invention, the vector image features are processed by a Feed-Forward fully connected sublayer, which specifically comprises: a linear layer (FFN), a Dropout layer, and a LayerNorm layer (LN). The Dropout probability is set to 0.1, and the formulas are as follows:

$$\mathrm{FFN}(x)=W_2\,\mathrm{ReLU}(W_1 x + b_1)+b_2,$$

$$\mathrm{LN}(x)=\gamma \odot \frac{x-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta,$$

where $\mathrm{ReLU}$ represents the activation function, $\mathrm{LN}$ represents the LayerNorm operation, $W_1$ and $W_2$ represent weight matrices, $b_1$ and $b_2$ represent bias vectors, $x$ represents the input variable (i.e. the data after attention-mechanism processing), $\mu$ represents the mean of the input variable $x$ in each dimension, $\sigma^{2}$ represents the variance of $x$, $\epsilon$ represents a small constant preventing a zero denominator, and $\gamma$ and $\beta$ represent initialized parameter tensors.
It should be appreciated that the feedforward fully connected layer processing allows the data to converge within a reasonable range and increases the degree of fitting, thereby enhancing the model's capability.
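A minimal PyTorch sketch of this feedforward fully connected sublayer, following the formulas above, is given below; since the embodiment later describes removing the residual connection in the encoding layer, no residual path is included here (this is an interpretation, not a mandated implementation):

```python
import torch.nn as nn

class FeedForwardSublayer(nn.Module):
    """Sketch of the Feed-Forward sublayer: FFN(x) = W2 ReLU(W1 x + b1) + b2,
    followed by Dropout (p = 0.1) and LayerNorm, as in the formulas above."""
    def __init__(self, d_model=512, d_ff=2048, p_drop=0.1):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W1, b1
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # W2, b2
        )
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)  # gamma and beta are learned tensors

    def forward(self, x):
        return self.norm(self.dropout(self.ffn(x)))
```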
S220, carrying out priori knowledge construction according to the first processing result to obtain a priori knowledge construction result;
In the embodiment of the present invention, as shown in fig. 4, the prior knowledge construction may specifically include: S221, dividing the first processing result into a retrieval feature, a value feature and a key feature, which are identical to one another, and performing standardization processing on the divided features to obtain a standardized processing result;
specifically, after the Feed-Forward processing, the prior knowledge model diagram is constructed for the processed feature vector, as shown in fig. 5. The vector image feature X is divided into a search feature Query, a Key feature Key and a Value feature Value, and all the information of the three features are the same and come from the vector image feature X.
S222, generating first priori knowledge, and carrying out cross processing on the search features and key features after the standardized processing and the first priori knowledge to obtain attention features;
After the three features are standardized, information slots are established by encoding prior knowledge. It should be appreciated here that the establishment of prior knowledge does not depend on the vector image feature X; instead, an independently learnable vector is constructed and then combined in computation with the standardized vector image feature X.
S223, generating second priori knowledge, and carrying out cross processing on the normalized value characteristics, the normalized attention characteristics and the second priori knowledge to obtain a priori knowledge construction result.
It should be noted that the first prior knowledge and the second prior knowledge are both autonomously constructed learnable vectors, each of which is combined in computation with the standardized vector image features.
Specifically, the operation formula is as follows:

$$Q = W_Q X,\qquad K = [\,W_K X;\ M_K\,],\qquad V = [\,W_V X;\ M_V\,],$$

where $M_K$ represents the first prior knowledge, $M_V$ represents the second prior knowledge, $[\cdot\,;\cdot]$ represents matrix concatenation, $W_Q$ represents the weight matrix corresponding to the retrieval feature, $W_K$ represents the weight matrix corresponding to the key feature, and $W_V$ represents the weight matrix corresponding to the value feature; $K$ represents the result obtained by cross-processing the retrieval and key features with the first prior knowledge, and $V$ represents the result obtained by cross-processing the value feature and the attention feature with the second prior knowledge.
The construction of prior knowledge can be understood simply as follows: for example, if the content actually represented by the features extracted from the image is "milk and bread", then in order to improve the diversity and accuracy of caption generation, prior knowledge construction can yield the result "breakfast".
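For illustration, a single-head sketch of this prior knowledge (memory) construction is given below; the slot count and dimensions are assumptions, and the learnable memories M_K and M_V play the roles of the first and second prior knowledge:

```python
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Sketch of prior-knowledge construction: learnable memory slots M_K and
    M_V (the first and second prior knowledge) are concatenated onto the
    projected keys and values before attention, per K = [W_K X; M_K] and
    V = [W_V X; M_V] above."""
    def __init__(self, d_model=512, n_memory=40):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Independently learnable vectors, not derived from the input X.
        self.m_k = nn.Parameter(torch.randn(n_memory, d_model) / d_model ** 0.5)
        self.m_v = nn.Parameter(torch.randn(n_memory, d_model) / d_model ** 0.5)

    def forward(self, x):                                # x: (n, d_model)
        q = self.w_q(x)
        k = torch.cat([self.w_k(x), self.m_k], dim=0)    # K = [W_K X; M_K]
        v = torch.cat([self.w_v(x), self.m_v], dim=0)    # V = [W_V X; M_V]
        att = torch.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)
        return att @ v
```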
S230, constructing a multi-head attention mechanism according to the priori knowledge construction result to obtain a multi-head attention mechanism construction result;
in the embodiment of the invention, multidimensional vector processing is carried out on the prior knowledge construction result so as to highlight the effective characteristics of the image and obtain a multi-head attention mechanism construction result.
Specifically, memory is used to enhance attention and is applied in a multi-head manner: the memory operation is repeated several times using each head's distinct projection weight matrices and learnable matrices, achieving the effect of enhanced memory.
After the prior knowledge construction is completed, the multi-head attention mechanism is constructed, as shown in fig. 6: Q, K and V are assigned from the prior knowledge construction result, with the same vector dimension as the data. The specific operation is as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O},\qquad \mathrm{head}_i=\mathrm{Attention}(Q W_i^{Q},\,K W_i^{K},\,V W_i^{V}).$$
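For the multi-head application itself, PyTorch's built-in multi-head attention module can serve as a sketch; the head count of 8 is an assumption:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
mha = nn.MultiheadAttention(d_model, n_heads, dropout=0.1, batch_first=True)

x = torch.randn(1, 36, d_model)   # prior-knowledge construction result
out, att_weights = mha(x, x, x)   # Q = K = V = x (self-attention)
```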
s240, performing second feedforward full-connection layer processing on the multi-head attention mechanism construction result to obtain effective image characteristics.
After the attention mechanism is built, further Feed-Forward fully connected processing is performed; the procedure of the second feedforward fully connected layer is the same as that of the first and is not repeated here.
After the Feed-Forward processing is completed, GLU gating unit processing is carried out, with the specific formula:

$$\mathrm{GLU}(x)=(W_1 x + b_1)\otimes \sigma(W_2 x + b_2),$$

where $\otimes$ denotes element-wise multiplication and $\sigma$ denotes the sigmoid function. The encoder is composed of three such layers, finally yielding the image feature vector $\tilde{X}$; this output vector is sent to the decoding layer for cross-attention with the text features to generate the caption information.
It should be noted that the gating unit processing here makes gradient propagation of the data easier, so gradient vanishing or gradient explosion is less likely to occur.
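A minimal sketch of the GLU gating unit corresponding to the formula above; a single linear layer produces both the candidate and the gate, which `F.glu` then splits and multiplies:

```python
import torch.nn as nn
import torch.nn.functional as F

class GLUGate(nn.Module):
    """Sketch of GLU(x) = (W1 x + b1) * sigmoid(W2 x + b2): one linear layer
    outputs 2*d features, which F.glu splits into candidate and gate."""
    def __init__(self, d_model=512):
        super().__init__()
        self.linear = nn.Linear(d_model, 2 * d_model)  # packs W1/b1 and W2/b2

    def forward(self, x):
        return F.glu(self.linear(x), dim=-1)  # first half gated by the second
```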
Specifically, as shown in fig. 7, inputting the effective image features to the decoder so that multimodal interaction is carried out between the effective image features and the image description text to obtain an interaction result includes:
s310, acquiring an image description text;
it should be appreciated that in embodiments of the present invention, image description text is given and vectorized.
S320, vectorizing the image description text to obtain an image description text vector;
and carrying out vectorization processing on the image description text to obtain an image description text vector Y.
S330, inputting the image description text vector to a decoder for multi-head attention mechanism processing to obtain an input sequence vector;
specifically, the method comprises the following steps:
dividing the image description text vector into a query vector, a key vector and a value vector, wherein the query vector, the key vector and the value vector are identical;
performing cross processing on the query vector and the key vector to obtain a cross processing result;
and performing cross processing according to the value vector and the weighted result to obtain an input sequence vector.
In the embodiment of the invention, the image description text vector Y is sent to the decoder for processing, and the specific calculation process is as follows:

$$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right),$$

where $pos$ represents the position of a word in the sentence, $d_{model}$ represents the dimension of the word vector, and $i$ indicates which dimension of the word vector is being encoded. The text sequence is then divided into three groups of identical vectors for multi-head attention mechanism processing, namely the query vector $Q$, the key vector $K$ and the value vector $V$. Similarity distribution is carried out directly between the query vector and the key vector to obtain a weighted sum of the two vectors, and the final input vector sequence is obtained through scaled dot-product processing.
S340, cross-connecting the input sequence vector with the effective image features input to the decoder to obtain an interaction result.
Given the processed input sequence vector $Y$ and the encoder outputs, the attention operator applies the cross-attention mechanism not only to the output of the last coding layer but cross-connects the outputs of all coding layers; the contributions of this multi-level output are summed after the operation, finally defined as:

$$\mathcal{M}(\tilde{X},Y)=\sum_{i=1}^{N}\alpha_i \odot \mathcal{C}(\tilde{X}^{i},Y),$$

where $\tilde{X}^{i}$ represents the output of the $i$-th coding layer of the encoder, and $\mathcal{C}(\cdot,\cdot)$ represents the cross-attention mechanism operation performed in the decoder, as shown in fig. 6. The operation rule is:

$$\mathcal{C}(\tilde{X}^{i},Y)=\mathrm{Attention}\!\left(W_q Y,\ W_k \tilde{X}^{i},\ W_v \tilde{X}^{i}\right),$$

where $W_q$, $W_k$ and $W_v$ represent learnable weight matrices, $Y$ represents the processed input sequence vector, and $\tilde{X}^{i}$ represents the image features input after processing by the coding layer; the Attention operation formula is as described above.

The result $\alpha_i$ obtained after the cross-attention operation is a weight matrix used to adjust the vector transmitted by each coding layer and the relative importance of the different decoding layers. $\alpha_i$ is measured by computing the correlation between the output and the input through each cross-attention layer, with the calculation rule:

$$\alpha_i=\sigma\!\left(W_i\,[\,Y;\ \mathcal{C}(\tilde{X}^{i},Y)\,]+b_i\right),$$

where $\sigma$ denotes the sigmoid function, $W_i$ is a weight matrix, $b_i$ is a bias vector, and $[\cdot\,;\cdot]$ denotes concatenation.
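A minimal sketch of this meshed cross-connection over all encoder layers; batching conventions, head count and layer count are assumptions, and the gates follow the $\alpha_i$ formula above:

```python
import torch
import torch.nn as nn

class MeshedCrossAttention(nn.Module):
    """Sketch: the decoder attends to every encoder layer's output X_i and
    sums the contributions weighted by gates
    alpha_i = sigmoid(W_i [Y; C(X_i, Y)] + b_i)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gates = nn.ModuleList(
            [nn.Linear(2 * d_model, d_model) for _ in range(n_layers)])

    def forward(self, y, enc_outputs):   # y: (b, t, d); enc_outputs: list of (b, n, d)
        out = 0
        for x_i, gate in zip(enc_outputs, self.gates):
            c, _ = self.cross(y, x_i, x_i)                    # C(X_i, Y)
            alpha = torch.sigmoid(gate(torch.cat([y, c], dim=-1)))
            out = out + alpha * c                             # weighted sum
        return out
```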
in the embodiment of the invention, generating the text sequence according to the interaction result comprises the following steps:
performing gating unit processing on the interaction result to obtain a gating unit processing result;
and carrying out feedforward full-connection layer processing on the processing result of the gating unit to obtain a text sequence.
Specifically, the interaction result finally obtained after the text and image are processed is passed through the GLU gating unit; the GLU processing is as described above and is not repeated here. In the final stage, feedforward fully connected processing is performed on the gating unit's output, which is likewise as described above and is not repeated here.
Finally, the text sequence output by the decoder undergoes diversity-promotion processing using a beam search algorithm, so as to generate diversified captions.
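A minimal beam search sketch is given below for illustration; the `step_fn` interface, which maps a partial token sequence to next-token log-probabilities, is an assumption rather than part of the embodiment:

```python
import torch

def beam_search(step_fn, bos_id, eos_id, beam_size=5, max_len=20):
    """Minimal beam search sketch: keep the beam_size highest-scoring partial
    sequences at each step; finished beams (ending in eos) are carried over."""
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:              # finished beams are kept as-is
                candidates.append((seq, score))
                continue
            log_probs = step_fn(torch.tensor(seq))        # (vocab_size,)
            topk = torch.topk(log_probs, beam_size)
            for lp, tok in zip(topk.values, topk.indices):
                candidates.append((seq + [tok.item()], score + lp.item()))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]   # highest-scoring caption token sequence
```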
Fig. 8 compares an image caption generated by the automatic image caption generation method of the present invention with a caption generated by a baseline method of the related art. As shown in fig. 8, the sentences generated by the method of the present invention are more accurate and more natural than those of the prior-art baseline model.
In summary, the automatic image caption generation method provided by the invention preprocesses the image feature information by means of object detection, extracts the feature information in the target boxes while keeping the processed feature shapes consistent, inputs the extracted picture features into an improved Transformer encoder, and encodes prior knowledge into the image features. The decoder layer receives the processed image feature information and performs cross-attention with the text feature information to generate the text description; model parameters are updated through the back-propagation algorithm to optimize model performance. The advantages of the invention are that object detection is performed when the image features are extracted, so effective feature information is obtained and noise interference is avoided. In addition, the Transformer model can effectively handle long-range dependencies, improving the quality and fluency of the generated text description. Meanwhile, the concept of prior knowledge is introduced, so that knowledge derived from the image can be learned when the description is generated, improving the accuracy and completeness of the generated text description.
In addition, the automatic image caption generation method provided by the invention uses persistent memory vectors to encode prior knowledge and a GLU gating unit to retain it, controlling loss so that gradient vanishing or explosion is unlikely and gradients propagate more easily; the residual connection previously used in the encoding layer is removed, and a multi-head attention mechanism is added after prior knowledge construction to model the relation between the prior knowledge and the originally existing feature information. Secondly, in sentence generation, the relations of the multi-layer structure are exploited: using both low-level and high-level visual relations, the decoder takes not only the input from the visual modality but also the sentences processed by the multi-head attention mechanism, fusing image information into the semantics and weighting the visual information contributed at each stage; a mask tensor is used in the process so that the image features better fused into the text features can be identified.
As another embodiment of the present invention, there is provided an automatic image caption generation device for implementing the automatic image caption generation method described above, wherein the automatic image caption generation device includes:
the acquisition module is used for acquiring an image whose caption is to be generated and processing the image to obtain vector image features;
the construction module is used for inputting the vector image features to the encoder for prior knowledge construction and obtaining effective image features;
the interaction module is used for inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text, obtaining an interaction result;
the generation module is used for generating a text sequence according to the interaction result;
and the output module is used for converting the text sequence to obtain an image caption and outputting the image caption.
For the working principle of the automatic image caption generation device provided by the invention, reference can be made to the foregoing description of the automatic image caption generation method, which is not repeated here.
As another embodiment of the present invention, there is provided a computer storage medium storing at least one program instruction that is loaded and executed by a processor to implement the aforementioned automatic image caption generation method.
In an embodiment of the present invention, there is provided a non-transitory computer-readable storage medium storing computer-executable instructions that can perform the automatic image caption generation method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the storage medium may also comprise a combination of the above kinds of memory.
It is to be understood that the above embodiments are merely illustrative of the principles of the present invention and are not intended to limit it. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and these are also considered to fall within the scope of the invention.
Claims (7)
1. An automatic image caption generation method, characterized by comprising:
acquiring an image whose caption is to be generated, and processing the image to obtain vector image features;
inputting the vector image features to an encoder for prior knowledge construction, and obtaining effective image features;
inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text, and obtaining an interaction result;
generating a text sequence according to the interaction result;
converting the text sequence to obtain an image caption, and outputting the image caption;
wherein inputting the vector image features to an encoder for prior knowledge construction and obtaining effective image features comprises:
performing first feedforward fully connected layer processing on the vector image features to obtain a first processing result;
carrying out prior knowledge construction according to the first processing result to obtain a prior knowledge construction result;
constructing a multi-head attention mechanism according to the prior knowledge construction result to obtain a multi-head attention mechanism construction result;
performing second feedforward fully connected layer processing on the multi-head attention mechanism construction result to obtain the effective image features;
wherein carrying out prior knowledge construction according to the first processing result to obtain a prior knowledge construction result comprises:
dividing the first processing result into a retrieval feature, a value feature and a key feature, which are identical to one another, and performing standardization processing on the divided features to obtain a standardized processing result;
generating first prior knowledge, and cross-processing the standardized retrieval and key features with the first prior knowledge to obtain an attention feature;
generating second prior knowledge, and cross-processing the standardized value feature, the attention feature and the second prior knowledge to obtain the prior knowledge construction result;
and wherein constructing a multi-head attention mechanism according to the prior knowledge construction result to obtain a multi-head attention mechanism construction result comprises:
carrying out multidimensional vector processing on the prior knowledge construction result to highlight the effective image features, and obtaining the multi-head attention mechanism construction result.
2. The automatic image caption generation method according to claim 1, wherein processing the image whose caption is to be generated to obtain vector image features comprises:
vector processing and preliminary feature extraction are carried out on the image to obtain preliminary image features;
and carrying out target feature detection processing on the preliminary image features to obtain vector image features.
3. The automatic image caption generation method according to claim 1, wherein inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text to obtain an interaction result comprises:
acquiring an image description text;
vectorizing the image description text to obtain an image description text vector;
inputting the image description text vector to a decoder for multi-head attention mechanism processing to obtain an input sequence vector;
and carrying out cross connection processing on the input sequence vector and the effective image characteristics input to the decoder to obtain an interaction result.
4. The automatic image caption generation method according to claim 3, wherein inputting the image description text vector to a decoder for multi-head attention mechanism processing to obtain an input sequence vector comprises:
dividing the image description text vector into a query vector, a key vector and a value vector, wherein the query vector, the key vector and the value vector are identical;
and carrying out similarity distribution according to the query vector and the key vector to obtain a weighted sum between the two vectors, and obtaining a final input vector sequence through scaling dot product processing.
5. The automatic image caption generation method according to claim 1, wherein generating a text sequence according to the interaction result comprises:
performing gating unit processing on the interaction result to obtain a gating unit processing result;
and carrying out feedforward full-connection layer processing on the processing result of the gating unit to obtain a text sequence.
6. An automatic image caption generation device for implementing the automatic image caption generation method according to any one of claims 1 to 5, characterized in that the automatic image caption generation device comprises:
the acquisition module is used for acquiring an image whose caption is to be generated and processing the image to obtain vector image features;
the construction module is used for inputting the vector image features to the encoder for prior knowledge construction and obtaining effective image features;
the interaction module is used for inputting the effective image features to a decoder so that multimodal interaction is carried out between the effective image features and the image description text, obtaining an interaction result;
the generation module is used for generating a text sequence according to the interaction result;
and the output module is used for converting the text sequence to obtain an image caption and outputting the image caption.
7. A computer storage medium, characterized in that the computer storage medium stores at least one program instruction that is loaded and executed by a processor to implement the automatic image caption generation method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310680080.9A CN116665012B (en) | 2023-06-09 | 2023-06-09 | Automatic generation method and device for image captions and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116665012A CN116665012A (en) | 2023-08-29 |
CN116665012B true CN116665012B (en) | 2024-02-09 |
Family
ID=87722223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310680080.9A Active CN116665012B (en) | 2023-06-09 | 2023-06-09 | Automatic generation method and device for image captions and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665012B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144410A (en) * | 2019-12-26 | 2020-05-12 | 齐鲁工业大学 | Cross-modal image semantic extraction method, system, device and medium |
CN113555078A (en) * | 2021-06-16 | 2021-10-26 | 合肥工业大学 | Method and system for intelligent generation of mode-driven gastroscopy report |
CN114446434A (en) * | 2021-11-11 | 2022-05-06 | 中国科学院深圳先进技术研究院 | Report generation method, system and terminal equipment |
CN114550159A (en) * | 2022-02-28 | 2022-05-27 | 中国石油大学(北京) | Image subtitle generating method, device and equipment and readable storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11651522B2 (en) * | 2020-07-08 | 2023-05-16 | International Business Machines Corporation | Adaptive cycle consistency multimodal image captioning |
Also Published As
Publication number | Publication date |
---|---|
CN116665012A (en) | 2023-08-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112487182B (en) | Training method of text processing model, text processing method and device | |
CN108959396B (en) | Machine reading model training method and device and question and answer method and device | |
CN111079532B (en) | A video content description method based on text autoencoder | |
CN110598713B (en) | Intelligent image automatic description method based on deep neural network | |
JP2023509031A (en) | Translation method, device, device and computer program based on multimodal machine learning | |
CN111079601A (en) | Video content description method, system and device based on multi-mode attention mechanism | |
JP2019008778A (en) | Captioning region of image | |
CN110807335B (en) | Translation method, device, equipment and storage medium based on machine learning | |
CN111783457B (en) | Semantic visual positioning method and device based on multi-modal graph convolutional network | |
CN113705315B (en) | Video processing method, device, equipment and storage medium | |
CN111046178B (en) | A text sequence generation method and system thereof | |
CN117423108B (en) | Image fine granularity description method and system for instruction fine adjustment multi-mode large model | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
WO2024164616A1 (en) | Visual question answering method and apparatus, electronic device and storage medium | |
CN117875395A (en) | Training method, device and storage medium of multi-mode pre-training model | |
CN110852066A (en) | Multi-language entity relation extraction method and system based on confrontation training mechanism | |
CN111783475B (en) | A Semantic Visual Localization Method and Device Based on Phrase Relation Propagation | |
CN116665012B (en) | Automatic generation method and device for image captions and storage medium | |
CN117853175A (en) | User evaluation information prediction method and device and electronic equipment | |
CN117634459A (en) | Target content generation and model training method, device, system, equipment and medium | |
CN114511813B (en) | Video semantic description method and device | |
CN116975347A (en) | Image generation model training method and related device | |
Viswanathan et al. | Text to image translation using generative adversarial networks | |
Kasi et al. | A deep learning based cross model text to image generation using DC-GAN | |
CN114120245A (en) | Crowd image analysis method, device and equipment based on deep neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |