CN114090815A - Training method and training device for image description model

Info

Publication number
CN114090815A
CN114090815A (application number CN202111341668.9A)
Authority
CN
China
Prior art keywords
image
text
training
model
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111341668.9A
Other languages
Chinese (zh)
Inventor
曹晚霞
朱飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Electronic Technology Wuhan Co., Ltd.
Original Assignee
Hisense Electronic Technology Wuhan Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co., Ltd.
Priority to CN202111341668.9A
Publication of CN114090815A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The application discloses a training method and a training device for an image description model. The training method comprises the following steps: for any candidate image in the image-text pair training set, the candidate image is first input into the image description model that has completed word-granularity training to obtain a candidate predicted text; the candidate image and the candidate predicted text are then input into a pre-trained image-text matching model to determine the image-text similarity; the CIDEr score of the candidate predicted text against the candidate annotation text and the image-text similarity are then added in a preset proportion to obtain the current reward value; a parameter update gradient is obtained from the current reward value, thereby completing sentence-level fine-tuning of the image description model that has completed word-granularity training. The training method as a whole uses a reinforcement learning approach to link the pre-trained image-text matching model with the image description model, so that the trained image description model can generate predicted description texts that match the actual image more closely, which improves the prediction accuracy of the image description model.

Description

Training method and training device for image description model
Technical Field
The present disclosure relates to the field of display technologies, and in particular, to a training method and a training device for an image description model.
Background
Image description refers to generating, from an image, a description text that introduces the content of the image; similar to looking at a picture and talking about it, it has been a research focus in recent years. Applying image description to smart devices (such as smart televisions) can improve the interaction experience between the user and the smart device. However, although describing an image is simple for people, the scenes involved in images are numerous and the attributes of the entities in an image and the relationships among those entities vary widely, so accurately describing an image is full of challenges for a smart device.
The smart device can generate a predicted description text for an image using a trained image description model. The image description model mainly adopts an encoder-decoder structure, and an attention mechanism is introduced in the decoding process so that the corresponding target region in the image is attended to when a target word in the predicted description text is generated. At present, the image description model is trained as follows: word-granularity training is performed on the image description model using an image-text pair training set, and after the parameters of the image description model converge, sentence-granularity training is performed on the image description model using the image-text pair training set. In sentence-granularity training, the candidate image in any candidate image-text pair in the image-text pair training set is input into the image description model to obtain the corresponding candidate predicted text, the CIDEr score (a metric based on the similarity of the texts' TF-IDF vectors) of the candidate predicted text and the candidate annotation text corresponding to the candidate image is determined, and the parameters of the image description model are adjusted with a reinforcement learning method (for example, self-critical sequence training): the CIDEr score is used as the current reward value obtainable by the image description model, the parameter update gradient of the image description model is obtained according to the current reward value, the updated parameters of the image description model are determined according to the parameter update gradient, and the parameters of the image description model are adjusted continuously until training is completed.
In this training process, CIDEr is the main criterion for adjusting the model parameters. However, for the same candidate image, it may happen that the CIDEr score between the candidate predicted text and the candidate annotation text is low although their actual meanings are the same, or that the CIDEr score is high although their actual meanings are completely different. As a result, the predicted description text generated by an image description model trained with this method may match the actual image poorly.
Disclosure of Invention
The embodiment of the application provides a training method and a training device for an image description model, which can improve the matching degree of a prediction description text generated by the image description model and an actual image.
In a first aspect, an embodiment of the present application provides a training method for an image description model, where the training method includes:
acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until the model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the target training step comprises the following steps:
inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
determining the image-text similarity of the candidate image and the candidate predicted text;
determining the CIDEr score of the candidate predicted text and the candidate annotation text;
obtaining, according to the image-text similarity, a preset model hyper-parameter and the CIDEr score, a current reward value obtainable by the intermediate model for image description;
acquiring a parameter updating gradient of the intermediate model according to the current reward value;
and adjusting the parameters of the intermediate model by using the parameter updating gradient.
In some embodiments, the determining the image-text similarity between the candidate image and the candidate predicted text comprises:
performing text coding on the candidate predicted text to obtain a plurality of word vectors;
inputting the candidate image and the word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model is trained with an extended training set using the MoCo learning method, and the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
In some examples, the extended training set is determined by:
acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs;
obtaining a plurality of negative example images corresponding to the candidate annotation text to obtain a plurality of second negative example image-text pairs;
and all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form the extended training set.
In some embodiments, the obtaining a plurality of negative example texts corresponding to the candidate image to obtain a plurality of first negative example image-text pairs includes:
respectively determining the image-text similarity of the candidate image and other label texts in the image-text pair training set except the candidate label text;
acquiring x unmatched texts in ascending order of the image-text similarity;
performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts;
acquiring y fluent rewritten texts in descending order of fluency;
merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts;
the candidate image and each negative example text form a first negative example image-text pair.
In some embodiments, the image-text matching model is trained using the following MoCo learning method:
inputting a training image into the image-text matching model to obtain a training image feature vector output by an image coding module in the image-text matching model, wherein the training image is an image in any training image-text pair in the extended training set;
inputting a training annotation text into the image-text matching model, and acquiring a training text vector output by a text coding module in the image-text matching model, wherein the training annotation text is the positive example annotation text corresponding to the training image;
inputting the training image into a momentum image-text matching model to obtain a positive example image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient;
inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module;
combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set;
inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by a momentum text coding module in the momentum image-text matching model;
inputting each negative example text corresponding to the training image into the momentum image-text matching model, and acquiring a plurality of negative example text vectors output by the momentum text coding module;
merging the positive example text vector and the negative example text vectors into a momentum text vector set;
determining a first contrast loss according to the training image feature vector and each vector in the momentum text vector set;
determining a second contrast loss according to the training text vector and each vector in the momentum image feature set;
determining the sum of the first contrast loss and the second contrast loss as a total loss;
and adjusting parameters of the image-text matching model according to the total loss.
In some embodiments, the momentum image-text matching model is established by the following formula:
CAAN_m = m · CAAN_m + (1 − m) · CAAN
where CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
In some embodiments, performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model includes:
inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in the image-text pair training set to obtain a predicted text;
determining text loss of the predicted text and the candidate annotation text;
and adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
In some embodiments, the inputting the candidate image into the intermediate model for image description to obtain a candidate predicted text includes:
extracting image features of the candidate images to obtain image feature vectors;
and generating an image description text for the image feature vector to obtain a candidate prediction text.
In some embodiments, the obtaining, according to the image-text similarity, a preset model hyper-parameter, and the CIDEr, a current reward value that can be obtained by the intermediate model for image description includes:
determining the current reward value obtainable by the intermediate model for image description by the following formula:
reward = CIDEr + λ·S(I, T)
where reward is the current reward value obtainable by the intermediate model for image description, CIDEr is the CIDEr score (text similarity) between the candidate predicted text and the candidate annotation text, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
In a second aspect, an embodiment of the present application provides a training apparatus for an image description model, where the training apparatus includes:
an image-text pair training set acquisition unit configured to: acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
a first training unit configured to: performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
a second training unit configured to: performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until the model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the second training unit comprises:
an image description subunit configured to: inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
a teletext similarity determination subunit configured to: determining the image-text similarity of the candidate image and the candidate predicted text;
a text similarity determination subunit configured to: determining the CIDEr score of the candidate predicted text and the candidate annotation text;
a current reward value obtaining subunit configured to: obtaining, according to the image-text similarity, a preset model hyper-parameter and the CIDEr score, a current reward value obtainable by the intermediate model for image description;
a parameter update gradient acquisition subunit configured to: acquiring a parameter updating gradient of the intermediate model according to the current reward value;
a parameter adjustment subunit configured to: and adjusting the parameters of the intermediate model by using the parameter updating gradient.
In some embodiments, the determining the image-text similarity between the candidate image and the candidate predicted text specifically includes:
performing text coding on the candidate predicted text to obtain a plurality of word vectors;
inputting the candidate image and the word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model is trained with an extended training set using the MoCo learning method, and the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
In some embodiments, the extended training set is determined by:
acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs;
obtaining a plurality of negative example images corresponding to the candidate annotation text to obtain a plurality of second negative example image-text pairs;
and all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form the extended training set.
In some embodiments, the obtaining of the multiple negative example texts corresponding to the candidate image to obtain multiple first negative example image-text pairs specifically includes:
respectively determining the image-text similarity of the candidate image and other label texts in the image-text pair training set except the candidate label text;
acquiring x unmatched texts in ascending order of the image-text similarity;
performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts;
acquiring y fluent rewritten texts in descending order of fluency;
merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts;
the candidate image and each negative example text form a first negative example image-text pair.
In some embodiments, the image-text matching model is trained using the following MoCo learning method:
inputting a training image into the image-text matching model to obtain a training image feature vector output by an image coding module in the image-text matching model, wherein the training image is an image in any training image-text pair in the extended training set;
inputting a training annotation text into the image-text matching model, and acquiring a training text vector output by a text coding module in the image-text matching model, wherein the training annotation text is the positive example annotation text corresponding to the training image;
inputting the training image into a momentum image-text matching model to obtain a positive example image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient;
inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module;
combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set;
inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by a momentum text coding module in the momentum image-text matching model;
inputting each negative example text corresponding to the training image into the momentum image-text matching model, and acquiring a plurality of negative example text vectors output by the momentum text coding module;
merging the positive example text vector and the negative example text vectors into a momentum text vector set;
determining a first contrast loss according to the training image feature vector and each vector in the momentum text vector set;
determining a second contrast loss according to the training text vector and each vector in the momentum image feature set;
determining the sum of the first contrast loss and the second contrast loss as a total loss;
and adjusting parameters of the image-text matching model according to the total loss.
In some embodiments, the momentum image-text matching model is established by the following formula:
CAAN_m = m · CAAN_m + (1 − m) · CAAN
where CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
In some embodiments, the word granularity training of the image description model by using the image-text pair training set is performed to obtain an intermediate model, and specifically, the word granularity training is performed by:
inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in the image-text pair training set to obtain a predicted text;
determining text loss of the predicted text and the candidate annotation text;
and adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
In some embodiments, the inputting the candidate image into the intermediate model for image description to obtain a candidate predicted text specifically includes:
extracting image features of the candidate images to obtain image feature vectors;
and generating an image description text for the image feature vector to obtain a candidate prediction text.
In some embodiments, the obtaining, according to the image-text similarity, a preset model hyper-parameter, and the CIDEr, a current reward value that can be obtained by the intermediate model performing image description specifically includes:
determining the current reward value obtainable by the intermediate model for image description by the following formula:
reward = CIDEr + λ·S(I, T)
where reward is the current reward value obtainable by the intermediate model for image description, CIDEr is the CIDEr score (text similarity) between the candidate predicted text and the candidate annotation text, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
The technical solution provided by the embodiment of the application has the following beneficial effects: for any candidate image in the image-text pair training set, the candidate image is first input into the intermediate model obtained after word-granularity training to obtain a candidate predicted text; the candidate image and the candidate predicted text are then input into a pre-trained image-text matching model to determine the image-text similarity; the CIDEr score of the candidate predicted text and the candidate annotation text is then combined with a preset model hyper-parameter and the image-text similarity to obtain the current reward value obtainable by the intermediate model for image description; and the parameter update gradient is obtained according to the current reward value, thereby completing sentence-level fine-tuning of the intermediate model. In the technical solution provided by the embodiment of the application, the image-text similarity and the CIDEr score jointly serve as the reference criterion of the image description model, so that the trained image description model can generate predicted description texts that match the actual image more closely, which improves the prediction accuracy of the image description model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings.
FIG. 1 is a diagram illustrating an architecture of a decoder in an image description model according to an embodiment of the present application;
fig. 2 shows a flowchart corresponding to a training method for an image description model according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an architecture of an image-text matching model provided in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a scene rewriting flow of candidate annotation texts according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an architecture of MoCo contrastive learning provided by an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a training process of an image-text matching model provided in an embodiment of the present application;
fig. 7 is a schematic overall flow chart corresponding to a training method for an image description model according to an embodiment of the present disclosure;
fig. 8 shows a schematic structural diagram of a training apparatus for an image description model according to an embodiment of the present application.
Detailed Description
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The following first explains an image description model in an embodiment of the present application with reference to the drawings.
The main algorithmic framework of the image description model is an encoder-decoder framework: an object detection algorithm (e.g., Faster RCNN) or an image feature extraction algorithm (e.g., Resnet) is first adopted to encode the image, and a decoder then decodes it to generate the text description corresponding to the image. An attention module is introduced in the decoding process; through a top-down attention mechanism, the relevant part of the image is selected according to the current context, so that the corresponding image region is attended to when the word for that region is generated. Fig. 1 schematically shows an architecture diagram of a decoder in an image description model provided by an embodiment of the present application. As shown in fig. 1, the decoder adopts a three-layer LSTM (long short-term memory network) structure, where the three LSTM layers are LSTM0, LSTM1 and LSTM2, respectively, and the input of the i-th LSTM layer includes two parts, x_t^i and h_{t-1}^i, with

h_t^i = LSTM(x_t^i, h_{t-1}^i)

where the layer index i of the LSTM is 0, 1 or 2; t and t-1 are the current time and the previous time, respectively; x_t^i is the input of the i-th layer at the current time; h_{t-1}^i is the output of the i-th layer at the previous time; and h_t^i is the output of the i-th LSTM layer at the current time. Fully connected is the fully connected layer, y_t is the result output by the fully connected layer, and Softmax is the standard classification operation for neural networks. In generating each word, the input of the LSTM takes into account both the already generated description information and the relevant image information selected by the attention mechanism (i.e., Attention in fig. 1).
For LSTM0, the input of layer 0 at the current time may be represented by equation (1):

x_t^0 = [h_{t-1}^2, v̄, W_e Π_t]    (1)

In equation (1), x_t^0 is the input to layer 0 at the current time; [·] is the vector concatenation symbol; h_{t-1}^2 is the output of layer 2 (i.e., LSTM2) at the previous time; v̄ = (1/K) Σ_i v_i is the average of the K image region feature vectors derived from Faster R-CNN or grid features, where v_i is an image region feature vector; and W_e Π_t is the word vector of the current input word (W_e ∈ R^{E×|Σ|} is the word vector matrix over the dictionary Σ, and Π_t is the one-hot encoding of the input word at time t). LSTM0 fuses the output of the previous time step while fusing both image and text modality information. The output of LSTM0 is h_t^0, which will be part of the input of the next LSTM layer.
For LSTM1, the input of layer 1 at the current time may be represented by equation (2):

x_t^1 = [v̂_t, h_t^0]    (2)

In equation (2), x_t^1 is the input of layer 1 at the current time; v̂_t = Σ_i a_{i,t} v_i is the weighted average of the K image region feature vectors derived from Faster R-CNN or grid features, where v_i is an image region feature vector and a_{i,t} is the attention weight, which depends on the image and on the description generated before time t, indicates which targets in the image should be attended to when the word at time t is generated, and filters the image regions that need attention according to the previously generated description so as to generate the word at the current time; [·] is the vector concatenation symbol; and h_t^0 is the output of layer 0 at the current time. The output of LSTM1 is h_t^1, which will be part of the input of the next LSTM layer.
For LSTM2, the input of layer 2 at the current time may be represented by equation (3):

x_t^2 = [v̂_t, f(h_t^1)]    (3)

In equation (3), x_t^2 is the input of layer 2 at the current time; v̂_t is the weighted average of the K image region feature vectors derived from Faster R-CNN or grid features, where v_i is an image region feature vector; and f(·) is a fully connected network. The output of LSTM2 is h_t^2. Finally, the outputs of the three LSTM layers are concatenated as [h_t^0, h_t^1, h_t^2] and passed through a fully connected network to obtain y_t, and the conditional probability distribution of the current word is then obtained through a Softmax layer, finally giving the probability distribution of the predicted sequence. The objective function of the image description model is to minimize the cross-entropy loss between the probability distributions of the predicted sequence and the annotated sequence. The model fuses image information at every step of generating the description, and the attention mechanism in the multi-layer LSTM that extracts image and text information guides the model to attend to the image regions relevant to sentence generation, so that the generated result fits the image better.
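To make the three-layer decoding step above concrete, the following is a minimal PyTorch-style sketch of one decoding time step. It is an illustrative reading of equations (1) to (3) rather than code from the patent, and names such as TripleLSTMDecoder and the simplified attention scorer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleLSTMDecoder(nn.Module):
    """Illustrative 3-layer LSTM decoder: x_t^0 = [h_{t-1}^2, v_mean, W_e*Pi_t],
    x_t^1 = [v_hat, h_t^0], x_t^2 = [v_hat, f(h_t^1)], y_t from the concatenated outputs."""
    def __init__(self, vocab_size, emb_dim=512, feat_dim=2048, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word vector matrix W_e
        self.lstm0 = nn.LSTMCell(hid + feat_dim + emb_dim, hid)  # fuses text + global image info
        self.lstm1 = nn.LSTMCell(feat_dim + hid, hid)            # attended image + h_t^0
        self.lstm2 = nn.LSTMCell(feat_dim + hid, hid)            # attended image + f(h_t^1)
        self.att = nn.Linear(feat_dim + hid, 1)                  # simplified a_{i,t} scorer
        self.f = nn.Linear(hid, hid)                             # fully connected network f(.)
        self.out = nn.Linear(3 * hid, vocab_size)                # maps [h^0, h^1, h^2] to y_t

    def step(self, word_ids, V, states):
        # V: (B, K, feat_dim) region features; states: [(h, c)] for the three layers.
        (h0, c0), (h1, c1), (h2, c2) = states
        v_mean = V.mean(dim=1)                                   # average of the K region vectors
        x0 = torch.cat([h2, v_mean, self.embed(word_ids)], dim=-1)
        h0, c0 = self.lstm0(x0, (h0, c0))
        # attention weights a_{i,t}, conditioned on the regions and on h_t^0
        scores = self.att(torch.cat([V, h0.unsqueeze(1).expand(-1, V.size(1), -1)], dim=-1))
        a = F.softmax(scores, dim=1)                             # (B, K, 1)
        v_hat = (a * V).sum(dim=1)                               # weighted average of regions
        h1, c1 = self.lstm1(torch.cat([v_hat, h0], dim=-1), (h1, c1))
        h2, c2 = self.lstm2(torch.cat([v_hat, self.f(h1)], dim=-1), (h2, c2))
        log_probs = F.log_softmax(self.out(torch.cat([h0, h1, h2], dim=-1)), dim=-1)
        return log_probs, [(h0, c0), (h1, c1), (h2, c2)]
```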
The image description model is first trained at word granularity according to the cross-entropy loss of the language model, and the model is then further optimized at sentence granularity with reinforcement learning, for example the SCST (Self-Critical Sequence Training) method, using CIDEr as the reinforcement learning reward. CIDEr treats each sentence as a document, represents it as a TF-IDF (term frequency-inverse document frequency) vector, and then computes the cosine similarity between the reference image description and the image description generated by the model. However, a high word-level overlap between the generated description and the annotated description does not necessarily mean that the two sentences are semantically similar, so in some cases the matching degree between the description generated by the image description model and the image is not high.
In order to improve the matching degree between a prediction description text generated by an image description model and an actual image, the embodiment of the application provides a training method of the image description model. Fig. 2 exemplarily shows a flow diagram corresponding to a training method for an image description model provided in an embodiment of the present application, and as shown in fig. 2, the method specifically includes the following steps:
201: and acquiring a picture-text pair training set.
Specifically, the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image. The annotation text in an image-text pair is obtained by annotating the image professionally, for example by expert annotation.
Illustratively, a picture of playing basketball and the corresponding text annotation "two players play basketball on the court" together form one image-text pair.
202: and performing word granularity training on the image description model by adopting the image-text pair training set to obtain an intermediate model.
In some embodiments, the image description model may be trained at word granularity using the image-text pair training set to obtain the intermediate model as follows:
firstly, inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in an image-text pair training set to obtain a predicted text.
Specifically, the image description includes image feature extraction and image description text generation. The candidate teletext pair comprises a candidate image and a candidate annotation text describing the content of the candidate image.
The candidate image is input into the image description model to be trained, and image feature extraction is performed to obtain an image feature vector. The image feature extraction comprises grid feature extraction, adaptive average pooling and overall averaging. The grid feature extraction may adopt a pre-trained Resnet101 model, without particular limitation.
Illustratively, the feature map output by the last layer of the grid feature extraction is subjected to adaptive average pooling and overall averaging respectively: the adaptive average pooling divides it into a 7 × 7 grid to obtain 7 × 7 × 2048-dimensional partitioned region image features, and the overall averaging averages the 49 partitioned region features to obtain a 2048-dimensional image feature vector.
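As a rough sketch of this grid feature extraction step (assuming a torchvision ResNet-101 backbone and the 7 × 7 pooling of the example above; the input size and preprocessing are illustrative assumptions):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal sketch: take the last convolutional feature map of a pre-trained
# ResNet-101, then derive the 7 x 7 x 2048 partitioned region features
# (adaptive average pooling) and the 2048-d overall average feature vector.
backbone = models.resnet101(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
feature_extractor.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 448, 448)            # stand-in for a preprocessed candidate image
    fmap = feature_extractor(image)                # (1, 2048, 14, 14) last-layer feature map
    grid = F.adaptive_avg_pool2d(fmap, (7, 7))     # (1, 2048, 7, 7) partitioned region features
    regions = grid.flatten(2).transpose(1, 2)      # (1, 49, 2048): 49 region feature vectors
    global_vec = regions.mean(dim=1)               # (1, 2048) overall average image feature vector
```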
After the image feature extraction is performed, the image description text generation is performed again. When the image description text is generated, a word segmentation tool (for example, jieba) can be used for segmenting the candidate annotation text of the candidate image, and a plurality of word embedding vectors are obtained in a word embedding manner. And finally, obtaining a predicted text of the candidate image according to the word embedding vector and the image characteristic vector.
Then, text loss of the predicted text and the candidate annotation text is determined.
And finally, adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
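A minimal sketch of this word-granularity (cross-entropy) training loop is given below; model, train_loader and the hyper-parameter values are placeholders standing in for the components described above, not settings specified by the patent.

```python
import torch
import torch.nn as nn

def train_word_granularity(model, train_loader, epochs=30, lr=5e-4, pad_id=0):
    """Teacher-forced cross-entropy training at word granularity, repeated
    until the image description model's parameters converge."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, captions in train_loader:           # captions: (B, T) annotation token ids
            logits = model(images, captions[:, :-1])    # predict the next word at each step
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```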
203: and performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until the model parameters of the intermediate model are converged. Wherein the target training step comprises:
2031: and inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text.
Specifically, the image description includes image feature extraction and image description text generation.
Inputting the candidate image into the intermediate model, firstly, carrying out image feature extraction on the candidate image to obtain an image feature vector, and then carrying out image description text generation on the image feature vector to obtain a candidate prediction text.
2032: and determining the image-text similarity of the candidate image and the candidate predicted text.
In some embodiments, the teletext similarity of the candidate image and the candidate predicted text may be determined by:
firstly, text coding is carried out on the candidate predicted texts to obtain a plurality of word vectors.
In some embodiments, text encoding may be implemented with a bidirectional GRU (gated recurrent unit) or an LSTM (long short-term memory network), and a pre-trained BERT model (a pre-trained text representation model) may also be used; a minimal encoding sketch is given after these steps.
And secondly, inputting the candidate image and the plurality of word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text.
The image-text matching model is trained with an extended training set using the MoCo learning method, where the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
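A minimal sketch of the text-encoding step mentioned above, using a bidirectional GRU (the embedding and hidden sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode a candidate predicted text into one word vector per token with a Bi-GRU."""
    def __init__(self, vocab_size, emb_dim=300, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                        # (B, T) word ids of the predicted text
        out, _ = self.bigru(self.embed(token_ids))       # (B, T, 2 * hid)
        # Average the forward and backward directions to get one vector per word.
        return out.view(out.size(0), out.size(1), 2, -1).mean(dim=2)   # (B, T, hid)
```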
The following describes a graph-text matching model provided in the embodiment of the present application.
Fig. 3 exemplarily shows an architecture diagram of the image-text matching model provided by the embodiment of the present application. As shown in fig. 3, A is the image coding branch of the image-text matching model (also called the CAAN model): the candidate image is encoded into image feature vectors through a bottom-up attention mechanism (i.e., Bottom-up), the image feature vectors are denoted by V, and K denotes the number of image feature vectors. B is the text coding branch of the CAAN model: a Bi-GRU (bidirectional gated recurrent neural network) is adopted to encode the candidate predicted text and obtain a plurality of word vectors, denoted by U. C is the processing branch of the CAAN model: a similarity matrix H between the words in the candidate predicted text and the regions in the candidate image is computed from the image feature vectors V and the word vectors U; the inter-modal attention and the intra-modal attention weights are then computed in turn; and the image feature vectors and word vectors, weighted by the intra-modal and inter-modal attention weights, are aggregated and mapped into the multi-modal joint space to obtain the picture vector ṽ and the text vector ũ in the multi-modal joint space, where f(V, U) represents the attention weight of each image region and g(V, U) represents the attention weight of each word of the text. The similarity between the picture vector ṽ and the text vector ũ is the image-text similarity. In fig. 3, Element-wise Sum indicates element-wise summation, Element-wise Product indicates element-wise product, and Matrix Multiplication indicates matrix multiplication.
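At matching time, the image-text similarity S(I, T) is the similarity between the two joint-space vectors; a minimal sketch follows, where the use of cosine similarity is an assumption since the text above only speaks of "similarity".

```python
import torch
import torch.nn.functional as F

def image_text_similarity(v_tilde: torch.Tensor, u_tilde: torch.Tensor) -> torch.Tensor:
    """Similarity between the joint-space picture vector and text vector (cosine assumed)."""
    return F.cosine_similarity(v_tilde, u_tilde, dim=-1)
```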
The image-text matching model provided by the embodiment of the application is trained with an extended training set, where the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
In some embodiments, the augmented training set may be determined by:
the method comprises the following steps of firstly, obtaining a plurality of negative example texts corresponding to candidate images to obtain a plurality of first negative example image-text pairs.
In some embodiments, the first negative example teletext pair may be obtained by:
step one, respectively determining the image-text similarity of the candidate image and other label texts except the candidate label text in the image-text pair training set.
Specifically, the graph-text matching model shown in fig. 3 may be used to determine the graph-text similarity.
And step two, acquiring x unmatched texts according to the sequence of the image-text similarity from small to large.
That is, the x annotation texts that least match the candidate image are taken as unmatched texts.
And step three, performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts.
Specifically, the candidate annotation text may be parsed into a text scene graph, where the text scene graph is composed of entities, attributes and relationships; the entities are represented by rectangles, the attributes by diamonds, and the relationships by circles. Each parsed word is then replaced with a synonym using word dictionaries, which include an entity dictionary, an attribute dictionary and a relationship dictionary. Scene rewriting of the candidate annotation text is achieved by replacing the corresponding words in the candidate annotation text with synonyms from the corresponding dictionary, obtaining a plurality of rewritten texts corresponding to the candidate image. For example, a synonym of the noun "player" may be "athlete", the numeral "two" may be replaced by "a pair of", and the text "two players play basketball on the court" may be rewritten as "two athletes playing basketball on the court".
And step four, acquiring y fluent rewritten texts in descending order of fluency.
Illustratively, the candidate annotation text is "two players play basketball on the court". Fig. 4 exemplarily shows a scene rewriting flow diagram of the candidate annotation text provided by the embodiment of the present application. As shown in fig. 4, the entities in the candidate annotation text are "player", "court" and "basketball", the relationships are "on", "up" and "playing", and the attribute is "two". The pre-established attribute dictionary comprises "one", "beautiful", "handsome" and the like, the entity dictionary comprises "man", "worker", "dog" and the like, and the relationship dictionary comprises "kick", "sit", "stand" and the like. A synonym is randomly selected from the entity dictionary to replace the entity "player" (for example, "player" may be replaced by "man"), and attribute words and relationship words are replaced in the same way. Each word in the sentence is replaced with a probability of 1/3, each sentence is randomly rewritten many times, and the 100 most fluent sentences are selected as the scene-rewritten texts of the candidate annotation text.
And step five, merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts.
And step six, forming a first negative example image-text pair by the candidate image and each negative example text.
And secondly, acquiring a plurality of negative example images corresponding to the candidate annotation texts to obtain a plurality of second negative example image-text pairs.
In some embodiments, the second negative example teletext pair may be obtained by:
step one, respectively determining the image-text similarity of the candidate annotation text and other images except the candidate image in the image-text pair training set.
Specifically, the graph-text matching model shown in fig. 3 may be used to determine the graph-text similarity.
And step two, acquiring x unmatched images according to the sequence of the image-text similarity from small to large.
That is, the x images that least match the candidate annotation text are taken as unmatched images.
It should be noted that, in order to make the number of images and texts consistent, the number of unmatched images and the number of unmatched texts should be the same.
And step three, randomly sampling pictures from another image-text pair data set to obtain y sampled pictures.
It should be noted that this other image-text pair data set and the image-text pair training set are two independent data sets.
It should be noted that, in order to keep the numbers of images and texts consistent, the number of sampled pictures should be the same as the number of rewritten texts.
And step four, combining the unmatched images and the sampling images into a plurality of negative example images.
And step five, forming a second negative example image-text pair from the candidate annotation text and each negative example image.
And thirdly, forming an extended training set by all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set.
By the embodiment, a large number of negative examples can be introduced on the basis of the image-text pair training set, so that the training set can be expanded without spending higher cost, the negative examples are easy to obtain, more variable scenes can be covered, and the training of a subsequent image-text matching model is facilitated.
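The negative-example expansion described above can be sketched roughly as follows; similarity, scene_rewrite and fluency are placeholders for the image-text matching model, the scene-graph rewriting step and the fluency scorer, and the counts x and y are illustrative.

```python
import random

def build_extended_pairs(candidate_image, candidate_text, all_texts, all_images,
                         extra_images, similarity, scene_rewrite, fluency, x=5, y=5):
    """Sketch of negative-example expansion for one candidate image-text pair."""
    # First negative example pairs: least-similar texts plus fluent scene rewrites.
    other_texts = [t for t in all_texts if t != candidate_text]
    unmatched_texts = sorted(other_texts, key=lambda t: similarity(candidate_image, t))[:x]
    rewrites = sorted(scene_rewrite(candidate_text), key=fluency, reverse=True)[:y]
    first_neg = [(candidate_image, t) for t in unmatched_texts + rewrites]

    # Second negative example pairs: least-similar images plus images randomly
    # sampled from an independent image-text data set (counts kept consistent).
    other_imgs = [i for i in all_images if i is not candidate_image]
    unmatched_imgs = sorted(other_imgs, key=lambda i: similarity(i, candidate_text))[:x]
    sampled_imgs = random.sample(extra_images, y)
    second_neg = [(i, candidate_text) for i in unmatched_imgs + sampled_imgs]

    return first_neg + second_neg
```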
The CAAN model provided by the embodiment of the application takes the triplet loss as its loss function during training; in order to enhance the representation capability of the model and give it a finer-grained image-text matching capability, MoCo contrastive learning is introduced on the basis of CAAN so that the model can learn from more negative examples.
Features learned with the MoCo-based unsupervised learning structure for ImageNet classification can rival supervised learning performance. As shown in fig. 5, the MoCo contrastive learning structure, inspired by NLP tasks, encodes the picture data into query vectors and key vectors respectively, i.e., a query q and a key queue k, where the queue contains a single positive sample and a plurality of negative samples. The feature representation is learned through a contrastive loss, and the main principle remains unchanged: during training, the similarity between each query vector and its corresponding key vector is increased as much as possible, while the similarity between the query vector and the key vectors of other pictures is reduced. MoCo encodes the data with two neural networks: an encoder and a momentum encoder. The encoder encodes the representation of the current instance, and the momentum encoder encodes the representations of a set of instances (including the current one). For the current instance, the agreement between its encoder output and its own momentum-encoder output is maximized, while the agreement with the momentum-encoder outputs of the other instances is minimized. In fig. 5, x_query represents the query vector, x_key^i represents the key vectors, encoder represents the encoder, momentum encoder represents the momentum encoder, similarity represents the similarity, and contrastive loss represents the contrastive loss.
The image-text matching model provided by the embodiment of the application is trained by utilizing an extended training set. Before training begins, a momentum image-text matching model, namely a secondary model of the image-text matching model, is established according to the image-text matching model and a preset proportionality coefficient.
The momentum image-text matching model comprises a momentum image coding module and a momentum text coding module.
In some embodiments, the momentum image-text matching model may be established by equation (4):

CAAN_m = m · CAAN_m + (1 − m) · CAAN    (4)

In equation (4), CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
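In code, equation (4) corresponds to a standard momentum (exponential moving average) parameter update; the sketch below assumes the CAAN model and its momentum copy are PyTorch modules with identical structure.

```python
import torch

@torch.no_grad()
def momentum_update(caan, caan_m, m=0.999):
    """CAAN_m = m * CAAN_m + (1 - m) * CAAN, applied parameter-wise."""
    for p, p_m in zip(caan.parameters(), caan_m.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

# Before training starts, the momentum model is initialised as a copy of the
# CAAN model, e.g. caan_m = copy.deepcopy(caan), and is never updated by backprop.
```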
Fig. 6 schematically illustrates the training flow of the image-text matching model provided by the embodiment of the present application. As shown in fig. 6, the image-text matching model provided by the embodiment of the present application may be trained through the following steps:
First, the training image is input into the image-text matching model to obtain the training image feature vector Z_j^I output by the image coding module in the image-text matching model.
The training image is the image in any training image-text pair in the extended training set, and the image-text matching model further comprises a text coding module. CAAN^I(·) denotes the image coding module in the image-text matching model, and CAAN^T(·) denotes the text coding module.
Second, the training annotation text is input into the image-text matching model to obtain the training text vector Z_j^T output by the text coding module in the image-text matching model.
The training annotation text is the positive example annotation text corresponding to the training image.
Third, the training image is input into the momentum image-text matching model to obtain the positive example image feature vector P_j^I output by the momentum image coding module.
In fig. 6, CAAN_m^I(·) denotes the momentum image coding module in the momentum image-text matching model, CAAN_m^T(·) denotes the momentum text coding module, and Push indicates that the coding vectors obtained from the current batch are pushed into the corresponding coding queue.
Fourth, each negative example training image corresponding to the training annotation text is input into the momentum image-text matching model to obtain a plurality of negative example image feature vectors N_t^I output by the momentum image coding module.
Fifth, the positive example image feature vector P_j^I and the negative example image feature vectors N_t^I are merged into the momentum image feature set Q^I.
Sixth, the training annotation text is input into the momentum image-text matching model to obtain the positive example text vector P_j^T output by the momentum text coding module.
Seventh, each negative example text corresponding to the training image is input into the momentum image-text matching model to obtain a plurality of negative example text vectors N_t^T output by the momentum text coding module.
Eighth, the positive example text vector P_j^T and the negative example text vectors N_t^T are merged into the momentum text vector set Q^T.
Ninth, the first contrast loss L_I2T is determined according to the training image feature vector Z_j^I and each vector in the momentum text vector set Q^T.
In some embodiments, the first contrast loss L_I2T may be determined by equation (5):

L_I2T = −(1/J) Σ_{j=1}^{J} log [ exp(Z_j^I · P_j^T / τ) / ( exp(Z_j^I · P_j^T / τ) + Σ_{N_t^T ∈ N^T} exp(Z_j^I · N_t^T / τ) ) ]    (5)

In equation (5), L_I2T is the first contrast loss, Z_j^I is the training image feature vector, P_j^T is the positive example text vector, N^T is the set of all negative example text vectors in the momentum text vector set, j indexes the j-th pair of data, J is the total number of data pairs, τ is the temperature parameter, and Q^T is the momentum text vector set.
Tenth, the second contrast loss L_T2I is determined according to the training text vector Z_j^T and each vector in the momentum image feature set Q^I.
In some embodiments, the second contrast loss L_T2I may be determined by equation (6):

L_T2I = −(1/J) Σ_{j=1}^{J} log [ exp(Z_j^T · P_j^I / τ) / ( exp(Z_j^T · P_j^I / τ) + Σ_{N_t^I ∈ N^I} exp(Z_j^T · N_t^I / τ) ) ]    (6)

In equation (6), L_T2I is the second contrast loss, Z_j^T is the training text vector, P_j^I is the positive example image feature vector, N^I is the set of all negative example image feature vectors in the momentum image feature set, j indexes the j-th pair of data, J is the total number of data pairs, τ is the temperature parameter, and Q^I is the momentum image feature set.
Eleventh, the sum of the first contrast loss L_I2T and the second contrast loss L_T2I is determined as the total loss L.
Twelfth, the parameters of the image-text matching model are adjusted according to the total loss L.
Through this embodiment, the CAAN model trained with the MOCO contrastive learning method learns from a large number of easily obtained, informative negative examples, which substantially improves its representation capability and the accuracy of its image-text matching judgments.
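The following is a minimal PyTorch-style sketch of the momentum contrastive training described in the steps above, not the patent's actual implementation. The encoder modules, queue tensors, and hyper-parameter values (m, τ, queue size) are illustrative assumptions; only the overall flow (the momentum EMA update of the secondary model, the momentum sets Q^I and Q^T built from the positives plus queued negatives, and the two contrast losses L_I2T and L_T2I summed into the total loss L) follows the description above.

import torch
import torch.nn.functional as F

def momentum_update(model, momentum_model, m=0.999):
    # CAAN_m = m * CAAN_m + (1 - m) * CAAN: EMA update of the secondary (momentum) model
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def contrast_losses(img_enc, txt_enc, img_enc_m, txt_enc_m,
                    images, texts, img_queue, txt_queue, tau=0.07):
    # Z_j^I and Z_j^T: features from the trainable encoders
    z_i = F.normalize(img_enc(images), dim=-1)            # (B, D)
    z_t = F.normalize(txt_enc(texts), dim=-1)             # (B, D)
    with torch.no_grad():
        # P_j^I and P_j^T: positive features from the momentum encoders
        p_i = F.normalize(img_enc_m(images), dim=-1)      # (B, D)
        p_t = F.normalize(txt_enc_m(texts), dim=-1)       # (B, D)
    # Q^I and Q^T: positives plus the queued negatives N^I and N^T
    q_i = torch.cat([p_i, img_queue], dim=0)              # (B + K, D)
    q_t = torch.cat([p_t, txt_queue], dim=0)              # (B + K, D)
    # L_I2T: image features scored against the momentum text set
    logits_i2t = z_i @ q_t.t() / tau                      # (B, B + K)
    # L_T2I: text features scored against the momentum image set
    logits_t2i = z_t @ q_i.t() / tau
    # the positive for the j-th sample is the j-th momentum feature
    labels = torch.arange(z_i.size(0), device=z_i.device)
    loss_i2t = F.cross_entropy(logits_i2t, labels)
    loss_t2i = F.cross_entropy(logits_t2i, labels)
    return loss_i2t + loss_t2i, p_i, p_t                  # total loss L and features to push into the queues

In use, after each batch the returned p_i and p_t would be pushed into img_queue and txt_queue (the Push operation in FIG. 6, with the oldest entries dequeued), the total loss would be back-propagated through the trainable encoders only, and momentum_update would then refresh the momentum encoders.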
2033: And determining the CIDEr of the candidate predicted text and the candidate annotation text.
2034: And obtaining the current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, the preset model hyper-parameter and the CIDEr.
In some embodiments, the current reward value that can be obtained by the image description of the intermediate model can be determined by formula (7):

reward = CIDEr + λ·S(I, T)    formula (7)

In formula (7), reward is the current reward value obtained by the image description of the intermediate model, CIDEr is the text similarity, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
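As a small illustration of formula (7), the combined reward could be computed as below. The cider_score and caan_similarity callables and the default value of lam are placeholders standing in for the CIDEr metric, the pre-trained CAAN image-text similarity, and the preset hyper-parameter λ; the patent does not fix concrete implementations or values for them.

def current_reward(pred_text, ref_texts, image, cider_score, caan_similarity, lam=0.2):
    # reward = CIDEr + lambda * S(I, T), as in formula (7); lam stands for the preset hyper-parameter
    cider = cider_score(pred_text, ref_texts)         # text similarity between prediction and annotations
    similarity = caan_similarity(image, pred_text)    # image-text similarity S(I, T) from the CAAN model
    return cider + lam * similarity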
2035: and acquiring a parameter updating gradient of the intermediate model according to the current reward value.
Specifically, a parameter update gradient of the intermediate model can be acquired from the current reward value by using a reinforcement learning method, such as the SCST (self-critical sequence training) method.
The reinforcement learning method provided in the embodiments of the present application is explained below.
In the reinforcement learning method, the LSTMs (each LSTM layer) in the part of the algorithm framework responsible for description generation serve as the Agent; the input image, the generated words, the vocabulary, and everything else outside the Agent serve as the Environment, which interacts with the Agent; the cell states and hidden states of the LSTMs, the attention weights, and so on serve as the State; generating the next word is the Action; all the network parameters θ are regarded as the policy p_θ, which determines how to act (which word to generate next) according to the State; when the end-of-sentence token EOS is generated, the Agent obtains a current reward value, calculated by comparing the generated sentence with the annotated ground truth using the CIDEr score and/or a combination with other metrics.
The reinforcement learning algorithm adopts a policy gradient method, and the training objective is to maximize the expected reward, i.e., to minimize the negative of the expected reward, which can be expressed by formula (8):

L(θ) = -E_{w^s ~ p_θ}[ r(w^s) ]    formula (8)

In practical application, formula (8) can be approximated with a single sample, giving formula (9):

L(θ) ≈ -r(w^s), w^s ~ p_θ    formula (9)
In formula (8) and formula (9), w^s = (w_1^s, …, w_T^s) is the sampled sentence, where w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; L(θ) is the loss function; p_θ is the probability of generating a word when the model parameters are θ; r(w^s) is the reward of the sampled sentence; and E_{w^s ~ p_θ}[r(w^s)] is the expected reward.
The policy gradient algorithm optimizes this objective by computing the gradient of the expected reward, as in formula (10):

∇_θ L(θ) = -E_{w^s ~ p_θ}[ r(w^s) ∇_θ log p_θ(w^s) ]    formula (10)

In formula (10), ∇_θ L(θ) is the loss gradient; w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; p_θ is the probability of generating a word when the model parameters are θ; r(w^s) is the reward of the sampled sentence; E_{w^s ~ p_θ}[·] is the expectation over sampled sentences; and ∇_θ denotes the gradient with respect to θ.
In actual computation the expectation cannot be calculated directly, so it is estimated with Monte Carlo sampling; for each training sample in a batch, the gradient is computed by formula (11):

∇_θ L(θ) ≈ -r(w^s) ∇_θ log p_θ(w^s)    formula (11)

In formula (11), ∇_θ L(θ) is the loss gradient, r(w^s) is the reward of the sampled sentence, ∇_θ denotes the gradient with respect to θ, p_θ is the probability of generating a word when the model parameters are θ, and w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T.
The above formula causes a problem: when a sample w^s is drawn and r(w^s) is positive, pushing the gradient toward the optimal direction also increases p_θ(w^s), so this sample becomes more likely to be sampled in the future while other samples become less likely, merely because they were not sampled at first. This is clearly unfair and may prevent the algorithm from finding an optimal solution. A commonly used remedy is therefore to add a baseline term b, as expressed by formula (12):

∇_θ L(θ) = -E_{w^s ~ p_θ}[ (r(w^s) - b) ∇_θ log p_θ(w^s) ]    formula (12)

b may be any function, as long as it does not change the expected reward gradient. To satisfy this condition, b must be independent of the action, as shown by the derivation in formula (13):

E_{w^s ~ p_θ}[ b ∇_θ log p_θ(w^s) ] = b Σ_{w^s} ∇_θ p_θ(w^s) = b ∇_θ Σ_{w^s} p_θ(w^s) = b ∇_θ 1 = 0    formula (13)

Adding b does not change the expected reward gradient, but it can reduce the variance of the gradient estimate. For each sample, the expected reward gradient can be approximated by formula (14):

∇_θ L(θ) ≈ -(r(w^s) - b) ∇_θ log p_θ(w^s)    formula (14)
In formula (12), formula (13) and formula (14), ∇_θ L(θ) is the loss gradient; w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; p_θ is the probability of generating a word when the model parameters are θ; r(w^s) is the reward of the sampled sentence; E_{w^s ~ p_θ}[·] is the expectation over sampled sentences; ∇_θ denotes the gradient with respect to θ; and b is an arbitrary, action-independent function.
The chain rule described by formula (15) is then applied:

∇_θ L(θ) = Σ_{t=1}^{T} (∂L(θ)/∂s_t)(∂s_t/∂θ)    formula (15)

In formula (15), ∂L(θ)/∂s_t can be expressed by formula (16):

∂L(θ)/∂s_t ≈ (r(w^s) - b)(p_θ(w_t | h_t) - 1_{w_t^s})    formula (16)
Since the core idea of SCST is to use the reward obtained by the currently trained model under its test-time (greedy) inference as the baseline, formula (16) becomes formula (17):

∂L(θ)/∂s_t ≈ (r(w^s) - r(ŵ))(p_θ(w_t | h_t) - 1_{w_t^s})    formula (17)
In formula (15), formula (16) and formula (17), r(ŵ) is the reward obtained by the caption that the currently trained model predicts at test time; s_t is the input to the softmax at time step t; ∂L(θ)/∂s_t is the gradient of the loss with respect to s_t; L(θ) is the loss function; θ denotes the model parameters; w_t is the word generated at the current time step by greedy decoding; h_t is the hidden state of the LSTM network; 1_{w_t^s} is the one-hot indicator of the sampled word w_t^s; w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; p_θ is the probability of generating a word when the model parameters are θ; and r(w^s) is the reward of the sampled sentence.
If the reward of a sentence sampled from the current model is higher than r(ŵ), the probability of sampling that sentence will increase; otherwise it will decrease.
At test time the current model predicts with a greedy search strategy, which is expressed by formula (18):

ŵ_t = argmax_{w_t} p(w_t | h_t)    formula (18)

In formula (18), ŵ_t is the word generated at the current time step by greedy decoding, argmax selects the word with the maximum probability, p denotes the probability, and h_t is the hidden state of the LSTM network.
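Formulas (8) through (18) boil down to a simple training step: sample a caption, decode a greedy caption as the baseline, and weight the sampled caption's log-probability by the reward difference. The sketch below assumes a caption model exposing sample() (multinomial sampling that also returns per-token log-probabilities) and greedy_decode() methods, plus a reward_fn hook returning a reward tensor; these interfaces are illustrative assumptions, not the patent's API.

import torch

def scst_step(model, images, reward_fn, optimizer):
    # One SCST update: the baseline is the reward of the greedy (test-time) caption.
    model.train()
    # w^s ~ p_theta: sampled captions and their per-token log-probabilities
    sampled_caps, log_probs = model.sample(images)         # log_probs: (B, T)
    with torch.no_grad():
        # \hat{w}: greedy captions used as the baseline b = r(\hat{w})
        greedy_caps = model.greedy_decode(images)
        r_sample = reward_fn(sampled_caps)                 # r(w^s), e.g. CIDEr + lambda * S(I, T)
        r_greedy = reward_fn(greedy_caps)                  # r(\hat{w})
        advantage = r_sample - r_greedy                    # (B,)
    # L(theta) ≈ -(r(w^s) - r(\hat{w})) * log p_theta(w^s), as in formulas (14) and (17)
    loss = -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()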
2036: And adjusting the parameters of the intermediate model by using the parameter update gradient.
Through this embodiment, the image-text similarity calculated by the CAAN model pre-trained with the MOCO method is added to the reward optimized by the reinforcement learning method. This preserves the accuracy and recall of the generated sentences, allows them to describe both the general content of the picture and more of its details, and greatly improves how well the generated predicted text matches the picture.
To illustrate the embodiment of the present application more clearly, FIG. 7 exemplarily shows the overall flow corresponding to the training method for an image description model provided by the embodiment of the present application. As shown in FIG. 7, the image description model includes an image feature extraction module, a word embedding module and a description generation module. A candidate image is input into the image feature extraction module to obtain image feature vectors; the image feature vectors and the words produced by the word embedding module are input together into the description generation module to obtain a predicted text (for example, the predicted text is "two players play basketball on a stadium"). The text similarity CIDEr between the predicted text and the annotated ground-truth text (for example, the annotation text is "two players play basketball on a basketball court") is then determined. At the same time, the predicted text is encoded by the text coding module and input, together with the image feature vectors, into the CAAN model trained with the MOCO contrastive learning method (it should be noted that the structure of this CAAN model is identical to that of the image-text matching model shown in FIG. 3) to obtain the image-text similarity (Similarity in FIG. 7) between the candidate image and the predicted text. The image-text similarity and the text similarity CIDEr jointly determine the current reward value reward. Finally, the SCST module generates a parameter adjustment gradient for the image description model according to the current reward value, and the parameters of the image description model are adjusted accordingly. These steps are repeated until the parameters of the image description model converge, at which point training ends. In FIG. 7, FC0 and FC1 both denote fully connected layers.
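Putting the pieces together, the target training step of FIG. 7 can be outlined roughly as follows. This sketch reuses the hypothetical current_reward and scst_step helpers from the sketches above, and converged() stands in for the convergence check on the model parameters; all of these names are assumptions for illustration only, not the patent's implementation.

import torch

def finetune_intermediate_model(model, optimizer, pairs, lam,
                                cider_score, caan_similarity, converged):
    # Sentence-granularity fine-tuning loop corresponding to FIG. 7
    while not converged(model):
        for image, annotation in pairs:                    # candidate image-text pairs
            def reward_fn(captions):
                # reward = CIDEr + lambda * S(I, T) for each caption, as in formula (7)
                return torch.tensor([
                    current_reward(c, [annotation], image,
                                   cider_score, caan_similarity, lam)
                    for c in captions
                ])
            # one SCST update on this candidate image-text pair
            scst_step(model, image.unsqueeze(0), reward_fn, optimizer)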
According to the above embodiment, fine-tuning of the intermediate model at sentence granularity is achieved by performing the target training step on the intermediate model obtained after word-granularity training. The target training step comprises: inputting the candidate image into the intermediate model to obtain a candidate predicted text; determining the image-text similarity between the candidate image and the candidate predicted text; combining the CIDEr of the candidate predicted text and the candidate annotation text with a preset model hyper-parameter to obtain the current reward value obtainable by the image description of the intermediate model; obtaining a parameter update gradient according to the current reward value; and adjusting the model parameters accordingly. In the technical solution provided by the embodiment of the present application, a reinforcement learning method is adopted, and the image-text similarity obtained by the pre-trained CAAN model is added to the SCST reward. This alleviates the problem that the original SCST, optimized only by CIDEr, produces overly uniform descriptions, and using an image-text matching model that has learned from a large number of negative examples as a teaching model effectively improves the description generation capability of the image description model and makes the generated descriptions match the images better.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
FIG. 8 schematically illustrates a structural diagram of a training apparatus for an image description model according to an embodiment of the present application. As shown in FIG. 8, the apparatus has the function of implementing the training method of the image description model, and the function may be implemented by hardware, or may be implemented by hardware executing corresponding software. The apparatus may include: an image-text pair training set acquisition unit 801, a first training unit 802 and a second training unit 803.
An image-text pair training set acquisition unit 801 configured to: And acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image.
A first training unit 802 configured to: and performing word granularity training on the image description model by adopting the image-text pair training set to obtain an intermediate model.
A second training unit 803 configured to: performing a target training step on the intermediate model by using any one of the candidate image-text pairs in the training set until the model parameters of the intermediate model converge, where the candidate image-text pairs include candidate images and candidate annotation texts, and the second training unit 803 includes:
an image description subunit 8031 configured to: and inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation.
An image-text similarity determination subunit 8032 configured to: And determining the image-text similarity of the candidate image and the candidate predicted text.
A text similarity determination subunit 8033 configured to: And determining the CIDEr of the candidate predicted text and the candidate annotation text.
A current reward value obtaining subunit 8034 configured to: And obtaining the current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, the preset model hyper-parameter and the CIDEr.
A parameter update gradient acquisition subunit 8035 configured to: and acquiring a parameter updating gradient of the intermediate model according to the current reward value.
A parameter adjusting subunit 8036 configured to: and updating the parameters of the gradient adjustment intermediate model by using the parameters.
In some embodiments, determining the image-text similarity between the candidate image and the candidate predicted text specifically includes:
and performing text coding on the candidate predicted text to obtain a plurality of word vectors.
Inputting the candidate image and the plurality of word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model is trained on an extended training set using the MOCO learning method, and the extended training set is a data set obtained by performing negative-example extension of images and texts on the image-text pair training set.
In some embodiments, the augmented training set is determined by:
and acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs.
And acquiring a plurality of negative example images corresponding to the candidate annotation texts to obtain a plurality of second negative example image-text pairs.
All the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form an extended training set.
In some embodiments, obtaining a plurality of negative example texts corresponding to the candidate image to obtain a plurality of first negative example image-text pairs specifically includes:
and respectively determining the image-text similarity of the candidate image and other label texts except the candidate label texts in the image-text pair training set.
And acquiring x unmatched texts according to the sequence of the image-text similarity from small to large.
And carrying out scene rewriting on the candidate marked texts to obtain a plurality of rewritten texts.
And acquiring y passing rewriting texts according to the sequence of passing degrees from high to low.
And combining the unmatched text and the fluency rewriting text into a plurality of negative example texts.
The candidate image and each negative example text form a first negative example image-text pair.
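A minimal sketch of this negative-text construction is given below. The similarity_fn, rewrite_scenes and fluency_fn callables are placeholder hooks for the image-text similarity measure, the scene-rewriting step and the fluency scorer, since the patent does not prescribe concrete implementations for them.

def build_first_negative_pairs(candidate_image, candidate_text, all_texts,
                               x, y, similarity_fn, rewrite_scenes, fluency_fn):
    # x unmatched texts: other annotation texts, taken in ascending order of image-text similarity
    others = [t for t in all_texts if t != candidate_text]
    others.sort(key=lambda t: similarity_fn(candidate_image, t))
    unmatched = others[:x]
    # y fluent rewrites: scene-rewritten versions of the candidate annotation text,
    # taken in descending order of fluency
    rewrites = rewrite_scenes(candidate_text)
    rewrites.sort(key=fluency_fn, reverse=True)
    fluent = rewrites[:y]
    # each negative text forms a first negative example image-text pair with the candidate image
    negatives = unmatched + fluent
    return [(candidate_image, neg) for neg in negatives]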
In some embodiments, the image-text matching model is trained using the following MOCO learning method:
inputting the training image into the image-text matching model, and obtaining the training image characteristic vector output by the image coding module in the image-text matching model, wherein the training image is the image in any training image-text pair in the extended training set.
Inputting the training annotation text into the image-text matching model, and acquiring a training text vector output by the text coding module in the image-text matching model, wherein the training annotation text is the positive-example annotation text corresponding to the training image.
Inputting the training image into a momentum image-text matching model, and acquiring a normal image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient.
And inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module.
And combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set.
And inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by the momentum text coding module in the momentum image-text matching model.
And inputting each negative example text corresponding to the training image into the momentum image-text matching model to obtain a plurality of negative example text vectors output by the momentum text coding module.
And combining the positive example text vector and the negative example text vectors into a momentum text vector set.
A first contrast loss is determined based on the training image feature vectors and each vector in the set of momentum text vectors.
A second contrast loss is determined based on the training text vector and each vector in the momentum image feature set.
The sum of the first and second contrast losses is determined as the total loss.
And adjusting parameters of the image-text matching model according to the total loss.
In some embodiments, the momentum image-text matching model is established by the following formula:
CAAN_m = m·CAAN_m + (1 - m)·CAAN
wherein CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
In some embodiments, word granularity training is performed on the image description model by using the image-text pair training set to obtain an intermediate model, specifically:
And for any candidate image-text pair in the image-text pair training set, inputting the candidate image into the image description model for image description to obtain a predicted text.
And determining text loss of the predicted text and the candidate marked-up text.
And adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
In some embodiments, the candidate image is input to an intermediate model for image description to obtain a candidate predicted text, specifically:
and extracting image features of the candidate images to obtain image feature vectors.
And generating an image description text for the image feature vector to obtain a candidate prediction text.
In some embodiments, the current reward value obtained by the image description of the intermediate model is obtained according to the image-text similarity, the preset model hyper-parameter and the CIDEr, and specifically:
the current reward value obtained by the image description of the intermediate model is determined by the following formula:
reward=CIDEr+λS(I,T)
wherein reward is the current reward value obtained by the image description of the intermediate model, CIDEr is the text similarity, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
The technical solution provided by the embodiment of the present application has the following beneficial effects: for any candidate image in the image-text pair training set, the candidate image is first input into the intermediate model obtained after word-granularity training to obtain a candidate predicted text; the candidate image and the candidate predicted text are then input into the pre-trained image-text matching model to determine the image-text similarity; the CIDEr of the candidate predicted text and the candidate annotation text is combined with a preset model hyper-parameter to obtain the current reward value obtainable by the image description of the intermediate model; and a parameter update gradient is obtained according to the current reward value, thereby completing sentence-level fine-tuning of the intermediate model. In the technical solution provided by the embodiment of the present application, the image-text similarity and the CIDEr jointly serve as the reference standard for the image description model, so that the trained image description model can generate predicted description texts that match the actual images more closely, improving the prediction precision of the image description model.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A training method for an image description model, the training method comprising:
acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the target training step comprises the following steps:
inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
determining the image-text similarity of the candidate image and the candidate predicted text;
determining the CIDEr of the candidate prediction text and the candidate annotation text;
obtaining a current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, a preset model hyper-parameter and the CIDEr;
acquiring a parameter updating gradient of the intermediate model according to the current reward value;
and adjusting the parameters of the intermediate model by using the parameter updating gradient.
2. The training method of claim 1, wherein the determining the image-text similarity of the candidate image and the candidate predicted text comprises:
performing text coding on the candidate predicted text to obtain a plurality of word vectors;
inputting the candidate image and the word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model completes training by using an extended training set and adopting an MOCO learning method, and the extended training set is a data set obtained by carrying out negative example extension on the image-text pair training set.
3. Training method according to claim 2, wherein the extended training set is determined by:
acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs;
obtaining a plurality of negative example images corresponding to the candidate annotation text to obtain a plurality of second negative example image-text pairs;
and all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form the extended training set.
4. The training method of claim 3, wherein the obtaining a plurality of negative examples texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs comprises:
respectively determining the image-text similarity of the candidate image and other annotation texts in the image-text pair training set except the candidate annotation text;
acquiring x unmatched texts in ascending order of the image-text similarity;
performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts;
acquiring y fluent rewritten texts in descending order of fluency;
merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts;
the candidate image and each negative example text form a first negative example image-text pair.
5. A training method as claimed in claim 3, wherein the image-text matching model is trained using the following MOCO learning method:
inputting a training image into the image-text matching model to obtain a training image feature vector output by an image coding module in the image-text matching model, wherein the training image is an image in any training image-text pair in the extended training set;
inputting a training annotation text into the image-text matching model, and acquiring a training text vector output by a text coding module in the image-text matching model, wherein the training annotation text is the positive-example annotation text corresponding to the training image;
inputting the training image into a momentum image-text matching model to obtain a positive example image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient;
inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module;
combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set;
inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by a momentum text coding module in the momentum image-text matching model;
inputting each negative example text corresponding to the training image into the momentum image-text matching model, and acquiring a plurality of negative example text vectors output by the momentum text coding module;
merging the positive example text vector and the negative example text vectors into a momentum text vector set;
determining a first contrast loss according to the training image feature vector and each vector in the momentum text vector set;
determining a second contrast loss according to the training text vector and each vector in the momentum image feature set;
determining a sum of the first and second comparison losses as a total loss;
and adjusting parameters of the image-text matching model according to the total loss.
6. The training method of claim 5, wherein the momentum image-text matching model is established by the following formula:
CAAN_m = m·CAAN_m + (1 - m)·CAAN
wherein CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
7. The training method of claim 1, wherein performing word granularity training on the image description model using the image-text pair training set to obtain an intermediate model comprises:
inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in the image-text pair training set to obtain a predicted text;
determining text loss of the predicted text and the candidate annotation text;
and adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
8. The training method of claim 1, wherein the inputting the candidate images into the intermediate model for image description to obtain candidate predicted texts comprises:
extracting image features of the candidate images to obtain image feature vectors;
and generating an image description text for the image feature vector to obtain a candidate prediction text.
9. The training method according to claim 1, wherein obtaining the current reward value obtained by the image description of the intermediate model according to the image-text similarity, a preset model hyper-parameter and the CIDEr comprises:
determining the current reward value obtainable by the image description of the intermediate model by the following formula:
reward=CIDEr+λS(I,T)
wherein reward is the current reward value obtained by the image description of the intermediate model, CIDEr is the text similarity, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
10. An apparatus for training an image description model, the apparatus comprising:
an image-text pair training set acquisition unit configured to: acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
a first training unit configured to: performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
a second training unit configured to: performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the second training unit comprises:
an image description subunit configured to: inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
an image-text similarity determination subunit configured to: determining the image-text similarity of the candidate image and the candidate predicted text;
a text similarity determination subunit configured to: determining CIDER of the candidate prediction text and the candidate annotation text;
a current reward value obtaining subunit configured to: obtaining a current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, a preset model hyper-parameter and the CIDEr;
a parameter update gradient acquisition subunit configured to: acquiring a parameter updating gradient of the intermediate model according to the current reward value;
a parameter adjustment subunit configured to: and adjusting the parameters of the intermediate model by using the parameter updating gradient.
CN202111341668.9A 2021-11-12 2021-11-12 Training method and training device for image description model Pending CN114090815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111341668.9A CN114090815A (en) 2021-11-12 2021-11-12 Training method and training device for image description model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111341668.9A CN114090815A (en) 2021-11-12 2021-11-12 Training method and training device for image description model

Publications (1)

Publication Number Publication Date
CN114090815A true CN114090815A (en) 2022-02-25

Family

ID=80300434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111341668.9A Pending CN114090815A (en) 2021-11-12 2021-11-12 Training method and training device for image description model

Country Status (1)

Country Link
CN (1) CN114090815A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647789A (en) * 2022-03-31 2022-06-21 腾讯科技(深圳)有限公司 Method for determining recommendation model and related device
CN114818654A (en) * 2022-05-10 2022-07-29 中国工商银行股份有限公司 Text processing method, text feature extraction method, device, equipment and medium
CN114973226A (en) * 2022-05-13 2022-08-30 上海大学 Training method for text recognition system in natural scene of self-supervision contrast learning
CN114973226B (en) * 2022-05-13 2024-09-24 上海大学 Training method for text recognition system in self-supervision contrast learning natural scene
CN115035304A (en) * 2022-05-31 2022-09-09 中国科学院计算技术研究所 Image description generation method and system based on course learning
CN115525281A (en) * 2022-10-12 2022-12-27 广州宏天软件股份有限公司 Form interactive graph display and selection method
CN115525281B (en) * 2022-10-12 2023-06-27 广州宏天软件股份有限公司 Form interactive graph display and selection method
WO2024187949A1 (en) * 2023-03-15 2024-09-19 华为技术有限公司 Image description generation method and electronic device
CN117727044A (en) * 2023-03-29 2024-03-19 书行科技(北京)有限公司 Training method, device, equipment and storage medium of attribute identification model
CN116108156B (en) * 2023-04-07 2023-06-09 四川大学 Topic law retrieval method based on cyclic association robust learning
CN116108156A (en) * 2023-04-07 2023-05-12 四川大学 Topic law retrieval method based on cyclic association robust learning
CN116580283B (en) * 2023-07-13 2023-09-26 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN117235534A (en) * 2023-11-13 2023-12-15 支付宝(杭州)信息技术有限公司 Method and device for training content understanding model and content generating model
CN117235534B (en) * 2023-11-13 2024-02-20 支付宝(杭州)信息技术有限公司 Method and device for training content understanding model and content generating model

Similar Documents

Publication Publication Date Title
CN114090815A (en) Training method and training device for image description model
CN112487182B (en) Training method of text processing model, text processing method and device
CN107133211B (en) Composition scoring method based on attention mechanism
Li et al. Visual question generation as dual task of visual question answering
CN109918510B (en) Cross-domain keyword extraction method
CN111858931B (en) Text generation method based on deep learning
CN110737801A (en) Content classification method and device, computer equipment and storage medium
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN110489567B (en) Node information acquisition method and device based on cross-network feature mapping
CN109919221B (en) Image description method based on bidirectional double-attention machine
Cascianelli et al. Full-GRU natural language video description for service robotics applications
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
Mohamad Nezami et al. Towards generating stylized image captions via adversarial training
CN110807069B (en) Entity relationship joint extraction model construction method based on reinforcement learning algorithm
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN113220891B (en) Method for generating confrontation network image description based on unsupervised concept-to-sentence
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN114398976A (en) Machine reading understanding method based on BERT and gate control type attention enhancement network
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN113822125A (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN112527993A (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112116685A (en) Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism
Zhu et al. Multiscale temporal network for continuous sign language recognition
Guo et al. Matching visual features to hierarchical semantic topics for image paragraph captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination