CN114090815A - Training method and training device for image description model

Info

Publication number
CN114090815A
CN114090815A (application number CN202111341668.9A)
Authority
CN
China
Prior art keywords
image
text
training
model
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111341668.9A
Other languages
Chinese (zh)
Inventor
曹晚霞
朱飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Electronic Technology Wuhan Co., Ltd.
Original Assignee
Hisense Electronic Technology Wuhan Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co., Ltd.
Priority to CN202111341668.9A
Publication of CN114090815A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The application discloses a training method and a training device for an image description model. The training method comprises the following steps: for any candidate image in the image-text pair training set, the candidate image is first input into the image description model that has completed word-granularity training to obtain a candidate predicted text; the candidate image and the candidate predicted text are then input into a pre-trained image-text matching model to determine the image-text similarity; the CIDEr score of the candidate predicted text against the candidate annotation text and the image-text similarity are then added in a preset proportion to obtain the current reward value; a parameter update gradient is obtained from the current reward value, thereby completing sentence-level fine-tuning of the image description model that has completed word-granularity training. The training method as a whole uses a reinforcement learning approach to link the pre-trained image-text matching model with the image description model, so that the trained image description model can generate predicted description texts that match the actual image more closely, which improves the prediction accuracy of the image description model.

Description

Training method and training device for image description model
Technical Field
The present disclosure relates to the field of display technologies, and in particular, to a training method and a training device for an image description model.
Background
Image description refers to generating, from an image, a description text that introduces the content of the image; similar to looking at a picture and talking about it, it has been a research focus in recent years. Applying image description to smart devices (such as smart televisions) can improve the interaction experience between the user and the smart device. However, although describing an image is simple for people, the scenes involved in images are numerous and the attributes of the entities in an image and the relationships among those entities vary widely, so accurately describing an image is full of challenges for a smart device.
The smart device can generate a predicted description text for an image using a trained image description model. The image description model mainly adopts an encoder-decoder structure, and an attention mechanism is introduced in the decoding process so that the corresponding target region in the image is attended to when a target word in the predicted description text is generated. At present, the image description model is trained as follows: word-granularity training is performed on the image description model using an image-text pair training set, and after the parameters of the image description model converge, sentence-granularity training is performed on the image description model using the image-text pair training set. In sentence-granularity training, the candidate image in any candidate image-text pair in the image-text pair training set is input into the image description model to obtain the corresponding candidate predicted text, the CIDEr score (a metric based on the similarity of the texts' TF-IDF vectors) of the candidate predicted text and the candidate annotation text corresponding to the candidate image is determined, and the parameters of the image description model are adjusted with a reinforcement learning method (for example, self-critical sequence training): the CIDEr score is used as the current reward value obtainable by the image description model, the parameter update gradient of the image description model is obtained according to the current reward value, the updated parameters of the image description model are determined according to the parameter update gradient, and the parameters of the image description model are adjusted continuously until training is completed.
In this training process, CIDEr is the main criterion for adjusting the model parameters. However, for the same candidate image, it may happen that the CIDEr score between the candidate predicted text and the candidate annotation text is low although their actual meanings are the same, or that the CIDEr score is high although their actual meanings are completely different. As a result, the predicted description text generated by an image description model trained with this method may match the actual image poorly.
Disclosure of Invention
The embodiment of the application provides a training method and a training device for an image description model, which can improve the matching degree of a prediction description text generated by the image description model and an actual image.
In a first aspect, an embodiment of the present application provides a training method for an image description model, where the training method includes:
acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until the model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the target training step comprises the following steps:
inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
determining the image-text similarity of the candidate image and the candidate predicted text;
determining the CIDEr score of the candidate predicted text and the candidate annotation text;
obtaining, according to the image-text similarity, a preset model hyper-parameter and the CIDEr score, a current reward value obtainable by the intermediate model for image description;
acquiring a parameter updating gradient of the intermediate model according to the current reward value;
and adjusting the parameters of the intermediate model by using the parameter updating gradient.
In some embodiments, the determining the image-text similarity between the candidate image and the candidate predicted text comprises:
performing text coding on the candidate predicted text to obtain a plurality of word vectors;
inputting the candidate image and the word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model is trained with an extended training set using the MoCo learning method, and the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
In some examples, the extended training set is determined by:
acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs;
obtaining a plurality of negative example images corresponding to the candidate annotation text to obtain a plurality of second negative example image-text pairs;
and all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form the extended training set.
In some embodiments, the obtaining a plurality of negative example texts corresponding to the candidate image to obtain a plurality of first negative example image-text pairs includes:
respectively determining the image-text similarity of the candidate image and other label texts in the image-text pair training set except the candidate label text;
acquiring x unmatched texts in ascending order of the image-text similarity;
performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts;
acquiring y fluent rewritten texts in descending order of fluency;
merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts;
the candidate image and each negative example text form a first negative example image-text pair.
In some embodiments, the image-text matching model is trained using the following MoCo learning method:
inputting a training image into the image-text matching model to obtain a training image feature vector output by an image coding module in the image-text matching model, wherein the training image is an image in any training image-text pair in the extended training set;
inputting a training annotation text into the image-text matching model, and acquiring a training text vector output by a text coding module in the image-text matching model, wherein the training annotation text is the positive example annotation text corresponding to the training image;
inputting the training image into a momentum image-text matching model to obtain a positive example image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient;
inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module;
combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set;
inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by a momentum text coding module in the momentum image-text matching model;
inputting each negative example text corresponding to the training image into the momentum image-text matching model, and acquiring a plurality of negative example text vectors output by the momentum text coding module;
merging the positive example text vector and the negative example text vectors into a momentum text vector set;
determining a first contrast loss according to the training image feature vector and each vector in the momentum text vector set;
determining a second contrast loss according to the training text vector and each vector in the momentum image feature set;
determining the sum of the first contrast loss and the second contrast loss as a total loss;
and adjusting parameters of the image-text matching model according to the total loss.
In some embodiments, the momentum image-text matching model is established by the following formula:
CAAN_m = m · CAAN_m + (1 − m) · CAAN
where CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
In some embodiments, performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model includes:
inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in the image-text pair training set to obtain a predicted text;
determining text loss of the predicted text and the candidate annotation text;
and adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
In some embodiments, the inputting the candidate image into the intermediate model for image description to obtain a candidate predicted text includes:
extracting image features of the candidate images to obtain image feature vectors;
and generating an image description text for the image feature vector to obtain a candidate prediction text.
In some embodiments, the obtaining, according to the image-text similarity, a preset model hyper-parameter, and the CIDEr, a current reward value that can be obtained by the intermediate model for image description includes:
determining the current reward value obtainable by the intermediate model for image description by the following formula:
reward = CIDEr + λ·S(I, T)
where reward is the current reward value obtainable by the intermediate model for image description, CIDEr is the CIDEr score (text similarity) between the candidate predicted text and the candidate annotation text, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
In a second aspect, an embodiment of the present application provides a training apparatus for an image description model, where the training apparatus includes:
an image-text pair training set acquisition unit configured to: acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
a first training unit configured to: performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
a second training unit configured to: performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until the model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the second training unit comprises:
an image description subunit configured to: inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
a teletext similarity determination subunit configured to: determining the image-text similarity of the candidate image and the candidate predicted text;
a text similarity determination subunit configured to: determining the CIDEr score of the candidate predicted text and the candidate annotation text;
a current reward value obtaining subunit configured to: obtaining, according to the image-text similarity, a preset model hyper-parameter and the CIDEr score, a current reward value obtainable by the intermediate model for image description;
a parameter update gradient acquisition subunit configured to: acquiring a parameter updating gradient of the intermediate model according to the current reward value;
a parameter adjustment subunit configured to: and adjusting the parameters of the intermediate model by using the parameter updating gradient.
In some embodiments, the determining the image-text similarity between the candidate image and the candidate predicted text specifically includes:
performing text coding on the candidate predicted text to obtain a plurality of word vectors;
inputting the candidate image and the word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model is trained with an extended training set using the MoCo learning method, and the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
In some embodiments, the extended training set is determined by:
acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs;
obtaining a plurality of negative example images corresponding to the candidate annotation text to obtain a plurality of second negative example image-text pairs;
and all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form the extended training set.
In some embodiments, the obtaining of the multiple negative example texts corresponding to the candidate image to obtain multiple first negative example image-text pairs specifically includes:
respectively determining the image-text similarity of the candidate image and other label texts in the image-text pair training set except the candidate label text;
acquiring x unmatched texts in ascending order of the image-text similarity;
performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts;
acquiring y fluent rewritten texts in descending order of fluency;
merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts;
the candidate image and each negative example text form a first negative example image-text pair.
In some embodiments, the image-text matching model is trained using the following MoCo learning method:
inputting a training image into the image-text matching model to obtain a training image feature vector output by an image coding module in the image-text matching model, wherein the training image is an image in any training image-text pair in the extended training set;
inputting a training annotation text into the image-text matching model, and acquiring a training text vector output by a text coding module in the image-text matching model, wherein the training annotation text is the positive example annotation text corresponding to the training image;
inputting the training image into a momentum image-text matching model to obtain a positive example image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient;
inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module;
combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set;
inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by a momentum text coding module in the momentum image-text matching model;
inputting each negative example text corresponding to the training image into the momentum image-text matching model, and acquiring a plurality of negative example text vectors output by the momentum text coding module;
merging the positive example text vector and the negative example text vectors into a momentum text vector set;
determining a first contrast loss according to the training image feature vector and each vector in the momentum text vector set;
determining a second contrast loss according to the training text vector and each vector in the momentum image feature set;
determining the sum of the first contrast loss and the second contrast loss as a total loss;
and adjusting parameters of the image-text matching model according to the total loss.
In some embodiments, the momentum image-text matching model is established by the following formula:
CAAN_m = m · CAAN_m + (1 − m) · CAAN
where CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
In some embodiments, the word granularity training of the image description model by using the image-text pair training set is performed to obtain an intermediate model, and specifically, the word granularity training is performed by:
inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in the image-text pair training set to obtain a predicted text;
determining text loss of the predicted text and the candidate annotation text;
and adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
In some embodiments, the inputting the candidate image into the intermediate model for image description to obtain a candidate predicted text specifically includes:
extracting image features of the candidate images to obtain image feature vectors;
and generating an image description text for the image feature vector to obtain a candidate prediction text.
In some embodiments, the obtaining, according to the image-text similarity, a preset model hyper-parameter, and the CIDEr, a current reward value that can be obtained by the intermediate model performing image description specifically includes:
determining the current reward value obtainable by the intermediate model for image description by the following formula:
reward = CIDEr + λ·S(I, T)
where reward is the current reward value obtainable by the intermediate model for image description, CIDEr is the CIDEr score (text similarity) between the candidate predicted text and the candidate annotation text, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
The technical solution provided by the embodiment of the application has the following beneficial effects: for any candidate image in the image-text pair training set, the candidate image is first input into the intermediate model obtained after word-granularity training to obtain a candidate predicted text; the candidate image and the candidate predicted text are then input into a pre-trained image-text matching model to determine the image-text similarity; the CIDEr score of the candidate predicted text and the candidate annotation text is then combined with a preset model hyper-parameter and the image-text similarity to obtain the current reward value obtainable by the intermediate model for image description; and the parameter update gradient is obtained according to the current reward value, thereby completing sentence-level fine-tuning of the intermediate model. In the technical solution provided by the embodiment of the application, the image-text similarity and the CIDEr score jointly serve as the reference criterion of the image description model, so that the trained image description model can generate predicted description texts that match the actual image more closely, which improves the prediction accuracy of the image description model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings.
FIG. 1 is a diagram illustrating an architecture of a decoder in an image description model according to an embodiment of the present application;
fig. 2 shows a flowchart corresponding to a training method for an image description model according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an architecture of an image-text matching model provided in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a scene rewriting flow of candidate annotation texts according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an architecture of MoCo contrastive learning provided by an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a training process of an image-text matching model provided in an embodiment of the present application;
fig. 7 is a schematic overall flow chart corresponding to a training method for an image description model according to an embodiment of the present disclosure;
fig. 8 shows a schematic structural diagram of a training apparatus for an image description model according to an embodiment of the present application.
Detailed Description
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The following first explains an image description model in an embodiment of the present application with reference to the drawings.
The main algorithmic framework of the image description model is an encoder-decoder framework: an object detection algorithm (e.g., Faster RCNN) or an image feature extraction algorithm (e.g., Resnet) is first adopted to encode the image, and a decoder then decodes it to generate the text description corresponding to the image. An attention module is introduced in the decoding process; through a top-down attention mechanism, the relevant part of the image is selected according to the current context, so that the corresponding image region is attended to when the word for that region is generated. Fig. 1 schematically shows an architecture diagram of a decoder in an image description model provided by an embodiment of the present application. As shown in fig. 1, the decoder adopts a three-layer LSTM (long short-term memory network) structure, where the three LSTM layers are LSTM0, LSTM1 and LSTM2, respectively, and the input of the i-th LSTM layer includes two parts, x_t^i and h_{t-1}^i, with

h_t^i = LSTM(x_t^i, h_{t-1}^i)

where the layer index i of the LSTM is 0, 1 or 2; t and t-1 are the current time and the previous time, respectively; x_t^i is the input of the i-th layer at the current time; h_{t-1}^i is the output of the i-th layer at the previous time; and h_t^i is the output of the i-th LSTM layer at the current time. Fully connected is the fully connected layer, y_t is the result output by the fully connected layer, and Softmax is the standard classification operation for neural networks. In generating each word, the input of the LSTM takes into account both the already generated description information and the relevant image information selected by the attention mechanism (i.e., Attention in fig. 1).
For LSTM0, the input of layer 0 at the current time may be represented by equation (1):

x_t^0 = [h_{t-1}^2, v̄, W_e Π_t]    (1)

In equation (1), x_t^0 is the input to layer 0 at the current time; [·] is the vector concatenation symbol; h_{t-1}^2 is the output of layer 2 (i.e., LSTM2) at the previous time; v̄ = (1/K) Σ_i v_i is the average of the K image region feature vectors derived from Faster R-CNN or grid features, where v_i is an image region feature vector; and W_e Π_t is the word vector of the current input word (W_e ∈ R^{E×|Σ|} is the word vector matrix over the dictionary Σ, and Π_t is the one-hot encoding of the input word at time t). LSTM0 fuses the output of the previous time step while fusing both image and text modality information. The output of LSTM0 is h_t^0, which will be part of the input of the next LSTM layer.
For LSTM1, the input of layer 1 at the current time may be represented by equation (2):

x_t^1 = [v̂_t, h_t^0]    (2)

In equation (2), x_t^1 is the input of layer 1 at the current time; v̂_t = Σ_i a_{i,t} v_i is the weighted average of the K image region feature vectors derived from Faster R-CNN or grid features, where v_i is an image region feature vector and a_{i,t} is the attention weight, which depends on the image and on the description generated before time t, indicates which targets in the image should be attended to when the word at time t is generated, and filters the image regions that need attention according to the previously generated description so as to generate the word at the current time; [·] is the vector concatenation symbol; and h_t^0 is the output of layer 0 at the current time. The output of LSTM1 is h_t^1, which will be part of the input of the next LSTM layer.
For LSTM2, the input of layer 2 at the current time may be represented by equation (3):

x_t^2 = [v̂_t, f(h_t^1)]    (3)

In equation (3), x_t^2 is the input of layer 2 at the current time; v̂_t is the weighted average of the K image region feature vectors derived from Faster R-CNN or grid features, where v_i is an image region feature vector; and f(·) is a fully connected network. The output of LSTM2 is h_t^2. Finally, the outputs of the three LSTM layers are concatenated as [h_t^0, h_t^1, h_t^2] and passed through a fully connected network to obtain y_t, and the conditional probability distribution of the current word is then obtained through a Softmax layer, finally giving the probability distribution of the predicted sequence. The objective function of the image description model is to minimize the cross-entropy loss between the probability distributions of the predicted sequence and the annotated sequence. The model fuses image information at every step of generating the description, and the attention mechanism in the multi-layer LSTM that extracts image and text information guides the model to attend to the image regions relevant to sentence generation, so that the generated result fits the image better.
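To make the three-layer decoding step above concrete, the following is a minimal PyTorch-style sketch of one decoding time step. It is an illustrative reading of equations (1) to (3) rather than code from the patent, and names such as TripleLSTMDecoder and the simplified attention scorer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleLSTMDecoder(nn.Module):
    """Illustrative 3-layer LSTM decoder: x_t^0 = [h_{t-1}^2, v_mean, W_e*Pi_t],
    x_t^1 = [v_hat, h_t^0], x_t^2 = [v_hat, f(h_t^1)], y_t from the concatenated outputs."""
    def __init__(self, vocab_size, emb_dim=512, feat_dim=2048, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word vector matrix W_e
        self.lstm0 = nn.LSTMCell(hid + feat_dim + emb_dim, hid)  # fuses text + global image info
        self.lstm1 = nn.LSTMCell(feat_dim + hid, hid)            # attended image + h_t^0
        self.lstm2 = nn.LSTMCell(feat_dim + hid, hid)            # attended image + f(h_t^1)
        self.att = nn.Linear(feat_dim + hid, 1)                  # simplified a_{i,t} scorer
        self.f = nn.Linear(hid, hid)                             # fully connected network f(.)
        self.out = nn.Linear(3 * hid, vocab_size)                # maps [h^0, h^1, h^2] to y_t

    def step(self, word_ids, V, states):
        # V: (B, K, feat_dim) region features; states: [(h, c)] for the three layers.
        (h0, c0), (h1, c1), (h2, c2) = states
        v_mean = V.mean(dim=1)                                   # average of the K region vectors
        x0 = torch.cat([h2, v_mean, self.embed(word_ids)], dim=-1)
        h0, c0 = self.lstm0(x0, (h0, c0))
        # attention weights a_{i,t}, conditioned on the regions and on h_t^0
        scores = self.att(torch.cat([V, h0.unsqueeze(1).expand(-1, V.size(1), -1)], dim=-1))
        a = F.softmax(scores, dim=1)                             # (B, K, 1)
        v_hat = (a * V).sum(dim=1)                               # weighted average of regions
        h1, c1 = self.lstm1(torch.cat([v_hat, h0], dim=-1), (h1, c1))
        h2, c2 = self.lstm2(torch.cat([v_hat, self.f(h1)], dim=-1), (h2, c2))
        log_probs = F.log_softmax(self.out(torch.cat([h0, h1, h2], dim=-1)), dim=-1)
        return log_probs, [(h0, c0), (h1, c1), (h2, c2)]
```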
The image description model is first trained at word granularity according to the cross-entropy loss of the language model, and the model is then further optimized at sentence granularity with reinforcement learning, for example the SCST (Self-Critical Sequence Training) method, using CIDEr as the reinforcement learning reward. CIDEr treats each sentence as a document, represents it as a TF-IDF (term frequency-inverse document frequency) vector, and then computes the cosine similarity between the reference image description and the image description generated by the model. However, a high word-level overlap between the generated description and the annotated description does not necessarily mean that the two sentences are semantically similar, so in some cases the matching degree between the description generated by the image description model and the image is not high.
In order to improve the matching degree between a prediction description text generated by an image description model and an actual image, the embodiment of the application provides a training method of the image description model. Fig. 2 exemplarily shows a flow diagram corresponding to a training method for an image description model provided in an embodiment of the present application, and as shown in fig. 2, the method specifically includes the following steps:
201: and acquiring a picture-text pair training set.
Specifically, the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image. The annotation text in an image-text pair is obtained by annotating the image professionally, for example by expert annotation.
Illustratively, a picture of playing basketball and the corresponding text annotation "two players play basketball on the court" together form one image-text pair.
202: and performing word granularity training on the image description model by adopting the image-text pair training set to obtain an intermediate model.
In some embodiments, the image description model may be trained at word granularity using the image-text pair training set to obtain the intermediate model as follows:
firstly, inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in an image-text pair training set to obtain a predicted text.
Specifically, the image description includes image feature extraction and image description text generation. The candidate teletext pair comprises a candidate image and a candidate annotation text describing the content of the candidate image.
The candidate image is input into the image description model to be trained, and image feature extraction is performed to obtain an image feature vector. The image feature extraction comprises grid feature extraction, adaptive average pooling and overall averaging. The grid feature extraction may adopt a pre-trained Resnet101 model, without particular limitation.
Illustratively, the feature map output by the last layer of the grid feature extraction is subjected to adaptive average pooling and overall averaging respectively: the adaptive average pooling divides it into a 7 × 7 grid to obtain 7 × 7 × 2048-dimensional partitioned region image features, and the overall averaging averages the 49 partitioned region features to obtain a 2048-dimensional image feature vector.
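As a rough sketch of this grid feature extraction step (assuming a torchvision ResNet-101 backbone and the 7 × 7 pooling of the example above; the input size and preprocessing are illustrative assumptions):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal sketch: take the last convolutional feature map of a pre-trained
# ResNet-101, then derive the 7 x 7 x 2048 partitioned region features
# (adaptive average pooling) and the 2048-d overall average feature vector.
backbone = models.resnet101(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
feature_extractor.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 448, 448)            # stand-in for a preprocessed candidate image
    fmap = feature_extractor(image)                # (1, 2048, 14, 14) last-layer feature map
    grid = F.adaptive_avg_pool2d(fmap, (7, 7))     # (1, 2048, 7, 7) partitioned region features
    regions = grid.flatten(2).transpose(1, 2)      # (1, 49, 2048): 49 region feature vectors
    global_vec = regions.mean(dim=1)               # (1, 2048) overall average image feature vector
```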
After the image feature extraction is performed, the image description text generation is performed again. When the image description text is generated, a word segmentation tool (for example, jieba) can be used for segmenting the candidate annotation text of the candidate image, and a plurality of word embedding vectors are obtained in a word embedding manner. And finally, obtaining a predicted text of the candidate image according to the word embedding vector and the image characteristic vector.
Then, text loss of the predicted text and the candidate annotation text is determined.
And finally, adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
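A minimal sketch of this word-granularity (cross-entropy) training loop is given below; model, train_loader and the hyper-parameter values are placeholders standing in for the components described above, not settings specified by the patent.

```python
import torch
import torch.nn as nn

def train_word_granularity(model, train_loader, epochs=30, lr=5e-4, pad_id=0):
    """Teacher-forced cross-entropy training at word granularity, repeated
    until the image description model's parameters converge."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, captions in train_loader:           # captions: (B, T) annotation token ids
            logits = model(images, captions[:, :-1])    # predict the next word at each step
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```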
203: and performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until the model parameters of the intermediate model are converged. Wherein the target training step comprises:
2031: and inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text.
Specifically, the image description includes image feature extraction and image description text generation.
Inputting the candidate image into the intermediate model, firstly, carrying out image feature extraction on the candidate image to obtain an image feature vector, and then carrying out image description text generation on the image feature vector to obtain a candidate prediction text.
2032: and determining the image-text similarity of the candidate image and the candidate predicted text.
In some embodiments, the teletext similarity of the candidate image and the candidate predicted text may be determined by:
firstly, text coding is carried out on the candidate predicted texts to obtain a plurality of word vectors.
In some embodiments, text encoding may be implemented with a bidirectional GRU (gated recurrent unit) or an LSTM (long short-term memory network), and a pre-trained BERT model (a pre-trained text representation model) may also be used; a minimal encoding sketch is given after these steps.
And secondly, inputting the candidate image and the plurality of word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text.
The image-text matching model is trained with an extended training set using the MoCo learning method, where the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
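A minimal sketch of the text-encoding step mentioned above, using a bidirectional GRU (the embedding and hidden sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode a candidate predicted text into one word vector per token with a Bi-GRU."""
    def __init__(self, vocab_size, emb_dim=300, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                        # (B, T) word ids of the predicted text
        out, _ = self.bigru(self.embed(token_ids))       # (B, T, 2 * hid)
        # Average the forward and backward directions to get one vector per word.
        return out.view(out.size(0), out.size(1), 2, -1).mean(dim=2)   # (B, T, hid)
```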
The following describes a graph-text matching model provided in the embodiment of the present application.
Fig. 3 exemplarily shows an architecture diagram of the image-text matching model provided by the embodiment of the present application. As shown in fig. 3, A is the image coding branch of the image-text matching model (also called the CAAN model): the candidate image is encoded into image feature vectors through a bottom-up attention mechanism (i.e., Bottom-up), the image feature vectors are denoted by V, and K denotes the number of image feature vectors. B is the text coding branch of the CAAN model: a Bi-GRU (bidirectional gated recurrent neural network) is adopted to encode the candidate predicted text and obtain a plurality of word vectors, denoted by U. C is the processing branch of the CAAN model: a similarity matrix H between the words in the candidate predicted text and the regions in the candidate image is computed from the image feature vectors V and the word vectors U; the inter-modal attention and the intra-modal attention weights are then computed in turn; and the image feature vectors and word vectors, weighted by the intra-modal and inter-modal attention weights, are aggregated and mapped into the multi-modal joint space to obtain the picture vector ṽ and the text vector ũ in the multi-modal joint space, where f(V, U) represents the attention weight of each image region and g(V, U) represents the attention weight of each word of the text. The similarity between the picture vector ṽ and the text vector ũ is the image-text similarity. In fig. 3, Element-wise Sum indicates element-wise summation, Element-wise Product indicates element-wise product, and Matrix Multiplication indicates matrix multiplication.
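At matching time, the image-text similarity S(I, T) is the similarity between the two joint-space vectors; a minimal sketch follows, where the use of cosine similarity is an assumption since the text above only speaks of "similarity".

```python
import torch
import torch.nn.functional as F

def image_text_similarity(v_tilde: torch.Tensor, u_tilde: torch.Tensor) -> torch.Tensor:
    """Similarity between the joint-space picture vector and text vector (cosine assumed)."""
    return F.cosine_similarity(v_tilde, u_tilde, dim=-1)
```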
The image-text matching model provided by the embodiment of the application is trained with an extended training set, where the extended training set is a data set obtained by performing negative example extension on the image-text pair training set.
In some embodiments, the augmented training set may be determined by:
the method comprises the following steps of firstly, obtaining a plurality of negative example texts corresponding to candidate images to obtain a plurality of first negative example image-text pairs.
In some embodiments, the first negative example teletext pair may be obtained by:
step one, respectively determining the image-text similarity of the candidate image and other label texts except the candidate label text in the image-text pair training set.
Specifically, the graph-text matching model shown in fig. 3 may be used to determine the graph-text similarity.
And step two, acquiring x unmatched texts according to the sequence of the image-text similarity from small to large.
That is, the x annotation texts that least match the candidate image are taken as unmatched texts.
And step three, performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts.
Specifically, the candidate annotation text may be parsed into a text scene graph, where the text scene graph is composed of entities, attributes and relationships; the entities are represented by rectangles, the attributes by diamonds, and the relationships by circles. Each parsed word is then replaced with a synonym using word dictionaries, which include an entity dictionary, an attribute dictionary and a relationship dictionary. Scene rewriting of the candidate annotation text is achieved by replacing the corresponding words in the candidate annotation text with synonyms from the corresponding dictionary, obtaining a plurality of rewritten texts corresponding to the candidate image. For example, a synonym of the noun "player" may be "athlete", the numeral "two" may be replaced by "a pair of", and the text "two players play basketball on the court" may be rewritten as "two athletes playing basketball on the court".
And step four, acquiring y fluent rewritten texts in descending order of fluency.
Illustratively, the candidate annotation text is "two players play basketball on the court". Fig. 4 exemplarily shows a scene rewriting flow diagram of the candidate annotation text provided by the embodiment of the present application. As shown in fig. 4, the entities in the candidate annotation text are "player", "court" and "basketball", the relationships are "on", "up" and "playing", and the attribute is "two". The pre-established attribute dictionary comprises "one", "beautiful", "handsome" and the like, the entity dictionary comprises "man", "worker", "dog" and the like, and the relationship dictionary comprises "kick", "sit", "stand" and the like. A synonym is randomly selected from the entity dictionary to replace the entity "player" (for example, "player" may be replaced by "man"), and attribute words and relationship words are replaced in the same way. Each word in the sentence is replaced with a probability of 1/3, each sentence is randomly rewritten many times, and the 100 most fluent sentences are selected as the scene-rewritten texts of the candidate annotation text.
And step five, merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts.
And step six, forming a first negative example image-text pair by the candidate image and each negative example text.
And secondly, acquiring a plurality of negative example images corresponding to the candidate annotation texts to obtain a plurality of second negative example image-text pairs.
In some embodiments, the second negative example teletext pair may be obtained by:
step one, respectively determining the image-text similarity of the candidate annotation text and other images except the candidate image in the image-text pair training set.
Specifically, the graph-text matching model shown in fig. 3 may be used to determine the graph-text similarity.
And step two, acquiring x unmatched images according to the sequence of the image-text similarity from small to large.
That is, the x images that least match the candidate annotation text are taken as unmatched images.
It should be noted that, in order to make the number of images and texts consistent, the number of unmatched images and the number of unmatched texts should be the same.
And step three, randomly sampling pictures from another image-text pair data set to obtain y sampled pictures.
It should be noted that this other image-text pair data set and the image-text pair training set are two independent data sets.
It should be noted that, in order to keep the numbers of images and texts consistent, the number of sampled pictures should be the same as the number of rewritten texts.
And step four, combining the unmatched images and the sampling images into a plurality of negative example images.
And step five, forming a second negative example image-text pair from the candidate annotation text and each negative example image.
And thirdly, forming an extended training set by all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set.
By the embodiment, a large number of negative examples can be introduced on the basis of the image-text pair training set, so that the training set can be expanded without spending higher cost, the negative examples are easy to obtain, more variable scenes can be covered, and the training of a subsequent image-text matching model is facilitated.
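The negative-example expansion described above can be sketched roughly as follows; similarity, scene_rewrite and fluency are placeholders for the image-text matching model, the scene-graph rewriting step and the fluency scorer, and the counts x and y are illustrative.

```python
import random

def build_extended_pairs(candidate_image, candidate_text, all_texts, all_images,
                         extra_images, similarity, scene_rewrite, fluency, x=5, y=5):
    """Sketch of negative-example expansion for one candidate image-text pair."""
    # First negative example pairs: least-similar texts plus fluent scene rewrites.
    other_texts = [t for t in all_texts if t != candidate_text]
    unmatched_texts = sorted(other_texts, key=lambda t: similarity(candidate_image, t))[:x]
    rewrites = sorted(scene_rewrite(candidate_text), key=fluency, reverse=True)[:y]
    first_neg = [(candidate_image, t) for t in unmatched_texts + rewrites]

    # Second negative example pairs: least-similar images plus images randomly
    # sampled from an independent image-text data set (counts kept consistent).
    other_imgs = [i for i in all_images if i is not candidate_image]
    unmatched_imgs = sorted(other_imgs, key=lambda i: similarity(i, candidate_text))[:x]
    sampled_imgs = random.sample(extra_images, y)
    second_neg = [(i, candidate_text) for i in unmatched_imgs + sampled_imgs]

    return first_neg + second_neg
```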
The CAAN model provided by the embodiment of the application takes the triplet loss as its loss function during training; in order to enhance the representation capability of the model and give it a finer-grained image-text matching capability, MoCo contrastive learning is introduced on the basis of CAAN so that the model can learn from more negative examples.
Features learned with the MoCo-based unsupervised learning structure for ImageNet classification can rival supervised learning performance. As shown in fig. 5, the MoCo contrastive learning structure, inspired by NLP tasks, encodes the picture data into query vectors and key vectors respectively, i.e., a query q and a key queue k, where the queue contains a single positive sample and a plurality of negative samples. The feature representation is learned through a contrastive loss, and the main principle remains unchanged: during training, the similarity between each query vector and its corresponding key vector is increased as much as possible, while the similarity between the query vector and the key vectors of other pictures is reduced. MoCo encodes the data with two neural networks: an encoder and a momentum encoder. The encoder encodes the representation of the current instance, and the momentum encoder encodes the representations of a set of instances (including the current one). For the current instance, the agreement between its encoder output and its own momentum-encoder output is maximized, while the agreement with the momentum-encoder outputs of the other instances is minimized. In fig. 5, x_query represents the query vector, x_key^i represents the key vectors, encoder represents the encoder, momentum encoder represents the momentum encoder, similarity represents the similarity, and contrastive loss represents the contrastive loss.
The image-text matching model provided by the embodiment of the application is trained by utilizing an extended training set. Before training begins, a momentum image-text matching model, namely a secondary model of the image-text matching model, is established according to the image-text matching model and a preset proportionality coefficient.
The momentum image-text matching model comprises a momentum image coding module and a momentum text coding module.
In some embodiments, the momentum image-text matching model may be established by equation (4):

CAAN_m = m · CAAN_m + (1 − m) · CAAN    (4)

In equation (4), CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
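In code, equation (4) corresponds to a standard momentum (exponential moving average) parameter update; the sketch below assumes the CAAN model and its momentum copy are PyTorch modules with identical structure.

```python
import torch

@torch.no_grad()
def momentum_update(caan, caan_m, m=0.999):
    """CAAN_m = m * CAAN_m + (1 - m) * CAAN, applied parameter-wise."""
    for p, p_m in zip(caan.parameters(), caan_m.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

# Before training starts, the momentum model is initialised as a copy of the
# CAAN model, e.g. caan_m = copy.deepcopy(caan), and is never updated by backprop.
```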
Fig. 6 schematically illustrates the training flow of the image-text matching model provided by the embodiment of the present application. As shown in fig. 6, the image-text matching model provided by the embodiment of the present application may be trained through the following steps:
First, the training image is input into the image-text matching model to obtain the training image feature vector Z_j^I output by the image coding module in the image-text matching model.
The training image is the image in any training image-text pair in the extended training set, and the image-text matching model further comprises a text coding module. CAAN^I(·) denotes the image coding module in the image-text matching model, and CAAN^T(·) denotes the text coding module.
Second, the training annotation text is input into the image-text matching model to obtain the training text vector Z_j^T output by the text coding module in the image-text matching model.
The training annotation text is the positive example annotation text corresponding to the training image.
Third, the training image is input into the momentum image-text matching model to obtain the positive example image feature vector P_j^I output by the momentum image coding module.
In fig. 6, CAAN_m^I(·) denotes the momentum image coding module in the momentum image-text matching model, CAAN_m^T(·) denotes the momentum text coding module, and Push indicates that the coding vectors obtained from the current batch are pushed into the corresponding coding queue.
Fourth, each negative example training image corresponding to the training annotation text is input into the momentum image-text matching model to obtain a plurality of negative example image feature vectors N_t^I output by the momentum image coding module.
Fifth, the positive example image feature vector P_j^I and the negative example image feature vectors N_t^I are merged into the momentum image feature set Q^I.
Sixth, the training annotation text is input into the momentum image-text matching model to obtain the positive example text vector P_j^T output by the momentum text coding module.
Seventh, each negative example text corresponding to the training image is input into the momentum image-text matching model to obtain a plurality of negative example text vectors N_t^T output by the momentum text coding module.
Eighth, the positive example text vector P_j^T and the negative example text vectors N_t^T are merged into the momentum text vector set Q^T.
Ninth, the first contrast loss L_I2T is determined according to the training image feature vector Z_j^I and each vector in the momentum text vector set Q^T.
In some embodiments, the first contrast loss L_I2T may be determined by equation (5):

L_I2T = −(1/J) Σ_{j=1}^{J} log [ exp(Z_j^I · P_j^T / τ) / ( exp(Z_j^I · P_j^T / τ) + Σ_{N_t^T ∈ N^T} exp(Z_j^I · N_t^T / τ) ) ]    (5)

In equation (5), L_I2T is the first contrast loss, Z_j^I is the training image feature vector, P_j^T is the positive example text vector, N^T is the set of all negative example text vectors in the momentum text vector set, j indexes the j-th pair of data, J is the total number of data pairs, τ is the temperature parameter, and Q^T is the momentum text vector set.
Tenth, the second contrast loss L_T2I is determined according to the training text vector Z_j^T and each vector in the momentum image feature set Q^I.
In some embodiments, the second contrast loss L_T2I may be determined by equation (6):

L_T2I = −(1/J) Σ_{j=1}^{J} log [ exp(Z_j^T · P_j^I / τ) / ( exp(Z_j^T · P_j^I / τ) + Σ_{N_t^I ∈ N^I} exp(Z_j^T · N_t^I / τ) ) ]    (6)

In equation (6), L_T2I is the second contrast loss, Z_j^T is the training text vector, P_j^I is the positive example image feature vector, N^I is the set of all negative example image feature vectors in the momentum image feature set, j indexes the j-th pair of data, J is the total number of data pairs, τ is the temperature parameter, and Q^I is the momentum image feature set.
Eleventh, the sum of the first contrast loss L_I2T and the second contrast loss L_T2I is determined as the total loss L.
Twelfth, the parameters of the image-text matching model are adjusted according to the total loss L.
Through this embodiment, the CAAN model trained with the MOCO contrastive learning method learns from a large number of easily obtained, informative negative examples, which substantially improves its representation capability and the accuracy of its image-text matching judgments.
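The following is a minimal PyTorch-style sketch of the momentum contrastive training described in the steps above, not the patent's actual implementation. The encoder modules, queue tensors, and hyper-parameter values (m, τ, queue size) are illustrative assumptions; only the overall flow (the momentum EMA update of the secondary model, the momentum sets Q^I and Q^T built from the positives plus queued negatives, and the two contrast losses L_I2T and L_T2I summed into the total loss L) follows the description above.

import torch
import torch.nn.functional as F

def momentum_update(model, momentum_model, m=0.999):
    # CAAN_m = m * CAAN_m + (1 - m) * CAAN: EMA update of the secondary (momentum) model
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def contrast_losses(img_enc, txt_enc, img_enc_m, txt_enc_m,
                    images, texts, img_queue, txt_queue, tau=0.07):
    # Z_j^I and Z_j^T: features from the trainable encoders
    z_i = F.normalize(img_enc(images), dim=-1)            # (B, D)
    z_t = F.normalize(txt_enc(texts), dim=-1)             # (B, D)
    with torch.no_grad():
        # P_j^I and P_j^T: positive features from the momentum encoders
        p_i = F.normalize(img_enc_m(images), dim=-1)      # (B, D)
        p_t = F.normalize(txt_enc_m(texts), dim=-1)       # (B, D)
    # Q^I and Q^T: positives plus the queued negatives N^I and N^T
    q_i = torch.cat([p_i, img_queue], dim=0)              # (B + K, D)
    q_t = torch.cat([p_t, txt_queue], dim=0)              # (B + K, D)
    # L_I2T: image features scored against the momentum text set
    logits_i2t = z_i @ q_t.t() / tau                      # (B, B + K)
    # L_T2I: text features scored against the momentum image set
    logits_t2i = z_t @ q_i.t() / tau
    # the positive for the j-th sample is the j-th momentum feature
    labels = torch.arange(z_i.size(0), device=z_i.device)
    loss_i2t = F.cross_entropy(logits_i2t, labels)
    loss_t2i = F.cross_entropy(logits_t2i, labels)
    return loss_i2t + loss_t2i, p_i, p_t                  # total loss L and features to push into the queues

In use, after each batch the returned p_i and p_t would be pushed into img_queue and txt_queue (the Push operation in FIG. 6, with the oldest entries dequeued), the total loss would be back-propagated through the trainable encoders only, and momentum_update would then refresh the momentum encoders.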
2033: And determining the CIDEr of the candidate predicted text and the candidate annotation text.
2034: And obtaining the current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, the preset model hyper-parameter and the CIDEr.
In some embodiments, the current reward value that can be obtained by the image description of the intermediate model can be determined by formula (7):

reward = CIDEr + λ·S(I, T)    formula (7)

In formula (7), reward is the current reward value obtained by the image description of the intermediate model, CIDEr is the text similarity, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
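As a small illustration of formula (7), the combined reward could be computed as below. The cider_score and caan_similarity callables and the default value of lam are placeholders standing in for the CIDEr metric, the pre-trained CAAN image-text similarity, and the preset hyper-parameter λ; the patent does not fix concrete implementations or values for them.

def current_reward(pred_text, ref_texts, image, cider_score, caan_similarity, lam=0.2):
    # reward = CIDEr + lambda * S(I, T), as in formula (7); lam stands for the preset hyper-parameter
    cider = cider_score(pred_text, ref_texts)         # text similarity between prediction and annotations
    similarity = caan_similarity(image, pred_text)    # image-text similarity S(I, T) from the CAAN model
    return cider + lam * similarity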
2035: and acquiring a parameter updating gradient of the intermediate model according to the current reward value.
Specifically, a parameter update gradient of the intermediate model can be acquired from the current reward value by using a reinforcement learning method, such as the SCST (self-critical sequence training) method.
The reinforcement learning method provided in the embodiments of the present application is explained below.
In the reinforcement learning method, the LSTMs (each LSTM layer) in the part of the algorithm framework responsible for description generation serve as the Agent; the input image, the generated words, the vocabulary, and everything else outside the Agent serve as the Environment, which interacts with the Agent; the cell states and hidden states of the LSTMs, the attention weights, and so on serve as the State; generating the next word is the Action; all the network parameters θ are regarded as the policy p_θ, which determines how to act (which word to generate next) according to the State; when the end-of-sentence token EOS is generated, the Agent obtains a current reward value, calculated by comparing the generated sentence with the annotated ground truth using the CIDEr score and/or a combination with other metrics.
The reinforcement learning algorithm adopts a policy gradient method, and the training objective is to maximize the expected reward, i.e., to minimize the negative of the expected reward, which can be expressed by formula (8):

L(θ) = -E_{w^s ~ p_θ}[ r(w^s) ]    formula (8)

In practical application, formula (8) can be approximated with a single sample, giving formula (9):

L(θ) ≈ -r(w^s), w^s ~ p_θ    formula (9)
In formula (8) and formula (9), w^s = (w_1^s, …, w_T^s) is the sampled sentence, where w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; L(θ) is the loss function; p_θ is the probability of generating a word when the model parameters are θ; r(w^s) is the reward of the sampled sentence; and E_{w^s ~ p_θ}[r(w^s)] is the expected reward.
The policy gradient algorithm optimizes this objective by computing the gradient of the expected reward, as in formula (10):

∇_θ L(θ) = -E_{w^s ~ p_θ}[ r(w^s) ∇_θ log p_θ(w^s) ]    formula (10)

In formula (10), ∇_θ L(θ) is the loss gradient; w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; p_θ is the probability of generating a word when the model parameters are θ; r(w^s) is the reward of the sampled sentence; E_{w^s ~ p_θ}[·] is the expectation over sampled sentences; and ∇_θ denotes the gradient with respect to θ.
In actual computation the expectation cannot be calculated directly, so it is estimated with Monte Carlo sampling; for each training sample in a batch, the gradient is computed by formula (11):

∇_θ L(θ) ≈ -r(w^s) ∇_θ log p_θ(w^s)    formula (11)

In formula (11), ∇_θ L(θ) is the loss gradient, r(w^s) is the reward of the sampled sentence, ∇_θ denotes the gradient with respect to θ, p_θ is the probability of generating a word when the model parameters are θ, and w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T.
The above formula causes a problem: when a sample w^s is drawn and r(w^s) is positive, pushing the gradient toward the optimal direction also increases p_θ(w^s), so this sample becomes more likely to be sampled in the future while other samples become less likely, merely because they were not sampled at first. This is clearly unfair and may prevent the algorithm from finding an optimal solution. A commonly used remedy is therefore to add a baseline term b, as expressed by formula (12):

∇_θ L(θ) = -E_{w^s ~ p_θ}[ (r(w^s) - b) ∇_θ log p_θ(w^s) ]    formula (12)

b may be any function, as long as it does not change the expected reward gradient. To satisfy this condition, b must be independent of the action, as shown by the derivation in formula (13):

E_{w^s ~ p_θ}[ b ∇_θ log p_θ(w^s) ] = b Σ_{w^s} ∇_θ p_θ(w^s) = b ∇_θ Σ_{w^s} p_θ(w^s) = b ∇_θ 1 = 0    formula (13)

Adding b does not change the expected reward gradient, but it can reduce the variance of the gradient estimate. For each sample, the expected reward gradient can be approximated by formula (14):

∇_θ L(θ) ≈ -(r(w^s) - b) ∇_θ log p_θ(w^s)    formula (14)
In formula (12), formula (13) and formula (14), ∇_θ L(θ) is the loss gradient; w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; p_θ is the probability of generating a word when the model parameters are θ; r(w^s) is the reward of the sampled sentence; E_{w^s ~ p_θ}[·] is the expectation over sampled sentences; ∇_θ denotes the gradient with respect to θ; and b is an arbitrary, action-independent function.
The chain rule described by formula (15) is then applied:

∇_θ L(θ) = Σ_{t=1}^{T} (∂L(θ)/∂s_t)(∂s_t/∂θ)    formula (15)

In formula (15), ∂L(θ)/∂s_t can be expressed by formula (16):

∂L(θ)/∂s_t ≈ (r(w^s) - b)(p_θ(w_t | h_t) - 1_{w_t^s})    formula (16)
Since the core idea of SCST is to use the reward obtained by the currently trained model under its test-time (greedy) inference as the baseline, formula (16) becomes formula (17):

∂L(θ)/∂s_t ≈ (r(w^s) - r(ŵ))(p_θ(w_t | h_t) - 1_{w_t^s})    formula (17)
In formula (15), formula (16) and formula (17), r(ŵ) is the reward obtained by the caption that the currently trained model predicts at test time; s_t is the input to the softmax at time step t; ∂L(θ)/∂s_t is the gradient of the loss with respect to s_t; L(θ) is the loss function; θ denotes the model parameters; w_t is the word generated at the current time step by greedy decoding; h_t is the hidden state of the LSTM network; 1_{w_t^s} is the one-hot indicator of the sampled word w_t^s; w_t^s is the word sampled from the model at time step t, with 1 ≤ t ≤ T; p_θ is the probability of generating a word when the model parameters are θ; and r(w^s) is the reward of the sampled sentence.
If the reward of a sentence sampled from the current model is higher than r(ŵ), the probability of sampling that sentence will increase; otherwise it will decrease.
At test time the current model predicts with a greedy search strategy, which is expressed by formula (18):

ŵ_t = argmax_{w_t} p(w_t | h_t)    formula (18)

In formula (18), ŵ_t is the word generated at the current time step by greedy decoding, argmax selects the word with the maximum probability, p denotes the probability, and h_t is the hidden state of the LSTM network.
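Formulas (8) through (18) boil down to a simple training step: sample a caption, decode a greedy caption as the baseline, and weight the sampled caption's log-probability by the reward difference. The sketch below assumes a caption model exposing sample() (multinomial sampling that also returns per-token log-probabilities) and greedy_decode() methods, plus a reward_fn hook returning a reward tensor; these interfaces are illustrative assumptions, not the patent's API.

import torch

def scst_step(model, images, reward_fn, optimizer):
    # One SCST update: the baseline is the reward of the greedy (test-time) caption.
    model.train()
    # w^s ~ p_theta: sampled captions and their per-token log-probabilities
    sampled_caps, log_probs = model.sample(images)         # log_probs: (B, T)
    with torch.no_grad():
        # \hat{w}: greedy captions used as the baseline b = r(\hat{w})
        greedy_caps = model.greedy_decode(images)
        r_sample = reward_fn(sampled_caps)                 # r(w^s), e.g. CIDEr + lambda * S(I, T)
        r_greedy = reward_fn(greedy_caps)                  # r(\hat{w})
        advantage = r_sample - r_greedy                    # (B,)
    # L(theta) ≈ -(r(w^s) - r(\hat{w})) * log p_theta(w^s), as in formulas (14) and (17)
    loss = -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()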
2036: And adjusting the parameters of the intermediate model by using the parameter update gradient.
Through this embodiment, the image-text similarity calculated by the CAAN model pre-trained with the MOCO method is added to the reward optimized by the reinforcement learning method. This preserves the accuracy and recall of the generated sentences, allows them to describe both the general content of the picture and more of its details, and greatly improves how well the generated predicted text matches the picture.
To illustrate the embodiment of the present application more clearly, FIG. 7 exemplarily shows the overall flow corresponding to the training method for an image description model provided by the embodiment of the present application. As shown in FIG. 7, the image description model includes an image feature extraction module, a word embedding module and a description generation module. A candidate image is input into the image feature extraction module to obtain image feature vectors; the image feature vectors and the words produced by the word embedding module are input together into the description generation module to obtain a predicted text (for example, the predicted text is "two players play basketball on a stadium"). The text similarity CIDEr between the predicted text and the annotated ground-truth text (for example, the annotation text is "two players play basketball on a basketball court") is then determined. At the same time, the predicted text is encoded by the text coding module and input, together with the image feature vectors, into the CAAN model trained with the MOCO contrastive learning method (it should be noted that the structure of this CAAN model is identical to that of the image-text matching model shown in FIG. 3) to obtain the image-text similarity (Similarity in FIG. 7) between the candidate image and the predicted text. The image-text similarity and the text similarity CIDEr jointly determine the current reward value reward. Finally, the SCST module generates a parameter adjustment gradient for the image description model according to the current reward value, and the parameters of the image description model are adjusted accordingly. These steps are repeated until the parameters of the image description model converge, at which point training ends. In FIG. 7, FC0 and FC1 both denote fully connected layers.
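Putting the pieces together, the target training step of FIG. 7 can be outlined roughly as follows. This sketch reuses the hypothetical current_reward and scst_step helpers from the sketches above, and converged() stands in for the convergence check on the model parameters; all of these names are assumptions for illustration only, not the patent's implementation.

import torch

def finetune_intermediate_model(model, optimizer, pairs, lam,
                                cider_score, caan_similarity, converged):
    # Sentence-granularity fine-tuning loop corresponding to FIG. 7
    while not converged(model):
        for image, annotation in pairs:                    # candidate image-text pairs
            def reward_fn(captions):
                # reward = CIDEr + lambda * S(I, T) for each caption, as in formula (7)
                return torch.tensor([
                    current_reward(c, [annotation], image,
                                   cider_score, caan_similarity, lam)
                    for c in captions
                ])
            # one SCST update on this candidate image-text pair
            scst_step(model, image.unsqueeze(0), reward_fn, optimizer)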
According to the above embodiment, fine-tuning of the intermediate model at sentence granularity is achieved by performing the target training step on the intermediate model obtained after word-granularity training. The target training step comprises: inputting the candidate image into the intermediate model to obtain a candidate predicted text; determining the image-text similarity between the candidate image and the candidate predicted text; combining the CIDEr of the candidate predicted text and the candidate annotation text with a preset model hyper-parameter to obtain the current reward value obtainable by the image description of the intermediate model; obtaining a parameter update gradient according to the current reward value; and adjusting the model parameters accordingly. In the technical solution provided by the embodiment of the present application, a reinforcement learning method is adopted, and the image-text similarity obtained by the pre-trained CAAN model is added to the SCST reward. This alleviates the problem that the original SCST, optimized only by CIDEr, produces overly uniform descriptions, and using an image-text matching model that has learned from a large number of negative examples as a teaching model effectively improves the description generation capability of the image description model and makes the generated descriptions match the images better.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
FIG. 8 schematically illustrates a structural diagram of a training apparatus for an image description model according to an embodiment of the present application. As shown in FIG. 8, the apparatus has the function of implementing the training method of the image description model, and the function may be implemented by hardware, or may be implemented by hardware executing corresponding software. The apparatus may include: an image-text pair training set acquisition unit 801, a first training unit 802 and a second training unit 803.
An image-text pair training set acquisition unit 801 configured to: And acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image.
A first training unit 802 configured to: and performing word granularity training on the image description model by adopting the image-text pair training set to obtain an intermediate model.
A second training unit 803 configured to: performing a target training step on the intermediate model by using any one of the candidate image-text pairs in the training set until the model parameters of the intermediate model converge, where the candidate image-text pairs include candidate images and candidate annotation texts, and the second training unit 803 includes:
an image description subunit 8031 configured to: and inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation.
An image-text similarity determination subunit 8032 configured to: And determining the image-text similarity of the candidate image and the candidate predicted text.
A text similarity determination subunit 8033 configured to: And determining the CIDEr of the candidate predicted text and the candidate annotation text.
A current reward value obtaining subunit 8034 configured to: And obtaining the current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, the preset model hyper-parameter and the CIDEr.
A parameter update gradient acquisition subunit 8035 configured to: and acquiring a parameter updating gradient of the intermediate model according to the current reward value.
A parameter adjusting subunit 8036 configured to: and updating the parameters of the gradient adjustment intermediate model by using the parameters.
In some embodiments, determining the image-text similarity between the candidate image and the candidate predicted text specifically includes:
and performing text coding on the candidate predicted text to obtain a plurality of word vectors.
Inputting the candidate image and the plurality of word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model is trained on an extended training set using the MOCO learning method, and the extended training set is a data set obtained by performing negative-example extension of images and texts on the image-text pair training set.
In some embodiments, the augmented training set is determined by:
and acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs.
And acquiring a plurality of negative example images corresponding to the candidate annotation texts to obtain a plurality of second negative example image-text pairs.
All the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form an extended training set.
In some embodiments, obtaining a plurality of negative example texts corresponding to the candidate image to obtain a plurality of first negative example image-text pairs specifically includes:
and respectively determining the image-text similarity of the candidate image and other label texts except the candidate label texts in the image-text pair training set.
And acquiring x unmatched texts according to the sequence of the image-text similarity from small to large.
And carrying out scene rewriting on the candidate marked texts to obtain a plurality of rewritten texts.
And acquiring y passing rewriting texts according to the sequence of passing degrees from high to low.
And combining the unmatched text and the fluency rewriting text into a plurality of negative example texts.
The candidate image and each negative example text form a first negative example image-text pair.
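A minimal sketch of this negative-text construction is given below. The similarity_fn, rewrite_scenes and fluency_fn callables are placeholder hooks for the image-text similarity measure, the scene-rewriting step and the fluency scorer, since the patent does not prescribe concrete implementations for them.

def build_first_negative_pairs(candidate_image, candidate_text, all_texts,
                               x, y, similarity_fn, rewrite_scenes, fluency_fn):
    # x unmatched texts: other annotation texts, taken in ascending order of image-text similarity
    others = [t for t in all_texts if t != candidate_text]
    others.sort(key=lambda t: similarity_fn(candidate_image, t))
    unmatched = others[:x]
    # y fluent rewrites: scene-rewritten versions of the candidate annotation text,
    # taken in descending order of fluency
    rewrites = rewrite_scenes(candidate_text)
    rewrites.sort(key=fluency_fn, reverse=True)
    fluent = rewrites[:y]
    # each negative text forms a first negative example image-text pair with the candidate image
    negatives = unmatched + fluent
    return [(candidate_image, neg) for neg in negatives]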
In some embodiments, the image-text matching model is trained using the following MOCO learning method:
inputting the training image into the image-text matching model, and obtaining the training image characteristic vector output by the image coding module in the image-text matching model, wherein the training image is the image in any training image-text pair in the extended training set.
Inputting the training annotation text into the image-text matching model, and acquiring a training text vector output by the text coding module in the image-text matching model, wherein the training annotation text is the positive-example annotation text corresponding to the training image.
Inputting the training image into a momentum image-text matching model, and acquiring a normal image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient.
And inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module.
And combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set.
And inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by the momentum text coding module in the momentum image-text matching model.
And inputting each negative example text corresponding to the training image into the momentum image-text matching model to obtain a plurality of negative example text vectors output by the momentum text coding module.
And combining the positive example text vector and the negative example text vectors into a momentum text vector set.
A first contrast loss is determined based on the training image feature vectors and each vector in the set of momentum text vectors.
A second contrast loss is determined based on the training text vector and each vector in the momentum image feature set.
The sum of the first and second contrast losses is determined as the total loss.
And adjusting parameters of the image-text matching model according to the total loss.
In some embodiments, the momentum image-text matching model is established by the following formula:
CAAN_m = m·CAAN_m + (1 - m)·CAAN
wherein CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
In some embodiments, word granularity training is performed on the image description model by using the image-text pair training set to obtain an intermediate model, specifically:
And for any candidate image-text pair in the image-text pair training set, inputting the candidate image into the image description model for image description to obtain a predicted text.
And determining text loss of the predicted text and the candidate marked-up text.
And adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
In some embodiments, the candidate image is input to an intermediate model for image description to obtain a candidate predicted text, specifically:
and extracting image features of the candidate images to obtain image feature vectors.
And generating an image description text for the image feature vector to obtain a candidate prediction text.
In some embodiments, the current reward value obtained by the image description of the intermediate model is obtained according to the image-text similarity, the preset model hyper-parameter and the CIDEr, and specifically:
the current reward value obtained by the image description of the intermediate model is determined by the following formula:
reward=CIDEr+λS(I,T)
wherein reward is the current reward value obtained by the image description of the intermediate model, CIDEr is the text similarity, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
The technical solution provided by the embodiment of the present application has the following beneficial effects: for any candidate image in the image-text pair training set, the candidate image is first input into the intermediate model obtained after word-granularity training to obtain a candidate predicted text; the candidate image and the candidate predicted text are then input into the pre-trained image-text matching model to determine the image-text similarity; the CIDEr of the candidate predicted text and the candidate annotation text is combined with a preset model hyper-parameter to obtain the current reward value obtainable by the image description of the intermediate model; and a parameter update gradient is obtained according to the current reward value, thereby completing sentence-level fine-tuning of the intermediate model. In the technical solution provided by the embodiment of the present application, the image-text similarity and the CIDEr jointly serve as the reference standard for the image description model, so that the trained image description model can generate predicted description texts that match the actual images more closely, improving the prediction precision of the image description model.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A training method for an image description model, the training method comprising:
acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the target training step comprises the following steps:
inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
determining the image-text similarity of the candidate image and the candidate predicted text;
determining the CIDEr of the candidate prediction text and the candidate annotation text;
obtaining a current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, a preset model hyper-parameter and the CIDEr;
acquiring a parameter updating gradient of the intermediate model according to the current reward value;
and adjusting the parameters of the intermediate model by using the parameter updating gradient.
2. The training method of claim 1, wherein the determining the image-text similarity of the candidate image and the candidate predicted text comprises:
performing text coding on the candidate predicted text to obtain a plurality of word vectors;
inputting the candidate image and the word vectors into a pre-constructed image-text matching model to obtain the image-text similarity of the candidate image and the candidate predicted text, wherein the image-text matching model completes training by using an extended training set and adopting an MOCO learning method, and the extended training set is a data set obtained by carrying out negative example extension on the image-text pair training set.
3. Training method according to claim 2, wherein the extended training set is determined by:
acquiring a plurality of negative example texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs;
obtaining a plurality of negative example images corresponding to the candidate annotation text to obtain a plurality of second negative example image-text pairs;
and all the first negative example image-text pairs, all the second negative example image-text pairs and the image-text pair training set jointly form the extended training set.
4. The training method of claim 3, wherein the obtaining a plurality of negative examples texts corresponding to the candidate images to obtain a plurality of first negative example image-text pairs comprises:
respectively determining the image-text similarity of the candidate image and other annotation texts in the image-text pair training set except the candidate annotation text;
acquiring x unmatched texts in ascending order of the image-text similarity;
performing scene rewriting on the candidate annotation text to obtain a plurality of rewritten texts;
acquiring y fluent rewritten texts in descending order of fluency;
merging the unmatched texts and the fluent rewritten texts into a plurality of negative example texts;
the candidate image and each negative example text form a first negative example image-text pair.
5. A training method as claimed in claim 3, wherein the image-text matching model is trained using the following MOCO learning method:
inputting a training image into the image-text matching model to obtain a training image feature vector output by an image coding module in the image-text matching model, wherein the training image is an image in any training image-text pair in the extended training set;
inputting a training annotation text into the image-text matching model, and acquiring a training text vector output by a text coding module in the image-text matching model, wherein the training annotation text is the positive-example annotation text corresponding to the training image;
inputting the training image into a momentum image-text matching model to obtain a positive example image feature vector output by a momentum image coding module in the momentum image-text matching model, wherein the momentum image-text matching model is a secondary model of the image-text matching model and is established according to the image-text matching model and a preset proportionality coefficient;
inputting each negative example training image corresponding to the training annotation text into the momentum image-text matching model, and acquiring a plurality of negative example image feature vectors output by the momentum image coding module;
combining the positive example image feature vector and the negative example image feature vector into a momentum image feature set;
inputting the training annotation text into the momentum image-text matching model, and acquiring a positive example text vector output by a momentum text coding module in the momentum image-text matching model;
inputting each negative example text corresponding to the training image into the momentum image-text matching model, and acquiring a plurality of negative example text vectors output by the momentum text coding module;
merging the positive example text vector and the negative example text vectors into a momentum text vector set;
determining a first contrast loss according to the training image feature vector and each vector in the momentum text vector set;
determining a second contrast loss according to the training text vector and each vector in the momentum image feature set;
determining a sum of the first and second comparison losses as a total loss;
and adjusting parameters of the image-text matching model according to the total loss.
6. The training method of claim 5, wherein the momentum image-text matching model is established by the following formula:
CAAN_m = m·CAAN_m + (1 - m)·CAAN
wherein CAAN_m is the momentum image-text matching model, m is the preset proportionality coefficient, and CAAN is the image-text matching model.
7. The training method of claim 1, wherein performing word granularity training on the image description model using the image-text pair training set to obtain an intermediate model comprises:
inputting a candidate image into an image description model for image description aiming at any candidate image-text pair in the image-text pair training set to obtain a predicted text;
determining text loss of the predicted text and the candidate annotation text;
and adjusting parameters of the image description model according to the text loss, and repeating the training process until the parameters of the image description model are converged.
8. The training method of claim 1, wherein the inputting the candidate images into the intermediate model for image description to obtain candidate predicted texts comprises:
extracting image features of the candidate images to obtain image feature vectors;
and generating an image description text for the image feature vector to obtain a candidate prediction text.
9. The training method according to claim 1, wherein obtaining the current reward value obtained by the image description of the intermediate model according to the image-text similarity, a preset model hyper-parameter and the CIDEr comprises:
determining the current reward value obtainable by the image description of the intermediate model by the following formula:
reward=CIDEr+λS(I,T)
wherein reward is the current reward value obtained by the image description of the intermediate model, CIDEr is the text similarity, λ is the preset model hyper-parameter, and S(I, T) is the image-text similarity.
10. An apparatus for training an image description model, the apparatus comprising:
an image-text pair training set acquisition unit configured to: acquiring an image-text pair training set, wherein the image-text pair training set comprises a plurality of image-text pairs, and each image-text pair comprises an image and an annotation text for describing the content of the image;
a first training unit configured to: performing word granularity training on the image description model by using the image-text pair training set to obtain an intermediate model;
a second training unit configured to: performing a target training step on the intermediate model by using any candidate image-text pair in the image-text pair training set until model parameters of the intermediate model converge, wherein the candidate image-text pair comprises a candidate image and a candidate annotation text, and the second training unit comprises:
an image description subunit configured to: inputting the candidate image into the intermediate model for image description to obtain a candidate prediction text, wherein the image description comprises image feature extraction and image description text generation;
an image-text similarity determination subunit configured to: determining the image-text similarity of the candidate image and the candidate predicted text;
a text similarity determination subunit configured to: determining CIDER of the candidate prediction text and the candidate annotation text;
a current reward value obtaining subunit configured to: obtaining a current reward value which can be obtained by the image description of the intermediate model according to the image-text similarity, a preset model hyper-parameter and the CIDEr;
a parameter update gradient acquisition subunit configured to: acquiring a parameter updating gradient of the intermediate model according to the current reward value;
a parameter adjustment subunit configured to: and adjusting the parameters of the intermediate model by using the parameter updating gradient.
CN202111341668.9A 2021-11-12 2021-11-12 Training method and training device for image description model Pending CN114090815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111341668.9A CN114090815A (en) 2021-11-12 2021-11-12 Training method and training device for image description model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111341668.9A CN114090815A (en) 2021-11-12 2021-11-12 Training method and training device for image description model

Publications (1)

Publication Number Publication Date
CN114090815A true CN114090815A (en) 2022-02-25

Family

ID=80300434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111341668.9A Pending CN114090815A (en) 2021-11-12 2021-11-12 Training method and training device for image description model

Country Status (1)

Country Link
CN (1) CN114090815A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647789A (en) * 2022-03-31 2022-06-21 腾讯科技(深圳)有限公司 Method for determining recommendation model and related device
CN114818654A (en) * 2022-05-10 2022-07-29 中国工商银行股份有限公司 Text processing method, text feature extraction method, device, equipment and medium
CN114973226A (en) * 2022-05-13 2022-08-30 上海大学 Training method for text recognition system in natural scene of self-supervision contrast learning
CN114973226B (en) * 2022-05-13 2024-09-24 上海大学 Training method for text recognition system in self-supervision contrast learning natural scene
CN115035304A (en) * 2022-05-31 2022-09-09 中国科学院计算技术研究所 Image description generation method and system based on course learning
CN115525281A (en) * 2022-10-12 2022-12-27 广州宏天软件股份有限公司 Form interactive graph display and selection method
CN115525281B (en) * 2022-10-12 2023-06-27 广州宏天软件股份有限公司 Form interactive graph display and selection method
WO2024187949A1 (en) * 2023-03-15 2024-09-19 华为技术有限公司 Image description generation method and electronic device
CN117727044A (en) * 2023-03-29 2024-03-19 书行科技(北京)有限公司 Training method, device, equipment and storage medium of attribute identification model
CN116108156B (en) * 2023-04-07 2023-06-09 四川大学 Topic law retrieval method based on cyclic association robust learning
CN116108156A (en) * 2023-04-07 2023-05-12 四川大学 Topic law retrieval method based on cyclic association robust learning
CN116580283B (en) * 2023-07-13 2023-09-26 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN117235534A (en) * 2023-11-13 2023-12-15 支付宝(杭州)信息技术有限公司 Method and device for training content understanding model and content generating model
CN117235534B (en) * 2023-11-13 2024-02-20 支付宝(杭州)信息技术有限公司 Method and device for training content understanding model and content generating model

Similar Documents

Publication Publication Date Title
CN114090815A (en) Training method and training device for image description model
CN112487182B (en) Training method of text processing model, text processing method and device
CN107133211B (en) Composition scoring method based on attention mechanism
Li et al. Visual question generation as dual task of visual question answering
CN109918510B (en) Cross-domain keyword extraction method
CN111858931B (en) Text generation method based on deep learning
CN110737801A (en) Content classification method and device, computer equipment and storage medium
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN110489567B (en) Node information acquisition method and device based on cross-network feature mapping
CN109919221B (en) Image description method based on bidirectional double-attention machine
Cascianelli et al. Full-GRU natural language video description for service robotics applications
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
Mohamad Nezami et al. Towards generating stylized image captions via adversarial training
CN110807069B (en) Entity relationship joint extraction model construction method based on reinforcement learning algorithm
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN113220891B (en) Method for generating confrontation network image description based on unsupervised concept-to-sentence
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN114398976A (en) Machine reading understanding method based on BERT and gate control type attention enhancement network
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN113822125A (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN112527993A (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112116685A (en) Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism
Zhu et al. Multiscale temporal network for continuous sign language recognition
Guo et al. Matching visual features to hierarchical semantic topics for image paragraph captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination