CN111708904A

CN111708904A - Few-sample visual story narration method based on theme adaptation and prototype coding

Info

Publication number: CN111708904A
Application number: CN202010857191.9A
Authority: CN
Inventors: 庄越挺; 浦世亮; 汤斯亮; 李嘉成; 吴飞; 肖俊; 李玺; 张世峰; 任文奇; 陆展鸿
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-08-24
Filing date: 2020-08-24
Publication date: 2020-09-25

Abstract

The invention discloses a few-sample visual story narration method based on theme adaptation and prototype coding. The method comprises the steps that firstly, a data set is divided according to themes, a batch of themes are sampled in each training turn, and each theme is divided into a support set and a query set; extracting time sequence visual semantic features and image sequence features from samples in a query set, and calculating prototype vectors by combining story features and image sequence features extracted in advance in a support set; and combining the prototype vector with the image sequence characteristics, and further decoding to obtain a story description text. The overall visual story model further optimizes initial parameters of the model according to the synthetic loss calculated on the query set through a meta-learning method. In the guessing stage, the model adjusts parameters through a few samples, and story text is generated according to a new image sequence. The invention combines prototype coding and meta-learning methods, and the constructed model has the capability of quickly adapting to the theme and can better generate the story text description which accords with the theme of the image sequence.

Description

Few-sample visual story narration method based on theme adaptation and prototype coding

Technical Field

The invention relates to visual languages, in particular to a few-sample visual story narration method based on theme adaptation and prototype coding.

Background

Visual and Language (Vision and Language) is an interdisciplinary discipline that integrates computer Vision with natural Language processing. With the great breakthrough brought by the deep learning technology in two fields, the cross-modal tasks such as image summarization, image question answering, image retrieval and the like generate profound results. Recently, researchers have further begun exploring Visual Storytelling tasks (Visual Storytelling) for generating narrative stories from image sequences.

In a visual narration task, given a sequence of images with contextual associations, the model is asked to output a story described in natural language with a narrative style. The task characteristics of the visual story narration task require that the model not only can correctly identify objects and attributes thereof in the images, but also can fully understand the association among a plurality of images, mine the implicit information in the image sequence from the aspects of time and space, and carry out proper inference according to the change of visual contents, thereby finally generating a coherent and smooth narrative story. Visual narrative techniques can generate descriptions for sequences of images taken by users for quick sharing to social media or for later retrieval. As a more complex cross-modal task, visual narration may also reflect the level of the intelligence's ability to understand image sequences and organize natural language.

The mainstream visual narration model is inspired by an image abstract model at present, adopts a hierarchical encoder-decoder framework and is trained based on a supervised learning mode. Much of the previous work has focused on designing complex model structures, which typically require large amounts of manually annotated data. However, the labeling of visual narration tasks is expensive and complex, and therefore cannot annotate large amounts of new data, which becomes a bottleneck for supervised learning methods. On the other hand, previous work studies on topic models have shown that topics in the real world generally follow a long tail distribution, which means that there are many new topics not covered in the training data set in the practical application scenario, and the number of samples of these new topics is rare. Therefore, the traditional supervision model is not suitable for a new theme with rare samples, and the visual story narration under the scene with few samples is considered to be closer to the application scene in real life.

Disclosure of Invention

The invention aims to provide a few-sample visual story-narrating method based on theme adaptation and prototype coding, aiming at the problems that themes are distributed in a long tail form in a visual story-narrating task, and the number of new theme samples is rare and is not suitable for a traditional supervision model.

In order to achieve the above purpose, the invention specifically adopts the following technical scheme:

a method of small sample visual narration based on theme adaptation and prototype coding comprising the steps of:

s1: dividing a visual story data set according to themes, sampling a batch of themes in each training turn, and dividing each theme into a support set and a query set;

s2: respectively coding story texts and image sequences in the vision story samples supporting concentration used for training into story characteristics and image sequence characteristics, and storing for later use;

s3: extracting time sequence visual semantic features and image sequence features from the image sequence in the query stage, and calculating to obtain a prototype vector by combining story features and image sequence features of the support set in S2;

s4: decoding the combined features of the image sequence features and the prototype vectors obtained in the S3 into a story descriptive text through a story decoder with an attention mechanism;

s5: optimizing initial parameters of the visual narration model by using the comprehensive loss of the visual narration model constructed by taking S2-S4 as a framework on a query set through a meta-learning method;

s6: in the guessing stage, few samples are learned according to the support set of the new theme to adjust parameters of the visual narration model, and then the visual narration model with the adjusted parameters is used for generating the narrative description text for the samples in the query set.

Based on the technical scheme, the steps of the invention can be further realized by adopting the following specific mode.

Preferably, the specific method of S1 is as follows:

s11: dividing the visual story data set according to subjects, and training and sampling each roundNA theme and sample 2 from each themeKA sample of visual stories whereinKOne as support set for less sample training, the restKAnd the query set is used for verifying the learning effect of the few samples.

Preferably, the specific sub-steps of S2 are as follows:

s21: story features are extracted from story text after all samples in a support set pass through a word embedding layer by using a text encoder based on a gated cyclic unitS _spt={s₁,…,s_K}；

S22: extracting image sequence characteristics from all image sequences in the support set by using a convolutional neural network and a visual semantic encoder to obtainImage sequence feature set

Features of each image sequence

Semantic information characterizing an image sequence.

Further, in the S22, for each image sequence in the support setA _i={a ₁,…,a _m}，a _jIs shown asjThe number of images is one of,mfor the length of the image sequence, the convolutional neural network extracts each image in the image sequencea _jIs characterized in thatf _jObtaining a set of image characteristics corresponding to the image sequenceF _I= {f ₁,…,f _mAnd will be assembledF _IEach feature is sent to a visual semantic encoder based on a gating cycle unit in sequence to obtain the time sequence visual semantic features of the image sequenceV={v₁,…,v_mIn which v is_jPresentation processing support set-time gated loop unitjThe hidden state of the time is taken as the visual semantic feature v of the last time of the gating cycle unit_mAs a feature of the image sequence characterizing the image sequence

。

Further, the specific sub-steps of S3 are as follows:

s31: for each sample in the query set, the same convolutional neural network and visual semantic encoder as in S2 are used to extract the time-sequential visual semantic features of the image sequence in the sampleV _qry={v'₁,…,v'_mV 'therein'_jIndicating a time-gated loop unit for processing a set of queriesjTaking the visual semantic feature v 'of the last moment of the gating cycle unit in the hidden state of the moment'_mAs a representation of the image sequenceImage sequence feature of

；

S32: a story prototype vector is further calculated by an attention mechanism, incorporating the story features and image sequence features of the support set described in S2:

wherein,proto∈R ^dk1×a prototype vector is represented by a vector of a prototype,d _krepresenting the degree of the feature, softmax (·) represents the softmax function,

the superscript T of (a) denotes transpose.

Further, the specific sub-steps of S4 are as follows:

s41: splicing the prototype vector with the image sequence features for initializing the hidden layer state of the gating circulation unit of the story decoderh ₀；

S42: according to the hidden layer state of the last moment of the gate control circulation unith _t-1And words predicted at the previous timewPredicting the current timetHidden layer state ofh _t；

S43: by means of attention, computingtVisual context characteristics at the moment:

wherein,c _tto representtVisual context characteristics of the moment;

s43: by usingtTemporal visual context features and gated cyclic unit hidden state predictiontWord probability distribution at time:

wherein,p _wtrepresenting predictionstThe probability distribution of the words at the time of day,W _proj∈R ^{dk dk2×}andb _proj∈R ^dk1×respectively, a mapping matrix and a bias coefficient obtained by learning.

Further, the specific sub-steps of S5 are as follows:

s51: constructing a visual story narration model by taking S2-S4 as a framework, and regarding the samples in S11NEach of the themes

Adjusting parameters by using a gradient descent method to obtain a set of model parameters which are adjusted according to each theme and correspond to each theme;

s52: by minimizingNThe comprehensive loss of each subject on the query set further optimizes the initial parameters of the modelθ。

Further, in S51, the formula for calculating the adjusted model parameters by using the gradient descent method is as follows:

wherein,θ _i 'indicating the initial parameter is iniNew parameters obtained after adjustment on the subjects,θthe initial parameters of the model are represented,f _θexpressed in the initial parametersθThe model of the following model is shown,

is as followsiA model loss calculated on each topic, the loss being obtained by calculating the cross entropy of the word distribution and the true distribution,αin order to update the learning rate of the parameters,

representation of parametersθDerivation is carried out;

in the step S52, the initial parameters of the model are further optimizedθThe overall loss function used is:

wherein, E [. C]The display of the user can be expected to be,

in order to distribute the subject matter of all the subjects,

representing topics sampled from all topics.

Further, the specific sub-steps of S6 are as follows:

s61: in the stage of model speculation, parameters are adjusted according to the support set of the new theme by using the gradient descent method described in S51, so that the parameters of the visual narration model are quickly adapted to the new theme, and new parameters of the model after being adapted to the new theme are obtainedθ'；

S62: using with new parametersθ'Visual story narration modelf _θ'Story description text is generated for the sequence of images at the guessing stage.

Preferably, the text encoder, the visual semantic encoder and the story decoder are all a gate-controlled cyclic unit-based cyclic neural network.

Compared with the prior art, the invention has the beneficial effects that:

1. compared with a visual story narration method based on supervised learning, the visual story narration method based on meta-learning has better theme generalization capability. According to the invention, the model parameters can be rapidly adjusted through a small number of samples of the new theme, so that the visual story description which is more consistent with the theme is generated, the dependence of the model on the number of samples of the new theme is reduced, and the method is more suitable for practical application scenes.

2. According to the invention, a small amount of training samples of the new theme are coded into prototypes and are provided for the visual narration model as reference in the guessing stage, so that the model can fully capture the visual characteristics and language style of the new theme, and the generated story description has better relevance and expressiveness.

Drawings

Fig. 1 is a flow diagram of a method of small sample visual narration based on topic adaptation and prototype coding.

Detailed Description

The invention will be further elucidated and described with reference to the drawings and the detailed description.

In a preferred embodiment of the present invention, as shown in fig. 1, a sample-less visual narration method based on theme adaptation and prototype coding is provided. The basic concept of the invention is that firstly, a data set is divided according to subjects, a batch of subjects are sampled in each training round, and each subject is divided into a support set and a query set; extracting time sequence visual semantic features and image sequence features from samples in a query set, and calculating prototype vectors by combining story features and image sequence features extracted in advance in a support set; and combining the prototype vector with the image sequence characteristics, and further decoding to obtain a story description text. The overall visual story model further optimizes initial parameters of the model according to the synthetic loss calculated on the query set through a meta-learning method. In the guessing stage, the model adjusts parameters through a few samples, and story text is generated according to a new image sequence. The whole framework of the invention can be divided into a prototype coding part and a visual context coding part, wherein a text coder, a visual semantic coder and a story decoder are all a gate-controlled cyclic unit (GRU) -based cyclic neural network.

The following is a description of specific steps of the few-sample visual narration method based on theme adaptation and prototype coding, and the specific steps are as follows:

s1: the visual story data set is divided according to themes, a batch of themes are sampled in each training turn, and each theme is divided into a support set and a query set.

In this embodiment, the specific partitioning method is as follows:

s11: the visual story data set is divided by subject, eachWheel training samplingNA theme and sample 2 from each themeKA sample of visual stories whereinKAs a support setD _sptFor training with few samples, remainderKIs taken as a query setD _qryThe method is used for verifying the learning effect of the few samples. Wherein,N、Kthe specific value of (a) can be determined according to the specific situation of the data set, so as to meet the training requirement of the model. For example, in a set of wedding-themed visual story samples, each sample contains a picture and its corresponding story text Truth value, i.e., group-Truth, for subsequent model training.

S2: and respectively coding the story text and the image sequence in the vision story sample supporting the concentration for training into story characteristics and image sequence characteristics, and storing for later use.

In this embodiment, the step S2 may be implemented by the following steps:

S22: extracting image sequence characteristics from all image sequences in the support set by using a convolutional neural network and a visual semantic encoder to obtain an image sequence characteristic set

Features of each image sequence

Semantic information characterizing an image sequence.

Wherein for each image sequence in the support setA _i={a ₁,…,a _m}，a _jIs shown asjThe number of images is one of,mfor the length of the image sequence, the convolutional neural network extracts each image in the image sequencea _jIs characterized in thatf _jObtaining the corresponding image of the image sequenceSet of featuresF _I= {f ₁,…,f _mAnd will be assembledF _IEach of the characteristics off _jSequentially sending the images into a visual semantic encoder based on a gating cycle unit to obtain time sequence visual semantic features of the image sequenceV={v₁,…,v_mIn which v is_jPresentation processing support set-time gated loop unitjThe hidden state of the time is taken as the visual semantic feature v of the last time of the gating cycle unit_mAs a feature of the image sequence characterizing the image sequence

。mThe specific value of (a) can be adjusted as required, and in this embodiment, the setting ism=5。

S3: and (4) extracting time sequence visual semantic features and image sequence features from the image sequence in the query stage, and calculating to obtain a prototype vector by combining the feature of supporting concentrated stories and the image sequence features in S2.

In this embodiment, the step S3 may be implemented by the following steps:

s31: for each sample in the query set, the same convolutional neural network and visual semantic encoder as in S2 are used to extract the time-sequential visual semantic features of the image sequence in the sampleV _qry={v'₁,…,v'_mV 'therein'_jIndicating a time-gated loop unit for processing a set of queriesjTaking the visual semantic feature v 'of the last moment of the gating cycle unit in the hidden state of the moment'_mAs a feature of the image sequence characterizing the image sequence

；

S32: by means of an attention mechanism, a story prototype vector is further calculated by combining story features and image sequence features of the support set obtained in S2:

∈R ^dk1×is an image sequence feature of a single image sequence in the query set,

∈R ^{K dk×}a set of image sequence features representing a support set, a superscript T representing a transpose,S _spt∈R ^{K dk×}a set of story features representing a corresponding sequence of images.

S4: the combined features of the image sequence features and prototype vectors obtained in S3 are decoded into a story descriptive text by a story decoder with attention mechanism.

In this embodiment, the step S4 may be implemented by the following steps:

s41: stitching the prototype vector in S32 with the image sequence features for initializing the hidden layer state of the gated loop unit of the story decoderh ₀：

Whereinh ₀Representing an initial hidden state of the gated loop unit, [;]the concatenation of the vectors is represented and,W _init∈R ^{dk dk2×}is a mapping matrix obtained by learning;

∈R ^dk1×is the image sequence feature of a single image sequence in the query set in S31,protofor the prototype vector obtained in S32,

the prototype vector introduced in the invention is used as the representation of the story under a theme, and captures elements common to the visual story under the current theme, such as emotional tendency, narration style and the like, word preference and the like. Initializing hidden states of gated cyclic units by stitching prototype vectors to image sequencesh ₀The subject information captured by the prototype vector may be made to go through the decoding stage, thereby directing the generation of the story descriptive text.

S42: according to the hidden layer state of the last moment of the gate control circulation unith _t-1And words predicted at the previous timewPredicting the current timetHidden layer state ofh _t：

h _t=GRU(h _t-1 ,E∙w _t-1)

Wherein GRU represents a gated loop unit that runs in a single step,Ethe matrix is embedded for the words and,w _t-1a one-hot vector representing a word predicted at a previous time;

wherein,c _tto representtVisual context characteristics of the moment;

wherein,p _wtrepresenting predictionstThe probability distribution of the words at the time of day,W _proj∈R ^{dk dk2×}andb _proj∈R ^dk1×respectively, mapping matrix and bias by learningAnd setting a coefficient.

Compared with using only gated cyclic unit timetHidden layer state ofh _tTo predicttTemporal word probability distribution, incorporated in the inventiontTemporal visual context features may enable a visual story model to better capture visual information and mitigate information loss due to a forgetting mechanism of a gated round robin unit to improve the correlation between the generated story text description and the given image sequence.

S5: and optimizing initial parameters of the visual narration model by using the comprehensive loss of the visual narration model constructed by taking S2-S4 as a framework on a query set through a meta-learning method.

In this embodiment, the step S5 may be implemented by the following steps:

s51: and constructing a visual story narration model by taking the processes of the steps S2-S4 as a framework. For the sample in S11NEach of the themes

And adjusting parameters by using a gradient descent method to obtain a set of model parameters which are adjusted according to the theme and correspond to each theme, wherein the gradient descent formula is as follows:

representation of parametersθDerivation is carried out;

s52: by minimizingNThe comprehensive loss of each subject on the query set further optimizes the initial parameters of the modelθ：

Wherein, E [. C]The display of the user can be expected to be,

in order to distribute the subject matter of all the subjects,

representing topics sampled from all topics.

In this embodiment, the step S6 may be implemented by the following steps:

s61: in the model speculation stage, the gradient descent method in S51 is used to adjust parameters according to the support set of the new theme, so that the visual narration model parameters are quickly adapted to the new theme, and new parameters of the model after being adapted to the new theme are obtainedθ'：

Wherein,θ'to adapt the model to the new parameters of the new topic,

a new theme is shown for the speculation phase,

as a subject

The model loss obtained by the above calculation;

here by using the adjusted parametersθ'The visual narration model adapts adequately to the current topic, and can produce a narrative text description with better relevance and expressiveness.

Therefore, in the step, the visual story narration model can generate the visual story description which is more consistent with the theme by adjusting the parameters through a few samples, the dependence of the model on the number of the new theme samples is reduced, and the method is more suitable for practical application scenes.

To verify the effect of the invention, the method of the invention was tested using the VIST dataset. A total of 41807 story samples were used for meta-training using the 50 themes with the largest number of samples, and 2031 samples were used for meta-testing for the remaining 19 themes. The test uses the automatic evaluation matrices BLEU and METEOR. On the new theme with few samples, the visual story narration effect generated by the method is good and is obviously superior to the existing fully supervised pre-training model, and the test results are shown in the following table:

method of producing a composite material	BLEU	METEOR
			Fully supervised pre-training model	6.3	29.0
The invention	8.1	31.1

In an example of the few-sample visual narration of this embodiment, given an image sequence containing 5 photos, Supervised is the story description generated by the Supervised model, TAVS is the story description generated by the present invention, and group-try is the artificial annotation result. In the case of learning only a small number of samples, the supervised model wrongly identifies graduation ceremonies as parades and the language expression is rigid. The invention can well adjust the parameters, generates the description about graduation ceremonies, better accords with the image sequence theme, and has more flexible language expression.

In general, the present invention enables models to fully capture the visual features and linguistic styles of new topics by encoding a small number of training samples of the new topics into prototypes and providing them to the visual narrative model as a reference during the inference stage, resulting in a narrative with better relevance and expressiveness.

The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims

1. A method for small sample visual narration based on theme adaptation and prototype coding, comprising the steps of:

2. The few-sample visual narration method based on theme adaptation and prototype coding according to claim 1, wherein the specific method of S1 is as follows:

3. The few-sample visual narration method based on theme adaptation and prototype coding according to claim 2, wherein the specific sub-steps of S2 are as follows:

Features of each image sequence

Semantic information characterizing an image sequence.

4. The few-sample visual narration method based on theme adaptation and prototype coding according to claim 3, wherein in the step S22, for each image sequence in the support setA _i={a ₁,…,a _m}，a _jIs shown asjThe number of images is one of,mfor the length of the image sequence, the convolutional neural network extracts each image in the image sequencea _jIs characterized in thatf _jObtaining a set of image characteristics corresponding to the image sequenceF _I= {f ₁,…,f _mAnd will be assembledF _IEach feature is sent to a visual semantic encoder based on a gating cycle unit in sequence to obtain the time sequence visual semantic features of the image sequenceV={v₁,…,v_mIn which v is_jPresentation processing support set-time gated loop unitjThe hidden state of the time is taken as the visual semantic feature v of the last time of the gating cycle unit_mAs a feature of the image sequence characterizing the image sequence

。

5. The few-sample visual narration method based on theme adaptation and prototype coding according to claim 4, wherein the specific sub-steps of S3 are as follows:

s31: for each sample in the query set, the same convolutional neural network and visual semantic encoder as in S2 are used to extract the time-sequential visual semantic features of the image sequence in the sampleSign forV _qry={v'₁,…,v'_mV 'therein'_jIndicating a time-gated loop unit for processing a set of queriesjTaking the visual semantic feature v 'of the last moment of the gating cycle unit in the hidden state of the moment'_mAs a feature of the image sequence characterizing the image sequence

；

the superscript T of (a) denotes transpose.

6. The few-sample visual narration method based on theme adaptation and prototype coding according to claim 5, wherein the specific sub-steps of S4 are as follows:

wherein,c _tto representtVisual context characteristics of the moment;

7. The few-sample visual narration method based on theme adaptation and prototype coding according to claim 6, wherein the specific sub-steps of S5 are as follows:

8. The few-sample visual narration method based on topic adaptation and prototype coding according to claim 7, wherein in S51, the formula for calculating the adjusted model parameters by using the gradient descent method is as follows:

representation of parametersθDerivation is carried out;

wherein, E [. C]The display of the user can be expected to be,

in order to distribute the subject matter of all the subjects,

representing topics sampled from all topics.

9. The few-sample visual narration method based on theme adaptation and prototype coding according to claim 8, wherein the specific sub-steps of S6 are as follows:

s61: in the model guessing stage, the gradient descent method described in S51 is used to adjust the parameters according to the support set of the new theme, so that the parameters of the visual narration model are quickly adapted to the new theme to obtain a model adapted to the new themeNew parameters after a topicθ'；

10. The method of claim 3, wherein the text encoder, the visual semantic encoder, and the story decoder are each a recurrent neural network based on gated recurrent units.