CN111708904A - Few-sample visual story narration method based on theme adaptation and prototype coding - Google Patents

Few-sample visual story narration method based on theme adaptation and prototype coding

Info

Publication number
CN111708904A
CN111708904A (application CN202010857191.9A)
Authority
CN
China
Prior art keywords
visual
story
image sequence
model
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010857191.9A
Other languages
Chinese (zh)
Inventor
庄越挺
浦世亮
汤斯亮
李嘉成
吴飞
肖俊
李玺
张世峰
任文奇
陆展鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010857191.9A priority Critical patent/CN111708904A/en
Publication of CN111708904A publication Critical patent/CN111708904A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a few-sample visual story narration method based on theme adaptation and prototype coding. The method first divides the data set by theme, samples a batch of themes in each training round, and divides each theme into a support set and a query set; time-sequential visual semantic features and image sequence features are extracted from the samples in the query set, and prototype vectors are calculated by combining them with the story features and image sequence features extracted in advance from the support set; the prototype vector is then combined with the image sequence features and decoded to obtain the story description text. The overall visual storytelling model further optimizes its initial parameters through a meta-learning method according to the comprehensive loss computed on the query set. In the inference stage, the model adjusts its parameters with only a few samples and generates story text for a new image sequence. By combining prototype coding with meta-learning, the constructed model can quickly adapt to a theme and can better generate story text descriptions that match the theme of the image sequence.

Description

Few-sample visual story narration method based on theme adaptation and prototype coding
Technical Field
The invention relates to vision and language, and in particular to a few-sample visual story narration method based on theme adaptation and prototype coding.
Background
Vision and Language is an interdisciplinary field that integrates computer vision with natural language processing. With the great breakthroughs brought by deep learning in both fields, cross-modal tasks such as image captioning, image question answering and image retrieval have produced far-reaching results. Recently, researchers have further begun to explore the Visual Storytelling task, which generates narrative stories from image sequences.
In a visual storytelling task, given a sequence of images with contextual associations, the model is asked to output a story described in natural language with a narrative style. The characteristics of the task require that the model not only correctly identify the objects and their attributes in the images, but also fully understand the associations among multiple images, mine the implicit information in the image sequence in both time and space, and make appropriate inferences according to changes in the visual content, so as to finally generate a coherent and fluent narrative story. Visual storytelling techniques can generate descriptions for sequences of images taken by users, for quick sharing to social media or for later retrieval. As a more complex cross-modal task, visual storytelling also reflects a machine's ability to understand image sequences and organize natural language.
Current mainstream visual storytelling models are inspired by image captioning models, adopt a hierarchical encoder-decoder framework, and are trained in a supervised manner. Much of the previous work has focused on designing complex model structures, which typically require large amounts of manually annotated data. However, annotation for visual storytelling is expensive and complex, so large amounts of new data cannot be annotated, and this becomes a bottleneck for supervised learning methods. On the other hand, previous studies on topic models have shown that themes in the real world generally follow a long-tailed distribution, which means that practical application scenarios contain many new themes not covered by the training data set, and the number of samples for these new themes is scarce. Therefore, traditional supervised models are not suitable for new themes with scarce samples, and visual storytelling in the few-sample setting is considered closer to real-life application scenarios.
Disclosure of Invention
The invention aims to provide a few-sample visual story narration method based on theme adaptation and prototype coding, addressing the problems that themes in the visual storytelling task follow a long-tailed distribution and that new themes have too few samples to suit traditional supervised models.
In order to achieve the above purpose, the invention specifically adopts the following technical scheme:
A few-sample visual story narration method based on theme adaptation and prototype coding, comprising the steps of:
s1: dividing a visual story data set by theme, sampling a batch of themes in each training round, and dividing each theme into a support set and a query set;
s2: encoding the story texts and image sequences of the visual story samples in the support set used for training into story features and image sequence features, respectively, and storing them for later use;
s3: extracting time-sequential visual semantic features and image sequence features from the image sequences in the query set, and calculating a prototype vector by combining them with the support-set story features and image sequence features of S2;
s4: decoding the combined features of the image sequence features and the prototype vector obtained in S3 into a story description text through a story decoder with an attention mechanism;
s5: optimizing the initial parameters of the visual story narration model, constructed with S2-S4 as the framework, through a meta-learning method using its comprehensive loss on the query sets;
s6: in the inference stage, adjusting the parameters of the visual story narration model through few-sample learning on the support set of a new theme, and then using the adjusted model to generate narrative description text for the samples in the query set.
On the basis of the above technical solution, each step of the invention can further be implemented in the following specific manner.
Preferably, the specific method of S1 is as follows:
s11: dividing the visual story data set by theme; in each training round, sampling N themes and sampling 2K visual story samples from each theme, of which K serve as the support set for few-sample training and the remaining K serve as the query set for verifying the few-sample learning effect.
Preferably, the specific sub-steps of S2 are as follows:
s21: using a text encoder based on gated recurrent units to extract story features S_spt = {s_1, …, s_K} from the story texts of all samples in the support set after passing them through a word embedding layer;
s22: using a convolutional neural network and a visual semantic encoder to extract image sequence features from all image sequences in the support set, obtaining the image sequence feature set V̄_spt = {v̄_1, …, v̄_K}, where each image sequence feature v̄_i characterizes the semantic information of one image sequence.
Further, in S22, for each image sequence A_i = {a_1, …, a_m} in the support set, where a_j denotes the j-th image and m is the length of the image sequence, the convolutional neural network extracts the feature f_j of each image a_j to obtain the image feature set F_I = {f_1, …, f_m} corresponding to the image sequence; each feature in F_I is then fed in order into a visual semantic encoder based on gated recurrent units to obtain the time-sequential visual semantic features V = {v_1, …, v_m} of the image sequence, where v_j denotes the hidden state of the gated recurrent unit at step j when processing the support set; the visual semantic feature v_m at the last step of the gated recurrent unit is taken as the image sequence feature v̄_i = v_m characterizing the image sequence.
Further, the specific sub-steps of S3 are as follows:
s31: for each sample in the query set, using the same convolutional neural network and visual semantic encoder as in S2 to extract the time-sequential visual semantic features V_qry = {v'_1, …, v'_m} of the image sequence in the sample, where v'_j denotes the hidden state of the gated recurrent unit at step j when processing the query set; the visual semantic feature v'_m at the last step of the gated recurrent unit is taken as the image sequence feature v̄_qry = v'_m characterizing the image sequence;
s32: further calculating a story prototype vector through an attention mechanism, combining the story features and image sequence features of the support set described in S2:

proto = softmax( v̄_qry V̄_spt^T / √d_k ) S_spt

where proto ∈ R^{d_k} denotes the prototype vector, d_k denotes the dimension of the features, softmax(·) denotes the softmax function, and the superscript T denotes the transpose.
Further, the specific sub-steps of S4 are as follows:
s41: splicing the prototype vector with the image sequence feature to initialize the hidden state h_0 of the gated recurrent unit of the story decoder;
s42: predicting the hidden state h_t at the current step t from the hidden state h_{t-1} of the gated recurrent unit at the previous step and the word w predicted at the previous step;
s43: computing the visual context feature at step t through the attention mechanism:

c_t = softmax( h_t V_qry^T / √d_k ) V_qry

where c_t denotes the visual context feature at step t;
s44: predicting the word probability distribution at step t using the visual context feature at step t and the hidden state of the gated recurrent unit:

p_{w_t} = softmax( W_proj [h_t ; c_t] + b_proj )

where p_{w_t} denotes the predicted word probability distribution at step t, and W_proj and b_proj are respectively a mapping matrix and a bias obtained by learning.
Further, the specific sub-steps of S5 are as follows:
s51: constructing a visual story narration model with S2-S4 as the framework; for each of the N themes T_i sampled in S11, adjusting the parameters by gradient descent to obtain a set of theme-adapted model parameters corresponding to each theme;
s52: further optimizing the initial parameters θ of the model by minimizing the comprehensive loss of the N themes on the query sets.
Further, in S51, the adjusted model parameters are calculated by gradient descent as:

θ_i' = θ − α ∇_θ L_{T_i}(f_θ)

where θ_i' denotes the new parameters obtained after adjusting the initial parameters on the i-th theme, θ denotes the initial parameters of the model, f_θ denotes the model under the initial parameters θ, L_{T_i}(f_θ) is the model loss computed on the i-th theme, obtained as the cross entropy between the predicted word distribution and the ground-truth distribution, α is the learning rate of the parameter update, and ∇_θ denotes the gradient with respect to θ;
in S52, the comprehensive loss function used to further optimize the initial parameters θ of the model is:

min_θ E_{T_i ∼ p(T)} [ L_{T_i}(f_{θ_i'}) ]

where E[·] denotes expectation, p(T) is the theme distribution over all themes, and T_i denotes a theme sampled from all themes.
Further, the specific sub-steps of S6 are as follows:
s61: in the model inference stage, adjusting the parameters according to the support set of the new theme using the gradient descent method described in S51, so that the parameters of the visual story narration model quickly adapt to the new theme, obtaining the new parameters θ' of the model after adaptation;
s62: using the visual story narration model f_θ' with the new parameters θ' to generate story description text for the image sequences in the inference stage.
Preferably, the text encoder, the visual semantic encoder and the story decoder are all recurrent neural networks based on gated recurrent units.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with visual story narration methods based on supervised learning, the meta-learning-based visual story narration method of the invention has better theme generalization capability. The model parameters can be rapidly adjusted with a small number of samples of a new theme, so that visual story descriptions more consistent with the theme are generated, the model's dependence on the number of new-theme samples is reduced, and the method is better suited to practical application scenarios.
2. The invention encodes a small number of training samples of the new theme into a prototype and provides it to the visual story narration model as a reference in the inference stage, so that the model can fully capture the visual characteristics and language style of the new theme, and the generated story descriptions have better relevance and expressiveness.
Drawings
Fig. 1 is a flow diagram of the few-sample visual story narration method based on theme adaptation and prototype coding.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
In a preferred embodiment of the present invention, as shown in Fig. 1, a few-sample visual story narration method based on theme adaptation and prototype coding is provided. The basic idea of the invention is to first divide the data set by theme, sample a batch of themes in each training round, and divide each theme into a support set and a query set; extract time-sequential visual semantic features and image sequence features from the samples in the query set, and calculate prototype vectors by combining them with the story features and image sequence features extracted in advance from the support set; and combine the prototype vector with the image sequence features and decode them to obtain the story description text. The overall visual storytelling model further optimizes its initial parameters through a meta-learning method according to the comprehensive loss computed on the query set. In the inference stage, the model adjusts its parameters with only a few samples and generates story text for a new image sequence. The overall framework of the invention can be divided into a prototype coding part and a visual context coding part, in which the text encoder, the visual semantic encoder and the story decoder are all recurrent neural networks based on gated recurrent units (GRU).
The specific steps of the few-sample visual story narration method based on theme adaptation and prototype coding are described as follows:
s1: the visual story data set is divided by theme, a batch of themes is sampled in each training round, and each theme is divided into a support set and a query set.
In this embodiment, the specific partitioning method is as follows:
s11: the visual story data set is divided by subject, eachWheel training samplingNA theme and sample 2 from each themeKA sample of visual stories whereinKAs a support setD spt For training with few samples, remainderKIs taken as a query setD qry The method is used for verifying the learning effect of the few samples. Wherein the content of the first and second substances,N、Kthe specific value of (a) can be determined according to the specific situation of the data set, so as to meet the training requirement of the model. For example, in a set of wedding-themed visual story samples, each sample contains a picture and its corresponding story text Truth value, i.e., group-Truth, for subsequent model training.
S2: and respectively coding the story text and the image sequence in the vision story sample supporting the concentration for training into story characteristics and image sequence characteristics, and storing for later use.
In this embodiment, the step S2 may be implemented by the following steps:
s21: story features are extracted from story text after all samples in a support set pass through a word embedding layer by using a text encoder based on a gated cyclic unitS spt ={s1,…,sK};
S22: extracting image sequence characteristics from all image sequences in the support set by using a convolutional neural network and a visual semantic encoder to obtain an image sequence characteristic set
Figure 812165DEST_PATH_IMAGE016
Features of each image sequence
Figure 493814DEST_PATH_IMAGE002
Semantic information characterizing an image sequence.
Wherein for each image sequence in the support setA i ={a 1,…,a m },a j Is shown asjThe number of images is one of,mfor the length of the image sequence, the convolutional neural network extracts each image in the image sequencea j Is characterized in thatf j Obtaining the corresponding image of the image sequenceSet of featuresF I = {f 1 ,…,f m And will be assembledF I Each of the characteristics off j Sequentially sending the images into a visual semantic encoder based on a gating cycle unit to obtain time sequence visual semantic features of the image sequenceV={v1,…,v m In which v is j Presentation processing support set-time gated loop unitjThe hidden state of the time is taken as the visual semantic feature v of the last time of the gating cycle unit m As a feature of the image sequence characterizing the image sequence
Figure 402864DEST_PATH_IMAGE003
mThe specific value of (a) can be adjusted as required, and in this embodiment, the setting ism=5。
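A minimal PyTorch sketch of the encoding in S22, assuming a ResNet-152 backbone and a single-layer GRU; the feature dimension d_k = 512 and the module names are illustrative assumptions, since the patent does not fix a particular CNN or dimensionality.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisualSemanticEncoder(nn.Module):
    """Encodes an image sequence into time-sequential visual semantic features V = {v_1..v_m}
    and an image sequence feature (the last GRU hidden state v_m)."""

    def __init__(self, d_k=512):
        super().__init__()
        cnn = models.resnet152(weights=None)                   # convolutional feature extractor
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])   # drop the classification head
        self.proj = nn.Linear(2048, d_k)                       # map CNN features to d_k
        self.gru = nn.GRU(d_k, d_k, batch_first=True)          # visual semantic encoder

    def forward(self, images):                                 # images: (m, 3, H, W)
        f = self.cnn(images).flatten(1)                        # F_I = {f_1..f_m}, shape (m, 2048)
        f = self.proj(f).unsqueeze(0)                          # (1, m, d_k)
        V, h_last = self.gru(f)                                # V: (1, m, d_k); h_last: (1, 1, d_k)
        return V.squeeze(0), h_last.view(-1)                   # time-sequential features and v_m
```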
S3: and (4) extracting time sequence visual semantic features and image sequence features from the image sequence in the query stage, and calculating to obtain a prototype vector by combining the feature of supporting concentrated stories and the image sequence features in S2.
In this embodiment, the step S3 may be implemented by the following steps:
s31: for each sample in the query set, the same convolutional neural network and visual semantic encoder as in S2 are used to extract the time-sequential visual semantic features of the image sequence in the sampleV qry ={v'1,…,v' m V 'therein' j Indicating a time-gated loop unit for processing a set of queriesjTaking the visual semantic feature v 'of the last moment of the gating cycle unit in the hidden state of the moment' m As a feature of the image sequence characterizing the image sequence
Figure 3609DEST_PATH_IMAGE017
S32: by means of an attention mechanism, a story prototype vector is further calculated by combining story features and image sequence features of the support set obtained in S2:
Figure 889788DEST_PATH_IMAGE018
wherein the content of the first and second substances,protoR dka prototype vector is represented by a vector of a prototype,d k representing the degree of the feature, softmax (·) represents the softmax function,
Figure 234182DEST_PATH_IMAGE019
R dkis an image sequence feature of a single image sequence in the query set,
Figure DEST_PATH_IMAGE020
R K dk×a set of image sequence features representing a support set, a superscript T representing a transpose,S spt R K dk×a set of story features representing a corresponding sequence of images.
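The prototype computation in S32 amounts to scaled dot-product attention in which the query-set image sequence feature attends over the K support image sequence features to weight the K support story features. A minimal sketch under that reading follows; the function and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_vector(v_qry, V_spt, S_spt):
    """Compute the story prototype vector for one query sample.

    v_qry: (d_k,)    image sequence feature of the query image sequence
    V_spt: (K, d_k)  image sequence features of the K support samples
    S_spt: (K, d_k)  story features of the K support samples
    """
    d_k = v_qry.size(-1)
    scores = v_qry @ V_spt.t() / d_k ** 0.5   # (K,) similarity to each support sample
    weights = F.softmax(scores, dim=-1)       # attention weights over the support set
    proto = weights @ S_spt                   # (d_k,) weighted combination of story features
    return proto
```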
S4: the combined features of the image sequence features and prototype vectors obtained in S3 are decoded into a story descriptive text by a story decoder with attention mechanism.
In this embodiment, the step S4 may be implemented by the following steps:
s41: stitching the prototype vector in S32 with the image sequence features for initializing the hidden layer state of the gated loop unit of the story decoderh 0
Figure 404263DEST_PATH_IMAGE021
Whereinh 0 Representing an initial hidden state of the gated loop unit, [;]the concatenation of the vectors is represented and,W init R dk dkis a mapping matrix obtained by learning;
Figure 441489DEST_PATH_IMAGE019
R dkis the image sequence feature of a single image sequence in the query set in S31,protofor the prototype vector obtained in S32,
the prototype vector introduced in the invention is used as the representation of the story under a theme, and captures elements common to the visual story under the current theme, such as emotional tendency, narration style and the like, word preference and the like. Initializing hidden states of gated cyclic units by stitching prototype vectors to image sequencesh 0 The subject information captured by the prototype vector may be made to go through the decoding stage, thereby directing the generation of the story descriptive text.
S42: according to the hidden layer state of the last moment of the gate control circulation unith t-1 And words predicted at the previous timewPredicting the current timetHidden layer state ofh t
h t =GRU(h t-1 ,E∙w t-1 )
Wherein GRU represents a gated loop unit that runs in a single step,Ethe matrix is embedded for the words and,w t-1 a one-hot vector representing a word predicted at a previous time;
s43: by means of attention, computingtVisual context characteristics at the moment:
Figure 923286DEST_PATH_IMAGE007
wherein the content of the first and second substances,c t to representtVisual context characteristics of the moment;
s43: by usingtTemporal visual context features and gated cyclic unit hidden state predictiontWord probability distribution at time:
Figure 196004DEST_PATH_IMAGE008
wherein the content of the first and second substances,p wt representing predictionstThe probability distribution of the words at the time of day,W proj R dk dkandb proj R dkrespectively, mapping matrix and bias by learningAnd setting a coefficient.
Compared with using only gated cyclic unit timetHidden layer state ofh t To predicttTemporal word probability distribution, incorporated in the inventiontTemporal visual context features may enable a visual story model to better capture visual information and mitigate information loss due to a forgetting mechanism of a gated round robin unit to improve the correlation between the generated story text description and the given image sequence.
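A minimal sketch of one decoding step covering S41-S44, assuming a GRUCell decoder, the same scaled dot-product attention as above for the visual context, and a learned projection over the concatenation [h_t; c_t]; the exact projection dimensions are assumptions, as the patent does not fully specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StoryDecoder(nn.Module):
    """Story decoder with attention: h_0 from [v_qry; proto], GRU step, visual attention, word softmax."""

    def __init__(self, vocab_size, d_k=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_k)        # word embedding matrix E
        self.init_proj = nn.Linear(2 * d_k, d_k)          # W_init applied to [v_qry; proto]
        self.gru_cell = nn.GRUCell(d_k, d_k)              # single-step gated recurrent unit
        self.out_proj = nn.Linear(2 * d_k, vocab_size)    # W_proj, b_proj applied to [h_t; c_t]

    def init_hidden(self, v_qry, proto):
        """h_0: initialize the decoder from the query image sequence feature and the prototype."""
        return self.init_proj(torch.cat([v_qry, proto], dim=-1)).unsqueeze(0)   # (1, d_k)

    def step(self, word_prev, h_prev, V_qry):
        """word_prev: (1,) previous word id; h_prev: (1, d_k); V_qry: (m, d_k) visual features."""
        h_t = self.gru_cell(self.embed(word_prev), h_prev)            # h_t = GRU(h_{t-1}, E*w_{t-1})
        scores = h_t @ V_qry.t() / V_qry.size(-1) ** 0.5              # (1, m) attention scores
        c_t = F.softmax(scores, dim=-1) @ V_qry                       # (1, d_k) visual context c_t
        p_wt = F.softmax(self.out_proj(torch.cat([h_t, c_t], dim=-1)), dim=-1)  # word distribution
        return p_wt, h_t
```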
S5: and optimizing initial parameters of the visual narration model by using the comprehensive loss of the visual narration model constructed by taking S2-S4 as a framework on a query set through a meta-learning method.
In this embodiment, the step S5 may be implemented by the following steps:
s51: and constructing a visual story narration model by taking the processes of the steps S2-S4 as a framework. For the sample in S11NEach of the themes
Figure 79647DEST_PATH_IMAGE009
And adjusting parameters by using a gradient descent method to obtain a set of model parameters which are adjusted according to the theme and correspond to each theme, wherein the gradient descent formula is as follows:
Figure 287774DEST_PATH_IMAGE010
wherein the content of the first and second substances,θ i 'indicating the initial parameter is iniNew parameters obtained after adjustment on the subjects,θthe initial parameters of the model are represented,f θ expressed in the initial parametersθThe model of the following model is shown,
Figure 132233DEST_PATH_IMAGE011
is as followsiA model loss calculated on each topic, the loss being obtained by calculating the cross entropy of the word distribution and the true distribution,αin order to update the learning rate of the parameters,
Figure 818430DEST_PATH_IMAGE012
representation of parametersθDerivation is carried out;
s52: by minimizingNThe comprehensive loss of each subject on the query set further optimizes the initial parameters of the modelθ
Figure 556579DEST_PATH_IMAGE013
Wherein, E [. C]The display of the user can be expected to be,
Figure 827285DEST_PATH_IMAGE014
in order to distribute the subject matter of all the subjects,
Figure 18095DEST_PATH_IMAGE015
representing topics sampled from all topics.
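A minimal sketch of the meta-training round in S51-S52. It uses a first-order approximation of the MAML-style update (an assumption made here to keep the sketch short; the patent's formulas do not state whether second-order gradients are used). `compute_loss` stands for the cross-entropy story loss of the model built in S2-S4.

```python
import copy
import torch

def meta_train_round(model, episode, compute_loss, inner_lr=0.01, meta_lr=0.001):
    """One meta-training round over an episode of (support_set, query_set) pairs.

    model:        the visual storytelling model (S2-S4) with initial parameters theta
    compute_loss: function (model, samples) -> scalar cross-entropy loss
    First-order approximation: query-set gradients are taken w.r.t. the adapted copy
    and applied back to the initial parameters.
    """
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support_set, query_set in episode:
        adapted = copy.deepcopy(model)                      # theta_i' starts from theta
        adapted.zero_grad()
        inner_loss = compute_loss(adapted, support_set)     # L_Ti(f_theta) on the support set
        inner_loss.backward()
        with torch.no_grad():                               # theta_i' = theta - alpha * grad
            for p in adapted.parameters():
                if p.grad is not None:
                    p -= inner_lr * p.grad
        adapted.zero_grad()
        outer_loss = compute_loss(adapted, query_set)       # L_Ti(f_theta_i') on the query set
        outer_loss.backward()
        for g, p in zip(meta_grads, adapted.parameters()):  # accumulate first-order meta-gradients
            if p.grad is not None:
                g += p.grad / len(episode)
    with torch.no_grad():                                   # update the initial parameters theta
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g
```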
S6: in the guessing stage, few samples are learned according to the support set of the new theme to adjust parameters of the visual narration model, and then the visual narration model with the adjusted parameters is used for generating the narrative description text for the samples in the query set.
In this embodiment, the step S6 may be implemented by the following steps:
s61: in the model speculation stage, the gradient descent method in S51 is used to adjust parameters according to the support set of the new theme, so that the visual narration model parameters are quickly adapted to the new theme, and new parameters of the model after being adapted to the new theme are obtainedθ'
Figure 507982DEST_PATH_IMAGE022
Wherein the content of the first and second substances,θ'to adapt the model to the new parameters of the new topic,
Figure 976004DEST_PATH_IMAGE023
a new theme is shown for the speculation phase,
Figure 525934DEST_PATH_IMAGE024
as a subject
Figure 594253DEST_PATH_IMAGE023
The model loss obtained by the above calculation;
here by using the adjusted parametersθ'The visual narration model adapts adequately to the current topic, and can produce a narrative text description with better relevance and expressiveness.
S62: using with new parametersθ'Visual story narration modelf θ' Story description text is generated for the sequence of images at the guessing stage.
In this way, by adjusting its parameters with only a few samples, the visual story narration model can generate visual story descriptions that better match the theme, reducing the model's dependence on the number of new-theme samples and making the method more suitable for practical application scenarios.
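A minimal sketch of the inference stage in S61-S62: a copy of the meta-learned model is fine-tuned on the new theme's support set for a few gradient steps and then used to decode stories for the query image sequences; `compute_loss`, `generate_story` and the step count are assumed helpers and hyperparameters, not specified by the patent.

```python
import copy
import torch

def adapt_and_generate(model, support_set, query_images, compute_loss, generate_story,
                       inner_lr=0.01, adapt_steps=1):
    """Adapt theta -> theta' on the new theme's support set, then decode stories."""
    adapted = copy.deepcopy(model)                    # keep the meta-learned parameters intact
    for _ in range(adapt_steps):
        adapted.zero_grad()
        loss = compute_loss(adapted, support_set)     # L_Tnew(f_theta) on the support set
        loss.backward()
        with torch.no_grad():                         # theta' = theta - alpha * grad
            for p in adapted.parameters():
                if p.grad is not None:
                    p -= inner_lr * p.grad
    adapted.eval()
    with torch.no_grad():
        return [generate_story(adapted, images) for images in query_images]
```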
To verify the effect of the invention, the method of the invention was tested on the VIST data set. The 50 themes with the largest number of samples, totaling 41807 story samples, were used for meta-training, and the 2031 samples of the remaining 19 themes were used for meta-testing. The tests use the automatic evaluation metrics BLEU and METEOR. On new themes with few samples, the visual stories generated by the method of the invention are of good quality and clearly superior to the existing fully supervised pre-trained model; the test results are shown in the following table:
Method  BLEU  METEOR
Fully supervised pre-training model  6.3  29.0
The invention  8.1  31.1
In an example of few-sample visual storytelling in this embodiment, given an image sequence containing 5 photos, Supervised denotes the story description generated by the supervised model, TAVS denotes the story description generated by the present invention, and Ground-Truth denotes the human annotation. Having learned from only a small number of samples, the supervised model mistakes a graduation ceremony for a parade and its language is rigid. The invention adjusts its parameters well, generates a description about the graduation ceremony that better matches the theme of the image sequence, and expresses it in more flexible language.
In general, the present invention enables models to fully capture the visual features and linguistic styles of new topics by encoding a small number of training samples of the new topics into prototypes and providing them to the visual narrative model as a reference during the inference stage, resulting in a narrative with better relevance and expressiveness.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A few-sample visual story narration method based on theme adaptation and prototype coding, comprising the steps of:
s1: dividing a visual story data set by theme, sampling a batch of themes in each training round, and dividing each theme into a support set and a query set;
s2: encoding the story texts and image sequences of the visual story samples in the support set used for training into story features and image sequence features, respectively, and storing them for later use;
s3: extracting time-sequential visual semantic features and image sequence features from the image sequences in the query set, and calculating a prototype vector by combining them with the support-set story features and image sequence features of S2;
s4: decoding the combined features of the image sequence features and the prototype vector obtained in S3 into a story description text through a story decoder with an attention mechanism;
s5: optimizing the initial parameters of the visual story narration model, constructed with S2-S4 as the framework, through a meta-learning method using its comprehensive loss on the query sets;
s6: in the inference stage, adjusting the parameters of the visual story narration model through few-sample learning on the support set of a new theme, and then using the adjusted model to generate narrative description text for the samples in the query set.
2. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 1, wherein the specific method of S1 is as follows:
s11: dividing the visual story data set by theme; in each training round, sampling N themes and sampling 2K visual story samples from each theme, of which K serve as the support set for few-sample training and the remaining K serve as the query set for verifying the few-sample learning effect.
3. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 2, wherein the specific sub-steps of S2 are as follows:
s21: using a text encoder based on gated recurrent units to extract story features S_spt = {s_1, …, s_K} from the story texts of all samples in the support set after passing them through a word embedding layer;
s22: using a convolutional neural network and a visual semantic encoder to extract image sequence features from all image sequences in the support set, obtaining the image sequence feature set V̄_spt = {v̄_1, …, v̄_K}, where each image sequence feature v̄_i characterizes the semantic information of one image sequence.
4. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 3, wherein in S22, for each image sequence A_i = {a_1, …, a_m} in the support set, where a_j denotes the j-th image and m is the length of the image sequence, the convolutional neural network extracts the feature f_j of each image a_j to obtain the image feature set F_I = {f_1, …, f_m} corresponding to the image sequence; each feature in F_I is then fed in order into a visual semantic encoder based on gated recurrent units to obtain the time-sequential visual semantic features V = {v_1, …, v_m} of the image sequence, where v_j denotes the hidden state of the gated recurrent unit at step j when processing the support set; the visual semantic feature v_m at the last step of the gated recurrent unit is taken as the image sequence feature v̄_i = v_m characterizing the image sequence.
5. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 4, wherein the specific sub-steps of S3 are as follows:
s31: for each sample in the query set, using the same convolutional neural network and visual semantic encoder as in S2 to extract the time-sequential visual semantic features V_qry = {v'_1, …, v'_m} of the image sequence in the sample, where v'_j denotes the hidden state of the gated recurrent unit at step j when processing the query set; the visual semantic feature v'_m at the last step of the gated recurrent unit is taken as the image sequence feature v̄_qry = v'_m characterizing the image sequence;
s32: further calculating a story prototype vector through an attention mechanism, combining the story features and image sequence features of the support set described in S2:

proto = softmax( v̄_qry V̄_spt^T / √d_k ) S_spt

where proto ∈ R^{d_k} denotes the prototype vector, d_k denotes the dimension of the features, softmax(·) denotes the softmax function, and the superscript T denotes the transpose.
6. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 5, wherein the specific sub-steps of S4 are as follows:
s41: splicing the prototype vector with the image sequence feature to initialize the hidden state h_0 of the gated recurrent unit of the story decoder;
s42: predicting the hidden state h_t at the current step t from the hidden state h_{t-1} of the gated recurrent unit at the previous step and the word w predicted at the previous step;
s43: computing the visual context feature at step t through the attention mechanism:

c_t = softmax( h_t V_qry^T / √d_k ) V_qry

where c_t denotes the visual context feature at step t;
s44: predicting the word probability distribution at step t using the visual context feature at step t and the hidden state of the gated recurrent unit:

p_{w_t} = softmax( W_proj [h_t ; c_t] + b_proj )

where p_{w_t} denotes the predicted word probability distribution at step t, and W_proj and b_proj are respectively a mapping matrix and a bias obtained by learning.
7. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 6, wherein the specific sub-steps of S5 are as follows:
s51: constructing a visual story narration model with S2-S4 as the framework; for each of the N themes T_i sampled in S11, adjusting the parameters by gradient descent to obtain a set of theme-adapted model parameters corresponding to each theme;
s52: further optimizing the initial parameters θ of the model by minimizing the comprehensive loss of the N themes on the query sets.
8. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 7, wherein in S51, the adjusted model parameters are calculated by gradient descent as:

θ_i' = θ − α ∇_θ L_{T_i}(f_θ)

where θ_i' denotes the new parameters obtained after adjusting the initial parameters on the i-th theme, θ denotes the initial parameters of the model, f_θ denotes the model under the initial parameters θ, L_{T_i}(f_θ) is the model loss computed on the i-th theme, obtained as the cross entropy between the predicted word distribution and the ground-truth distribution, α is the learning rate of the parameter update, and ∇_θ denotes the gradient with respect to θ;
in S52, the comprehensive loss function used to further optimize the initial parameters θ of the model is:

min_θ E_{T_i ∼ p(T)} [ L_{T_i}(f_{θ_i'}) ]

where E[·] denotes expectation, p(T) is the theme distribution over all themes, and T_i denotes a theme sampled from all themes.
9. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 8, wherein the specific sub-steps of S6 are as follows:
s61: in the model inference stage, adjusting the parameters according to the support set of the new theme using the gradient descent method described in S51, so that the parameters of the visual story narration model quickly adapt to the new theme, obtaining the new parameters θ' of the model after adaptation;
s62: using the visual story narration model f_θ' with the new parameters θ' to generate story description text for the image sequences in the inference stage.
10. The method of claim 3, wherein the text encoder, the visual semantic encoder, and the story decoder are each a recurrent neural network based on gated recurrent units.
CN202010857191.9A 2020-08-24 2020-08-24 Few-sample visual story narration method based on theme adaptation and prototype coding Pending CN111708904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010857191.9A CN111708904A (en) 2020-08-24 2020-08-24 Few-sample visual story narration method based on theme adaptation and prototype coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010857191.9A CN111708904A (en) 2020-08-24 2020-08-24 Few-sample visual story narration method based on theme adaptation and prototype coding

Publications (1)

Publication Number Publication Date
CN111708904A true CN111708904A (en) 2020-09-25

Family

ID=72547444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010857191.9A Pending CN111708904A (en) 2020-08-24 2020-08-24 Few-sample visual story narration method based on theme adaptation and prototype coding

Country Status (1)

Country Link
CN (1) CN111708904A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377990A (en) * 2021-06-09 2021-09-10 电子科技大学 Video/picture-text cross-modal matching training method based on meta-self learning
CN113515951A (en) * 2021-07-19 2021-10-19 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics
CN113762474A (en) * 2021-08-26 2021-12-07 厦门大学 Story ending generation method and storage medium for adaptive theme
CN113779938A (en) * 2021-08-13 2021-12-10 同济大学 System and method for generating coherent stories based on vision and theme cooperative attention
CN114419402A (en) * 2022-03-29 2022-04-29 中国人民解放军国防科技大学 Image story description generation method and device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555475A (en) * 2019-08-29 2019-12-10 华南理工大学 few-sample target detection method based on semantic information fusion
CN111046979A (en) * 2020-03-13 2020-04-21 成都晓多科技有限公司 Method and system for discovering badcase based on small sample learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555475A (en) * 2019-08-29 2019-12-10 华南理工大学 few-sample target detection method based on semantic information fusion
CN111046979A (en) * 2020-03-13 2020-04-21 成都晓多科技有限公司 Method and system for discovering badcase based on small sample learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIACHENG LI: "Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling", https://arxiv.org/abs/2008.04504 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377990A (en) * 2021-06-09 2021-09-10 电子科技大学 Video/picture-text cross-modal matching training method based on meta-self learning
CN113377990B (en) * 2021-06-09 2022-06-14 电子科技大学 Video/picture-text cross-modal matching training method based on meta-self learning
CN113515951A (en) * 2021-07-19 2021-10-19 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics
CN113515951B (en) * 2021-07-19 2022-07-05 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics
CN113779938A (en) * 2021-08-13 2021-12-10 同济大学 System and method for generating coherent stories based on vision and theme cooperative attention
CN113779938B (en) * 2021-08-13 2024-01-23 同济大学 System and method for generating coherent stories based on visual and theme cooperative attention
CN113762474A (en) * 2021-08-26 2021-12-07 厦门大学 Story ending generation method and storage medium for adaptive theme
CN114419402A (en) * 2022-03-29 2022-04-29 中国人民解放军国防科技大学 Image story description generation method and device, computer equipment and storage medium
CN114419402B (en) * 2022-03-29 2023-08-18 中国人民解放军国防科技大学 Image story description generation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
Keneshloo et al. Deep reinforcement learning for sequence-to-sequence models
CN111708904A (en) Few-sample visual story narration method based on theme adaptation and prototype coding
Fu et al. Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
US10664744B2 (en) End-to-end memory networks
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
KR101855597B1 (en) Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
Pan et al. Video captioning with transferred semantic attributes
CN110737769B (en) Pre-training text abstract generation method based on neural topic memory
Venugopalan et al. Sequence to sequence-video to text
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN108628935B (en) Question-answering method based on end-to-end memory network
Chen et al. Deep Learning for Video Captioning: A Review.
CN108153864A (en) Method based on neural network generation text snippet
CN108986186A (en) The method and system of text conversion video
CN109874029A (en) Video presentation generation method, device, equipment and storage medium
CN109919221B (en) Image description method based on bidirectional double-attention machine
Zhang et al. Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
CN111274790A (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
Li et al. UD_BBC: Named entity recognition in social network combined BERT-BiLSTM-CRF with active learning
CN112417092A (en) Intelligent text automatic generation system based on deep learning and implementation method thereof
CN107679225A (en) A kind of reply generation method based on keyword

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200925)