CN111708904A - Few-sample visual story narration method based on theme adaptation and prototype coding - Google Patents
Few-sample visual story narration method based on theme adaptation and prototype coding
- Publication number
- CN111708904A (application CN202010857191.9A)
- Authority
- CN
- China
- Prior art keywords
- visual
- story
- image sequence
- model
- theme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/535—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Image Analysis (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a few-sample visual story narration method based on theme adaptation and prototype coding. The method first divides the data set by theme, samples a batch of themes in each training round, and splits each theme into a support set and a query set; time-sequential visual semantic features and image sequence features are extracted from the samples in the query set, and prototype vectors are computed by combining them with the story features and image sequence features extracted in advance from the support set; the prototype vector is then combined with the image sequence features and decoded into a story description text. Through a meta-learning method, the overall visual storytelling model further optimizes its initial parameters according to the composite loss computed on the query set. In the inference stage, the model adjusts its parameters with only a few samples and generates story text for a new image sequence. By combining prototype coding with meta-learning, the constructed model can adapt quickly to a theme and better generate story descriptions that match the theme of the image sequence.
Description
Technical Field
The invention relates to the field of vision and language, and in particular to a few-sample visual story narration method based on theme adaptation and prototype coding.
Background
Vision and Language is an interdisciplinary field that integrates computer vision with natural language processing. With the breakthroughs that deep learning has brought to both fields, cross-modal tasks such as image captioning, image question answering and image retrieval have produced far-reaching results. Recently, researchers have further begun to explore the visual storytelling task (Visual Storytelling), which generates narrative stories from image sequences.
In the visual storytelling task, given a sequence of contextually related images, the model is asked to output a story described in natural language in a narrative style. The nature of the task requires the model not only to correctly recognize the objects in the images and their attributes, but also to fully understand the associations among multiple images, mine the information implicit in the image sequence in both time and space, and make appropriate inferences from changes in the visual content, so as to finally generate a coherent and fluent narrative story. Visual storytelling technology can generate descriptions for image sequences taken by users, for quick sharing to social media or for later retrieval. As a relatively complex cross-modal task, visual storytelling also reflects the ability of an intelligent agent to understand image sequences and organize natural language.
Mainstream visual storytelling models are currently inspired by image captioning models: they adopt a hierarchical encoder-decoder framework and are trained in a supervised manner. Much previous work has focused on designing complex model structures, which usually require large amounts of manually annotated data. However, annotation for visual storytelling is expensive and complex, so large amounts of new data cannot be annotated, and this becomes a bottleneck for supervised learning methods. On the other hand, previous studies on topic models have shown that themes in the real world generally follow a long-tail distribution, which means that practical application scenarios contain many new themes not covered by the training data set, and the number of samples for these new themes is scarce. Traditional supervised models are therefore not suited to new themes with scarce samples, and visual storytelling in a few-sample setting is considered closer to real-life application scenarios.
Disclosure of Invention
The invention aims to provide a few-sample visual story narration method based on theme adaptation and prototype coding, addressing the problems that themes in the visual storytelling task follow a long-tail distribution and that new themes with scarce samples are not suited to traditional supervised models.
In order to achieve the above purpose, the invention specifically adopts the following technical scheme:
A few-sample visual story narration method based on theme adaptation and prototype coding, comprising the following steps:
S1: dividing a visual story data set according to themes, sampling a batch of themes in each training round, and dividing each theme into a support set and a query set;
S2: encoding the story texts and image sequences of the visual story samples in the support set used for training into story features and image sequence features, respectively, and storing them for later use;
S3: extracting time-sequential visual semantic features and image sequence features from the image sequences in the query set, and computing a prototype vector by combining them with the support-set story features and image sequence features from S2;
S4: decoding the combined features of the image sequence features and the prototype vector obtained in S3 into a story description text through a story decoder with an attention mechanism;
S5: optimizing, through a meta-learning method, the initial parameters of the visual storytelling model constructed with S2-S4 as its framework, using the model's composite loss on the query set;
S6: in the inference stage, performing few-sample learning on the support set of the new theme to adjust the parameters of the visual storytelling model, and then using the adjusted model to generate narrative description texts for the samples in the query set.
On the basis of the above technical solution, the steps of the invention can further be implemented in the following specific manner.
Preferably, the specific method of S1 is as follows:
S11: dividing the visual story data set by theme; in each training round, sampling N themes and drawing 2K visual story samples from each theme, of which K samples serve as the support set for few-sample training and the remaining K samples serve as the query set for verifying the effect of few-sample learning.
Preferably, the specific sub-steps of S2 are as follows:
S21: passing the story texts of all samples in the support set through a word embedding layer and extracting story features S_spt = {s_1, …, s_K} with a text encoder based on gated recurrent units;
S22: extracting image sequence features from all image sequences in the support set with a convolutional neural network and a visual semantic encoder to obtain an image sequence feature set, in which each image sequence feature characterizes the semantic information of its image sequence.
Further, in S22, for each image sequence A_i = {a_1, …, a_m} in the support set, where a_j denotes the j-th image and m is the length of the image sequence, the convolutional neural network extracts the feature f_j of each image a_j, yielding the image feature set F_I = {f_1, …, f_m} corresponding to the image sequence; the features in F_I are fed in order into a visual semantic encoder based on gated recurrent units to obtain the time-sequential visual semantic features V = {v_1, …, v_m} of the image sequence, where v_j denotes the hidden state of the gated recurrent unit at step j when processing the support set, and the visual semantic feature v_m at the last step of the gated recurrent unit is taken as the image sequence feature characterizing the image sequence.
Further, the specific sub-steps of S3 are as follows:
S31: for each sample in the query set, extracting the time-sequential visual semantic features V_qry = {v'_1, …, v'_m} of the image sequence in the sample with the same convolutional neural network and visual semantic encoder as in S2, where v'_j denotes the hidden state of the gated recurrent unit at step j when processing the query set, and taking the visual semantic feature v'_m at the last step of the gated recurrent unit as the image sequence feature characterizing the image sequence;
S32: a story prototype vector is further calculated by an attention mechanism, incorporating the story features and image sequence features of the support set described in S2:
wherein,proto∈R dk1×a prototype vector is represented by a vector of a prototype,d k representing the degree of the feature, softmax (·) represents the softmax function,the superscript T of (a) denotes transpose.
Further, the specific sub-steps of S4 are as follows:
S41: concatenating the prototype vector with the image sequence feature to initialize the hidden state h_0 of the gated recurrent unit of the story decoder;
S42: predicting the hidden state h_t at the current time t from the hidden state h_{t-1} of the gated recurrent unit at the previous time and the word w predicted at the previous time;
S43: by means of attention, computingtVisual context characteristics at the moment:
wherein,c t to representtVisual context characteristics of the moment;
S44: predicting the word probability distribution at time t from the visual context feature at time t and the hidden state of the gated recurrent unit:
p_{w,t} = softmax(W_proj · [h_t; c_t] + b_proj)
where p_{w,t} denotes the predicted word probability distribution at time t, and W_proj ∈ R^{d_k×2d_k} and b_proj ∈ R^{d_k×1} are respectively a mapping matrix and a bias coefficient obtained by learning.
Further, the specific sub-steps of S5 are as follows:
S51: constructing the visual storytelling model with S2-S4 as its framework, and for each of the N themes T_i sampled in S11, adjusting the parameters by gradient descent to obtain a set of theme-adapted model parameters corresponding to each theme;
S52: further optimizing the initial parameters θ of the model by minimizing the composite loss of the N themes on the query set.
Further, in S51, the formula for computing the adjusted model parameters by gradient descent is as follows:
θ'_i = θ - α · ∇_θ L_{T_i}(f_θ)
where θ'_i denotes the new parameters obtained after adjusting the initial parameters on the i-th theme, θ denotes the initial parameters of the model, f_θ denotes the model under the initial parameters θ, L_{T_i} is the model loss computed on the i-th theme, obtained as the cross entropy between the predicted word distribution and the true distribution, α is the learning rate of the parameter update, and ∇_θ denotes differentiation with respect to the parameters θ;
In S52, the overall loss function used to further optimize the initial parameters θ of the model is:
min_θ E_{T_i ∼ p(T)} [ L_{T_i}(f_{θ'_i}) ]
where E[·] denotes the expectation, p(T) is the distribution over all themes, and T_i denotes a theme sampled from all themes.
Further, the specific sub-steps of S6 are as follows:
S61: in the inference stage of the model, adjusting the parameters on the support set of the new theme with the gradient descent method described in S51, so that the parameters of the visual storytelling model adapt quickly to the new theme, obtaining the new parameters θ' of the model after adaptation to the new theme;
S62: using the visual storytelling model f_θ' with the new parameters θ' to generate story description texts for the image sequences in the inference stage.
Preferably, the text encoder, the visual semantic encoder and the story decoder are all recurrent neural networks based on gated recurrent units.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with visual storytelling methods based on supervised learning, the meta-learning-based method of the invention has better theme generalization ability. The model parameters can be adjusted quickly with a small number of samples of a new theme, so that visual story descriptions more consistent with the theme are generated, the model's dependence on the number of new-theme samples is reduced, and the method is better suited to practical application scenarios.
2. The invention encodes a small number of training samples of the new theme into prototypes and provides them to the visual storytelling model as a reference in the inference stage, so that the model can fully capture the visual characteristics and language style of the new theme, and the generated story descriptions have better relevance and expressiveness.
Drawings
Fig. 1 is a flow diagram of the few-sample visual story narration method based on theme adaptation and prototype coding.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
In a preferred embodiment of the present invention, as shown in Fig. 1, a few-sample visual story narration method based on theme adaptation and prototype coding is provided. The basic idea of the invention is as follows: first, the data set is divided by theme, a batch of themes is sampled in each training round, and each theme is divided into a support set and a query set; time-sequential visual semantic features and image sequence features are extracted from the samples in the query set, and prototype vectors are computed by combining them with the story features and image sequence features extracted in advance from the support set; the prototype vector is then combined with the image sequence features and decoded into a story description text. Through a meta-learning method, the overall visual storytelling model further optimizes its initial parameters according to the composite loss computed on the query set. In the inference stage, the model adjusts its parameters with only a few samples and generates story text for a new image sequence. The overall framework of the invention can be divided into a prototype coding part and a visual context coding part, in which the text encoder, the visual semantic encoder and the story decoder are all recurrent neural networks based on gated recurrent units (GRU).
The specific steps of the few-sample visual story narration method based on theme adaptation and prototype coding are as follows:
s1: the visual story data set is divided according to themes, a batch of themes are sampled in each training turn, and each theme is divided into a support set and a query set.
In this embodiment, the specific partitioning method is as follows:
S11: The visual story data set is divided by theme. In each training round, N themes are sampled and 2K visual story samples are drawn from each theme, of which K form the support set D_spt used for few-sample training and the remaining K form the query set D_qry used to verify the effect of few-sample learning. The specific values of N and K can be determined according to the data set so as to meet the training requirements of the model. For example, in a set of wedding-themed visual story samples, each sample contains images and the corresponding ground-truth story text (Ground-Truth), which is used for subsequent model training.
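For illustration, the episodic sampling of S11 can be sketched in Python as follows; the sketch assumes the data set has already been grouped by theme into a dictionary, and all function and variable names are illustrative rather than part of the invention:

```python
import random

def sample_episode(samples_by_theme, n_themes, k):
    """Draw one meta-training episode: N themes, each split into
    a K-sample support set and a K-sample query set (S11)."""
    themes = random.sample(list(samples_by_theme), n_themes)
    episode = []
    for theme in themes:
        picked = random.sample(samples_by_theme[theme], 2 * k)
        episode.append({"theme": theme,
                        "support": picked[:k],   # D_spt
                        "query": picked[k:]})    # D_qry
    return episode

# Hypothetical usage, where each sample is an (image_sequence, story_text) pair:
# episode = sample_episode(dataset_by_theme, n_themes=5, k=5)
```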
S2: and respectively coding the story text and the image sequence in the vision story sample supporting the concentration for training into story characteristics and image sequence characteristics, and storing for later use.
In this embodiment, the step S2 may be implemented by the following steps:
S21: the story texts of all samples in the support set are passed through a word embedding layer, and story features S_spt = {s_1, …, s_K} are extracted with a text encoder based on gated recurrent units;
S22: image sequence features are extracted from all image sequences in the support set with a convolutional neural network and a visual semantic encoder, yielding an image sequence feature set in which each image sequence feature characterizes the semantic information of its image sequence.
For each image sequence A_i = {a_1, …, a_m} in the support set, where a_j denotes the j-th image and m is the length of the image sequence, the convolutional neural network extracts the feature f_j of each image a_j, giving the image feature set F_I = {f_1, …, f_m} corresponding to the image sequence; each feature f_j in F_I is fed in order into a visual semantic encoder based on gated recurrent units to obtain the time-sequential visual semantic features V = {v_1, …, v_m} of the image sequence, where v_j denotes the hidden state of the gated recurrent unit at step j when processing the support set, and the visual semantic feature v_m at the last step of the gated recurrent unit is taken as the image sequence feature characterizing the image sequence. The specific value of m can be adjusted as required; in this embodiment, m = 5.
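The two encoders of S21-S22 can be sketched as follows in PyTorch; this is a minimal illustration assuming a torchvision CNN backbone and single-layer GRU encoders, with the module names, feature sizes and choice of backbone being illustrative assumptions rather than part of the invention:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisualSemanticEncoder(nn.Module):
    """Per-image CNN features followed by a GRU over the sequence (S22)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = models.resnet152(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier
        self.proj = nn.Linear(backbone.fc.in_features, feat_dim)
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, images):                  # images: (m, 3, H, W)
        f = self.cnn(images).flatten(1)         # per-image features f_1..f_m
        v, _ = self.gru(self.proj(f).unsqueeze(0))
        return v.squeeze(0), v[0, -1]           # V = {v_1..v_m}, image sequence feature v_m

class StoryTextEncoder(nn.Module):
    """Word embedding + GRU; the last hidden state is the story feature s_i (S21)."""
    def __init__(self, vocab_size, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, token_ids):               # token_ids: (1, T)
        _, h = self.gru(self.embed(token_ids))
        return h.squeeze()                      # story feature s_i
```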
S3: and (4) extracting time sequence visual semantic features and image sequence features from the image sequence in the query stage, and calculating to obtain a prototype vector by combining the feature of supporting concentrated stories and the image sequence features in S2.
In this embodiment, the step S3 may be implemented by the following steps:
S31: for each sample in the query set, the same convolutional neural network and visual semantic encoder as in S2 are used to extract the time-sequential visual semantic features V_qry = {v'_1, …, v'_m} of the image sequence in the sample, where v'_j denotes the hidden state of the gated recurrent unit at step j when processing the query set, and the visual semantic feature v'_m at the last step of the gated recurrent unit is taken as the image sequence feature characterizing the image sequence;
S32: by means of an attention mechanism, a story prototype vector is further calculated by combining story features and image sequence features of the support set obtained in S2:
wherein,proto∈R dk1×a prototype vector is represented by a vector of a prototype,d k representing the degree of the feature, softmax (·) represents the softmax function,∈R dk1×is an image sequence feature of a single image sequence in the query set,∈R K dk×a set of image sequence features representing a support set, a superscript T representing a transpose,S spt ∈R K dk×a set of story features representing a corresponding sequence of images.
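A minimal sketch of this prototype computation is given below; it assumes the attention score is the dot product between the query image sequence feature and each support image sequence feature, which is consistent with the dimensions stated above, and the tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def story_prototype(v_query, v_support, s_support):
    """Prototype vector for one query sample (S32).

    v_query:   (d_k,)    image sequence feature v'_m of the query sample
    v_support: (K, d_k)  image sequence features of the support set
    s_support: (K, d_k)  story features S_spt of the support set
    """
    attn = F.softmax(v_support @ v_query, dim=0)   # (K,) similarity-based weights
    return s_support.t() @ attn                    # (d_k,) weighted mix of support stories
```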
S4: the combined features of the image sequence features and prototype vectors obtained in S3 are decoded into a story descriptive text by a story decoder with attention mechanism.
In this embodiment, the step S4 may be implemented by the following steps:
S41: The prototype vector from S32 is concatenated with the image sequence feature to initialize the hidden state h_0 of the gated recurrent unit of the story decoder:
h_0 = W_init · [v'_m; proto]
where h_0 denotes the initial hidden state of the gated recurrent unit, [;] denotes vector concatenation, W_init ∈ R^{d_k×2d_k} is a mapping matrix obtained by learning, v'_m ∈ R^{d_k×1} is the image sequence feature of a single image sequence in the query set from S31, and proto is the prototype vector obtained in S32.
The prototype vector introduced in the invention serves as a representation of the stories under a theme and captures elements common to the visual stories of the current theme, such as emotional tendency, narrative style and word preference. By concatenating the prototype vector with the image sequence feature to initialize the hidden state h_0 of the gated recurrent unit, the theme information captured by the prototype vector is carried through the decoding stage, thereby guiding the generation of the story description text.
S42: according to the hidden layer state of the last moment of the gate control circulation unith t-1 And words predicted at the previous timewPredicting the current timetHidden layer state ofh t :
h t =GRU(h t-1 ,E∙w t-1 )
Wherein GRU represents a gated loop unit that runs in a single step,Ethe matrix is embedded for the words and,w t-1 a one-hot vector representing a word predicted at a previous time;
s43: by means of attention, computingtVisual context characteristics at the moment:
wherein,c t to representtVisual context characteristics of the moment;
s43: by usingtTemporal visual context features and gated cyclic unit hidden state predictiontWord probability distribution at time:
wherein,p wt representing predictionstThe probability distribution of the words at the time of day,W proj ∈R dk dk2×andb proj ∈R dk1×respectively, mapping matrix and bias by learningAnd setting a coefficient.
Compared with predicting the word probability distribution at time t from the hidden state h_t of the gated recurrent unit alone, incorporating the visual context feature at time t allows the visual storytelling model to capture visual information better and to mitigate the information loss caused by the forgetting mechanism of the gated recurrent unit, thereby improving the correlation between the generated story description and the given image sequence.
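The decoding procedure of S41-S44 can be sketched as the following PyTorch module; it is an illustration under the same assumptions as above (dot-product visual attention, a learned projection over the concatenated hidden state and context, greedy word selection), and the layer names, sizes and the output projection to the vocabulary are illustrative choices, not a definitive implementation of the invention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StoryDecoder(nn.Module):
    """GRU story decoder with visual attention (S41-S44)."""
    def __init__(self, vocab_size, d_k=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_k)       # word embedding matrix E
        self.init_proj = nn.Linear(2 * d_k, d_k)         # W_init
        self.gru_cell = nn.GRUCell(d_k, d_k)
        self.out_proj = nn.Linear(2 * d_k, vocab_size)   # plays the role of W_proj, b_proj

    def forward(self, v_query, proto, visual_feats, max_len=30, bos_id=1):
        # S41: initialize the hidden state from [image sequence feature; prototype]
        h = self.init_proj(torch.cat([v_query, proto], dim=-1)).unsqueeze(0)
        word = torch.tensor([bos_id])
        logits_seq = []
        for _ in range(max_len):
            # S42: one GRU step on the embedding of the previously predicted word
            h = self.gru_cell(self.embed(word), h)
            # S43: dot-product attention over the visual semantic features V_qry
            alpha = F.softmax(visual_feats @ h.squeeze(0), dim=0)   # (m,)
            c = visual_feats.t() @ alpha                            # (d_k,)
            # S44: word distribution from [h_t; c_t]
            logits = self.out_proj(torch.cat([h.squeeze(0), c], dim=-1))
            logits_seq.append(logits)
            word = logits.argmax(dim=-1, keepdim=True)              # greedy decoding
        return torch.stack(logits_seq)
```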
S5: and optimizing initial parameters of the visual narration model by using the comprehensive loss of the visual narration model constructed by taking S2-S4 as a framework on a query set through a meta-learning method.
In this embodiment, the step S5 may be implemented by the following steps:
s51: and constructing a visual story narration model by taking the processes of the steps S2-S4 as a framework. For the sample in S11NEach of the themesAnd adjusting parameters by using a gradient descent method to obtain a set of model parameters which are adjusted according to the theme and correspond to each theme, wherein the gradient descent formula is as follows:
wherein,θ i 'indicating the initial parameter is iniNew parameters obtained after adjustment on the subjects,θthe initial parameters of the model are represented,f θ expressed in the initial parametersθThe model of the following model is shown,is as followsiA model loss calculated on each topic, the loss being obtained by calculating the cross entropy of the word distribution and the true distribution,αin order to update the learning rate of the parameters,representation of parametersθDerivation is carried out;
s52: by minimizingNThe comprehensive loss of each subject on the query set further optimizes the initial parameters of the modelθ:
Wherein, E [. C]The display of the user can be expected to be,in order to distribute the subject matter of all the subjects,representing topics sampled from all topics.
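One meta-training step of S51-S52 can be sketched as follows, using the first-order approximation of the meta-gradient and a single inner gradient step; `story_loss` stands for the cross-entropy loss of the encoder-decoder above and `episode` for the output of the sampling sketch, both of which are hypothetical helpers rather than the invention's exact training code:

```python
import copy
import torch

def meta_train_step(model, episode, story_loss, inner_lr=0.01, meta_lr=0.001):
    """First-order meta-update of the initial parameters over one batch of themes."""
    init_state = copy.deepcopy(model.state_dict())
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]

    for task in episode:
        model.load_state_dict(init_state)            # start every theme from θ
        # S51: inner adaptation on the support set -> θ'_i
        loss = story_loss(model, task["support"])
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= inner_lr * p.grad
        # S52: loss of the adapted model on the query set
        loss = story_loss(model, task["query"])
        model.zero_grad()
        loss.backward()
        for g, p in zip(meta_grads, model.parameters()):
            if p.grad is not None:
                g += p.grad / len(episode)

    model.load_state_dict(init_state)                 # meta-update applied to θ itself
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g
```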
S6: in the guessing stage, few samples are learned according to the support set of the new theme to adjust parameters of the visual narration model, and then the visual narration model with the adjusted parameters is used for generating the narrative description text for the samples in the query set.
In this embodiment, the step S6 may be implemented by the following steps:
s61: in the model speculation stage, the gradient descent method in S51 is used to adjust parameters according to the support set of the new theme, so that the visual narration model parameters are quickly adapted to the new theme, and new parameters of the model after being adapted to the new theme are obtainedθ':
Wherein,θ'to adapt the model to the new parameters of the new topic,a new theme is shown for the speculation phase,as a subjectThe model loss obtained by the above calculation;
Here, with the adjusted parameters θ', the visual storytelling model adapts adequately to the current theme and can produce narrative descriptions with better relevance and expressiveness.
S62: using with new parametersθ'Visual story narration modelf θ' Story description text is generated for the sequence of images at the guessing stage.
In this step, therefore, the visual storytelling model adjusts its parameters with only a few samples and generates visual story descriptions that are more consistent with the theme, which reduces the model's dependence on the number of new-theme samples and makes the method better suited to practical application scenarios.
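The inference-time adaptation of S61-S62 can be sketched as follows; `story_loss` and `model.generate` are the hypothetical helpers introduced in the earlier sketches, and the single-step adaptation shown here is an illustrative default rather than a prescribed setting:

```python
import torch

def adapt_and_generate(model, new_theme_support, new_image_sequences,
                       story_loss, inner_lr=0.01, adapt_steps=1):
    """Few-sample adaptation on a new theme, then story generation (S61-S62)."""
    for _ in range(adapt_steps):
        loss = story_loss(model, new_theme_support)   # loss on the new theme's support set
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= inner_lr * p.grad            # θ' = θ - α·∇θ L on the support set
    return [model.generate(seq) for seq in new_image_sequences]
```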
To verify the effect of the invention, the method was tested on the VIST dataset. The 50 themes with the largest numbers of samples, 41807 story samples in total, were used for meta-training, and the 2031 samples of the remaining 19 themes were used for meta-testing. The test uses the automatic evaluation metrics BLEU and METEOR. On new themes with few samples, the visual stories generated by the method are of good quality and clearly superior to those of the existing fully supervised pre-training model; the test results are shown in the following table:
method of producing a composite material | BLEU | METEOR |
Fully supervised pre-training model | 6.3 | 29.0 |
The invention | 8.1 | 31.1 |
In an example of few-sample visual storytelling in this embodiment, given an image sequence containing 5 photos, Supervised is the story description generated by the supervised model, TAVS is the story description generated by the present invention, and Ground-Truth is the manual annotation. Having learned from only a small number of samples, the supervised model wrongly identifies the graduation ceremony as a parade and its language is rigid. The invention adjusts its parameters well, generates a description about the graduation ceremony that better matches the theme of the image sequence, and uses more flexible language.
In general, the present invention enables the model to fully capture the visual features and language styles of new themes by encoding a small number of training samples of the new themes into prototypes and providing them to the visual storytelling model as a reference during the inference stage, resulting in story descriptions with better relevance and expressiveness.
The above-described embodiments are merely preferred embodiments of the present invention and should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, technical solutions obtained by equivalent replacement or equivalent transformation fall within the protection scope of the invention.
Claims (10)
1. A few-sample visual story narration method based on theme adaptation and prototype coding, comprising the following steps:
S1: dividing a visual story data set according to themes, sampling a batch of themes in each training round, and dividing each theme into a support set and a query set;
S2: encoding the story texts and image sequences of the visual story samples in the support set used for training into story features and image sequence features, respectively, and storing them for later use;
S3: extracting time-sequential visual semantic features and image sequence features from the image sequences in the query set, and computing a prototype vector by combining them with the support-set story features and image sequence features from S2;
S4: decoding the combined features of the image sequence features and the prototype vector obtained in S3 into a story description text through a story decoder with an attention mechanism;
S5: optimizing, through a meta-learning method, the initial parameters of the visual storytelling model constructed with S2-S4 as its framework, using the model's composite loss on the query set;
S6: in the inference stage, performing few-sample learning on the support set of the new theme to adjust the parameters of the visual storytelling model, and then using the adjusted model to generate narrative description texts for the samples in the query set.
2. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 1, wherein the specific method of S1 is as follows:
S11: dividing the visual story data set by theme; in each training round, sampling N themes and drawing 2K visual story samples from each theme, of which K samples serve as the support set for few-sample training and the remaining K samples serve as the query set for verifying the effect of few-sample learning.
3. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 2, wherein the specific sub-steps of S2 are as follows:
S21: passing the story texts of all samples in the support set through a word embedding layer and extracting story features S_spt = {s_1, …, s_K} with a text encoder based on gated recurrent units;
S22: extracting image sequence features from all image sequences in the support set with a convolutional neural network and a visual semantic encoder to obtain an image sequence feature set, in which each image sequence feature characterizes the semantic information of its image sequence.
4. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 3, wherein in S22, for each image sequence A_i = {a_1, …, a_m} in the support set, where a_j denotes the j-th image and m is the length of the image sequence, the convolutional neural network extracts the feature f_j of each image a_j to obtain the image feature set F_I = {f_1, …, f_m} corresponding to the image sequence; each feature in F_I is fed in order into a visual semantic encoder based on gated recurrent units to obtain the time-sequential visual semantic features V = {v_1, …, v_m} of the image sequence, where v_j denotes the hidden state of the gated recurrent unit at step j when processing the support set; and the visual semantic feature v_m at the last step of the gated recurrent unit is taken as the image sequence feature characterizing the image sequence.
5. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 4, wherein the specific sub-steps of S3 are as follows:
S31: for each sample in the query set, extracting the time-sequential visual semantic features V_qry = {v'_1, …, v'_m} of the image sequence in the sample with the same convolutional neural network and visual semantic encoder as in S2, where v'_j denotes the hidden state of the gated recurrent unit at step j when processing the query set, and taking the visual semantic feature v'_m at the last step of the gated recurrent unit as the image sequence feature characterizing the image sequence;
S32: further computing a story prototype vector through an attention mechanism, incorporating the story features and image sequence features of the support set described in S2:
proto = S_spt^T · softmax(V_spt · v'_m)
where proto denotes the prototype vector, V_spt denotes the matrix formed by the image sequence features of the support set, and S_spt denotes the matrix formed by the corresponding story features.
6. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 5, wherein the specific sub-steps of S4 are as follows:
S41: concatenating the prototype vector with the image sequence feature to initialize the hidden state h_0 of the gated recurrent unit of the story decoder;
S42: predicting the hidden state h_t at the current time t from the hidden state h_{t-1} of the gated recurrent unit at the previous time and the word w predicted at the previous time;
S43: computing the visual context feature at time t through the attention mechanism:
c_t = Σ_{j=1}^{m} α_{t,j} · v'_j
where c_t denotes the visual context feature at time t and α_{t,j} denotes the attention weight assigned to the j-th visual semantic feature according to the current hidden state;
S44: predicting the word probability distribution at time t from the visual context feature at time t and the hidden state of the gated recurrent unit:
p_{w,t} = softmax(W_proj · [h_t; c_t] + b_proj)
where p_{w,t} denotes the predicted word probability distribution at time t, and W_proj ∈ R^{d_k×2d_k} and b_proj ∈ R^{d_k×1} are respectively a mapping matrix and a bias coefficient obtained by learning.
7. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 6, wherein the specific sub-steps of S5 are as follows:
S51: constructing the visual storytelling model with S2-S4 as its framework, and for each of the N themes T_i sampled in S11, adjusting the parameters by gradient descent to obtain a set of theme-adapted model parameters corresponding to each theme;
S52: further optimizing the initial parameters θ of the model by minimizing the composite loss of the N themes on the query set.
8. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 7, wherein in S51 the formula for computing the adjusted model parameters by gradient descent is:
θ'_i = θ - α · ∇_θ L_{T_i}(f_θ)
where θ'_i denotes the new parameters obtained after adjusting the initial parameters on the i-th theme, θ denotes the initial parameters of the model, f_θ denotes the model under the initial parameters θ, L_{T_i} is the model loss computed on the i-th theme, obtained as the cross entropy between the predicted word distribution and the true distribution, α is the learning rate of the parameter update, and ∇_θ denotes differentiation with respect to the parameters θ;
and in S52, the overall loss function used to further optimize the initial parameters θ of the model is:
min_θ E_{T_i ∼ p(T)} [ L_{T_i}(f_{θ'_i}) ]
where E[·] denotes the expectation, p(T) is the distribution over all themes, and T_i denotes a theme sampled from all themes.
9. The few-sample visual story narration method based on theme adaptation and prototype coding according to claim 8, wherein the specific sub-steps of S6 are as follows:
S61: in the model inference stage, adjusting the parameters on the support set of the new theme with the gradient descent method described in S51, so that the parameters of the visual storytelling model adapt quickly to the new theme, obtaining the new parameters θ' of the model after adaptation to the new theme;
S62: using the visual storytelling model f_θ' with the new parameters θ' to generate story description texts for the image sequences in the inference stage.
10. The method of claim 3, wherein the text encoder, the visual semantic encoder, and the story decoder are each a recurrent neural network based on gated recurrent units.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010857191.9A CN111708904A (en) | 2020-08-24 | 2020-08-24 | Few-sample visual story narration method based on theme adaptation and prototype coding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010857191.9A CN111708904A (en) | 2020-08-24 | 2020-08-24 | Few-sample visual story narration method based on theme adaptation and prototype coding |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111708904A true CN111708904A (en) | 2020-09-25 |
Family
ID=72547444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010857191.9A Pending CN111708904A (en) | 2020-08-24 | 2020-08-24 | Few-sample visual story narration method based on theme adaptation and prototype coding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111708904A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113377990A (en) * | 2021-06-09 | 2021-09-10 | 电子科技大学 | Video/picture-text cross-modal matching training method based on meta-self learning |
CN113515951A (en) * | 2021-07-19 | 2021-10-19 | 同济大学 | Story description generation method based on knowledge enhanced attention network and group-level semantics |
CN113762474A (en) * | 2021-08-26 | 2021-12-07 | 厦门大学 | Story ending generation method and storage medium for adaptive theme |
CN113779938A (en) * | 2021-08-13 | 2021-12-10 | 同济大学 | System and method for generating coherent stories based on vision and theme cooperative attention |
CN114419402A (en) * | 2022-03-29 | 2022-04-29 | 中国人民解放军国防科技大学 | Image story description generation method and device, computer equipment and storage medium |
CN114708473A (en) * | 2020-12-17 | 2022-07-05 | 复旦大学 | Data augmentation method, application and device for oracle identification of small sample |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555475A (en) * | 2019-08-29 | 2019-12-10 | 华南理工大学 | few-sample target detection method based on semantic information fusion |
CN111046979A (en) * | 2020-03-13 | 2020-04-21 | 成都晓多科技有限公司 | Method and system for discovering badcase based on small sample learning |
- 2020-08-24: application CN202010857191.9A (CN) filed; publication CN111708904A, status: active, Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555475A (en) * | 2019-08-29 | 2019-12-10 | 华南理工大学 | few-sample target detection method based on semantic information fusion |
CN111046979A (en) * | 2020-03-13 | 2020-04-21 | 成都晓多科技有限公司 | Method and system for discovering badcase based on small sample learning |
Non-Patent Citations (1)
Title |
---|
Jiacheng Li: "Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling", https://arxiv.org/abs/2008.04504 *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114708473A (en) * | 2020-12-17 | 2022-07-05 | 复旦大学 | Data augmentation method, application and device for oracle identification of small sample |
CN113377990A (en) * | 2021-06-09 | 2021-09-10 | 电子科技大学 | Video/picture-text cross-modal matching training method based on meta-self learning |
CN113377990B (en) * | 2021-06-09 | 2022-06-14 | 电子科技大学 | Video/picture-text cross-modal matching training method based on meta-self learning |
CN113515951A (en) * | 2021-07-19 | 2021-10-19 | 同济大学 | Story description generation method based on knowledge enhanced attention network and group-level semantics |
CN113515951B (en) * | 2021-07-19 | 2022-07-05 | 同济大学 | Story description generation method based on knowledge enhanced attention network and group-level semantics |
CN113779938A (en) * | 2021-08-13 | 2021-12-10 | 同济大学 | System and method for generating coherent stories based on vision and theme cooperative attention |
CN113779938B (en) * | 2021-08-13 | 2024-01-23 | 同济大学 | System and method for generating coherent stories based on visual and theme cooperative attention |
CN113762474A (en) * | 2021-08-26 | 2021-12-07 | 厦门大学 | Story ending generation method and storage medium for adaptive theme |
CN114419402A (en) * | 2022-03-29 | 2022-04-29 | 中国人民解放军国防科技大学 | Image story description generation method and device, computer equipment and storage medium |
CN114419402B (en) * | 2022-03-29 | 2023-08-18 | 中国人民解放军国防科技大学 | Image story description generation method, device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111708904A (en) | Few-sample visual story narration method based on theme adaptation and prototype coding | |
Keneshloo et al. | Deep reinforcement learning for sequence-to-sequence models | |
Fu et al. | Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts | |
US10664744B2 (en) | End-to-end memory networks | |
CN107979764B (en) | Video subtitle generating method based on semantic segmentation and multi-layer attention framework | |
Venugopalan et al. | Sequence to sequence-video to text | |
CN108416065B (en) | Hierarchical neural network-based image-sentence description generation system and method | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
CN109325112B (en) | A kind of across language sentiment analysis method and apparatus based on emoji | |
CN108628935B (en) | Question-answering method based on end-to-end memory network | |
CN108763493A (en) | A kind of recommendation method based on deep learning | |
CN108153864A (en) | Method based on neural network generation text snippet | |
CN107844469A (en) | The text method for simplifying of word-based vector query model | |
CN108986186A (en) | The method and system of text conversion video | |
CN109919221B (en) | Image description method based on bidirectional double-attention machine | |
Zhang et al. | Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition | |
Li et al. | UD_BBC: Named entity recognition in social network combined BERT-BiLSTM-CRF with active learning | |
CN111125333B (en) | Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism | |
CN111274790A (en) | Chapter-level event embedding method and device based on syntactic dependency graph | |
CN110347831A (en) | Based on the sensibility classification method from attention mechanism | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN114491258B (en) | Keyword recommendation system and method based on multi-mode content | |
CN108664465A (en) | One kind automatically generating text method and relevant apparatus | |
CN110032729A (en) | A kind of autoabstract generation method based on neural Turing machine | |
CN107679225A (en) | A kind of reply generation method based on keyword |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200925 |