CN108765383A - Video description method based on deep transfer learning - Google Patents
- Publication number
- CN108765383A CN108765383A CN201810465849.4A CN201810465849A CN108765383A CN 108765383 A CN108765383 A CN 108765383A CN 201810465849 A CN201810465849 A CN 201810465849A CN 108765383 A CN108765383 A CN 108765383A
- Authority
- CN
- China
- Prior art keywords
- semantic feature
- video
- frame
- input
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20164—Salient point detection; Corner detection
Abstract
The invention belongs to the technical field of video processing, and specifically discloses a video description method based on deep transfer learning. The method comprises the following steps: 1) represent the video as a vector by means of a convolutional neural network video representation model; 2) build an image semantic feature detection model using multiple-instance learning, so as to extract image-domain semantic features; 3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a frame-stream semantic feature detection model, so as to extract frame-stream semantic features and realize deep fusion of the image-domain and frame-stream-domain semantic features; 4) build a deep-transfer-learning video description framework to generate a natural language description of the video. The present invention deeply fuses the semantic features of different domains at the input end, thereby improving the accuracy of the generated video descriptions.
Description
Technical field
The invention belongs to the technical field of video processing, and specifically relates to a video description method based on deep transfer learning.
Background technology
Video presentation is to utilize natural language description video, is the emphasis of computer vision and natural language processing field
And difficult point, it is had broad application prospects in artificial intelligence field.
Video presentation is very different with iamge description, and video presentation is not only appreciated that the object in each frame, but also
It is appreciated that movement of the object between multiframe.Existing video presentation method mainly has following four classes:1) it will be examined in vision content
The word measured distributes to each sentence fragment, then goes to generate video presentation using predefined language template.Such methods
Highly dependent upon sentence template, the syntactic structure of the sentence of generation is relatively more fixed;2) study vision content is constituted with text sentence
The sentence of the probability distribution of joint space, generation has more flexible syntactic structure;3) it goes to train category using multi-instance learning
Property detector, then by one based on attribute detector output maximum entropy language model go generate video presentation;4) with volume
Centered on product neural network and Recognition with Recurrent Neural Network, by a simple linear transport unit, being dug from image with frame stream
The semantic feature dug combines, and generates video presentation.Preceding two classes method does not utilize semanteme during video presentation
Feature;Although two class methods consider semantic feature in input terminal afterwards, the semantic feature in not same area is not carried out
Depth integration.
Existing video presentation method descriptive semantics are not accurate enough, to improve the accuracy of description, therefore devise a kind of depth
Spend transfer learning video presentation model.
Summary of the invention
To solve the above problems, the present invention provides a video description method based on deep transfer learning.
The present invention adopts the following technical scheme: a video description method based on deep transfer learning, comprising the following steps:
1) represent the video as a vector by means of a convolutional neural network video representation model;
2) build an image semantic feature detection model using multiple-instance learning, so as to extract image-domain semantic features;
3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a frame-stream semantic feature detection model, so as to extract frame-stream semantic features and realize deep fusion of the image-domain and frame-stream-domain semantic features;
4) build the deep-transfer-learning video description framework and generate the natural language description of the video.
In step 1), the video representation task is completed with a convolutional neural network model. For a group of sampled frames of a video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as a single n-dimensional vector.
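As a rough illustration of step 1), the mean-pooling stage can be sketched as follows. This is a minimal sketch with synthetic features; the function name and toy dimensions are illustrative, not part of the patent, and the per-frame CNN forward pass is assumed to have already produced the feature rows.

```python
import numpy as np

def video_representation(frame_features):
    """Mean-pool per-frame CNN features into one video vector.

    frame_features: array of shape (num_frames, n), where each row is the
    output of the CNN's second fully connected layer for one sampled frame.
    Returns a single n-dimensional vector representing the video segment.
    """
    frame_features = np.asarray(frame_features, dtype=np.float64)
    return frame_features.mean(axis=0)

# Toy example: 3 sampled frames, 4-dimensional features.
feats = np.array([[1.0, 2.0, 3.0, 4.0],
                  [3.0, 2.0, 1.0, 0.0],
                  [2.0, 2.0, 2.0, 2.0]])
v = video_representation(feats)  # one 4-dimensional video vector
```

The pooled vector has the same dimension n as a single frame's feature, regardless of how many frames were sampled.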
In step 2), the image semantic feature detection model is built on a standard image description database using multiple-instance learning.
Specifically: for a semantic feature w_a, if w_a is present in the annotated text description of an image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag. Each bag is first input into the image semantic feature detection model, where a fully convolutional neural network divides the bag into multiple regions (instances). The probability that the bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
P(w_a | b_I) = 1 − ∏_{r_i ∈ b_I} (1 − p_i^{w_a})   (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network. The activation of that last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so an x × x feature map is obtained for each bag. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model trained on the image description dataset computes, for each individual sampled frame, the probability distribution over all semantic features; mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from images.
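The bag probability of formula (1) is a noisy-OR: the bag is predicted to contain the feature unless every region fails to contain it. A minimal sketch (the function names and logit values are illustrative; the real per-region scores come from the fully convolutional network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(region_logits):
    """Formula (1): P(w_a | b_I) = 1 - prod_i (1 - p_i).

    region_logits: per-region scores for one semantic feature w_a, as
    produced by the last convolutional layer; the sigmoid turns each
    score into p_i, the probability that region r_i contains w_a.
    """
    p_regions = sigmoid(np.asarray(region_logits, dtype=np.float64))
    return 1.0 - np.prod(1.0 - p_regions)
```

Note the monotonicity this gives: adding a region can only raise the bag probability, which matches the multiple-instance assumption that a single positive region makes the whole bag positive.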
In step 3), the domain formed by the image samples is called the source domain and the domain formed by the frame-stream samples is called the target domain. The final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic features y.
Specifically: during training, for each input x, besides the semantic features to be predicted, a domain label d is also predicted; d = 0 means x comes from the source domain and d = 1 means x comes from the target domain. The semantic feature detection model can be decomposed into three parts, which work as follows: first, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f; then a mapping G_y maps the feature vector f to the semantic features y, with parameter vector θ_y; finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
In the training stage, the frame-stream semantic feature detection model satisfies the following three conditions on its parameters: (1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain; (2) find the feature-mapping parameters θ_f such that the features S_f extracted by G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible yields domain-invariant features, which amounts to maximizing the loss of the domain classifier; (3) find the parameters θ_d of the domain classifier that minimize its loss. This uses the idea of adversarial networks. The three sets of parameters satisfying these conditions form a point (θ_f, θ_y, θ_d), called the saddle point. The whole training process can be expressed as formula (2):
E(θ_f, θ_y, θ_d) = Σ_{i: d_i=0} L_y^i(θ_f, θ_y) − λ Σ_i L_d^i(θ_f, θ_d)   (2)
where L_y(·,·) is the semantic feature prediction loss, L_d(·,·) is the domain classification loss, L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample, and the parameter λ balances the feature vectors of the two domains formed during training. The saddle point (θ_f, θ_y, θ_d) of formula (2), at which (θ_f, θ_y) minimize E while θ_d maximizes it, is searched for with the updates of formulas (3), (4), (5):
θ_f ← θ_f − μ (∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)   (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y   (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d   (5)
where μ is the learning rate. During backpropagation, the gradient taken from the next layer in formula (3) is multiplied by −λ before being passed to the previous layer; this part is the gradient reversal layer. The frame-stream semantic feature detection model is thus composed mainly of a feature extractor, a gradient reversal layer and a domain classifier. The feature extractor extracts the semantic features of the frame-stream domain, while the domain classifier combined with the gradient reversal layer fuses the image-domain and frame-stream-domain semantic features. After training is completed, the semantic feature predictor is used to predict the semantic features of samples from both the target domain and the source domain. Since S_f and T_f are domain-invariant feature vectors, the image-domain and frame-stream-domain semantic features obtained through them also retain the domain-invariant property; that is, the semantic features extracted on the two domains are deeply fused. The semantic features obtained with the frame-stream semantic feature detection model can therefore be used directly as input to the video description framework, and this fused semantic feature is denoted A_iv.
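The gradient reversal layer described above can be sketched as an identity in the forward pass whose backward pass flips and scales the gradient, so that minimizing the domain classifier's loss through it maximizes that loss with respect to the feature extractor. This is a simplified stand-alone illustration, not the model's actual training code; the class and method names are illustrative.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies the incoming gradient by
    -lambda in the backward pass, so the feature extractor upstream is
    trained to *maximize* the domain classifier's loss (the adversarial
    objective of formulas (2)-(3))."""

    def __init__(self, lam):
        self.lam = lam  # the balancing parameter lambda of formula (2)

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return x

    def backward(self, grad_from_domain_classifier):
        # Reverse and scale the gradient flowing back to the extractor.
        return -self.lam * grad_from_domain_classifier

grl = GradientReversal(lam=0.5)
f = np.array([1.0, 2.0])            # feature vector from G_f
g = np.array([4.0, -2.0])           # gradient of L_d w.r.t. f
reversed_g = grl.backward(g)        # what G_f actually receives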
In step 4), the workflow of the whole framework comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional neural network video representation model, and input it into the first layer of the recurrent neural network (a Long Short-Term Memory network, LSTM) only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the given video frames into individual images and input them sequentially into the frame-stream semantic feature detection model;
(4) regard the given video frames as a frame stream and input it in parallel into the frame-stream semantic feature detection model;
(5) obtain the fused semantic feature A_iv with the frame-stream semantic feature detection model, e.g. the vector representations of "Man" and "Person", and input A_iv into the second layer of the LSTM;
(6) input the English description of the given video word by word into the first layer of the LSTM; combined with the inputs of the previous four steps, the input words of the current and earlier time steps are used to predict the output word of the next time step, and the video description framework is trained in this way.
The model structure represented by the whole framework is described by formulas (6) and (7):
E(v, A_iv, S) = −log P(S | v, A_iv)   (6)
log P(S | v, A_iv) = Σ_{t=0}^{N_s−1} log P(w_t | v, A_iv, w_0, …, w_{t−1})   (7)
where v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation, and N_s is the number of words in the sentence. The final goal is to minimize the energy loss function, which preserves the contextual relations between the words of the sentence.
In the framework, the video v is input into the first-layer LSTM unit only at time t = −1; A_iv is then input into the second-layer LSTM unit as an additional input at every iteration, reinforcing the semantic information, as shown in formulas (8), (9), (10), iterating t from 0 to N_s − 1:
x_{−1} = f_1(T_v v) + A_iv   (8)
x_t = f_1(T_s w_t) + A_iv   (9)
h_t = f_2(x_t)   (10)
where T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively; D_e is the input dimension of the LSTM, D_v is the dimension of the video v, and D_w is the dimension of w_t; x_t and h_t are the input and output of the second-layer LSTM unit; f_1 and f_2 are the mapping functions of the first-layer and second-layer LSTM units.
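The input schedule of formulas (8) and (9) can be sketched as follows. In this sketch f_1 is replaced by a plain tanh stand-in (in the model it is the first LSTM layer's mapping), and all dimensions are toy values; only the schedule itself — video at t = −1, words afterwards, A_iv added at every step — follows the patent.

```python
import numpy as np

def caption_inputs(v, A_iv, word_embeddings, T_v, T_s):
    """Build the input sequence of formulas (8)-(9): the video vector is
    fed only at t = -1, and the fused semantic feature A_iv is added to
    every step's input as a constant reinforcement signal."""
    f_1 = np.tanh  # stand-in for the first LSTM layer's mapping
    xs = [f_1(T_v @ v) + A_iv]         # t = -1: video representation only
    for w_t in word_embeddings:        # t = 0 .. N_s - 1: sentence words
        xs.append(f_1(T_s @ w_t) + A_iv)
    return xs

# Toy dimensions: D_e = 3, D_v = 4, D_w = 2, two words in the sentence.
T_v = np.zeros((3, 4))
T_s = np.zeros((3, 2))
A_iv = np.array([0.5, 0.5, 0.5])
xs = caption_inputs(np.ones(4), A_iv, [np.ones(2), np.zeros(2)], T_v, T_s)
```

With zero transformation matrices every input collapses to A_iv, which makes the role of the additive semantic term easy to see: it is present at every time step, not just the first.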
Compared with the prior art, the present invention builds a new video description model. The model uses the deep domain adaptation method of transfer learning to deeply fuse the semantic features of different domains at the input end, thereby improving the accuracy of the generated video descriptions. Experiments on the MSVD video dataset verify the feasibility and validity of the invention, and show that deep domain adaptation achieves a better fusion of the semantic features of different domains, further improves the accuracy of video description, and improves the generalization ability of the network.
Description of the drawings
Fig. 1 is the convolutional neural network video representation model;
Fig. 2 is the image semantic feature detection model;
Fig. 3 is the frame-stream semantic feature detection model of the present invention;
Fig. 4 is the structure of the video description framework;
Fig. 5 shows partial results of the present invention on the test dataset.
Detailed description of the embodiments
The detailed embodiments of the present invention are described below.
A video description method based on deep transfer learning comprises the following steps.
1) Represent the video as a vector by means of a convolutional neural network video representation model; the concrete model structure is shown in Fig. 1.
In step 1), the video representation task is completed with a convolutional neural network model: for a group of sampled frames of a video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as a single n-dimensional vector.
2) Build an image semantic feature detection model using multiple-instance learning, so as to extract image-domain semantic features. The image semantic feature detection model is shown in Fig. 2.
Specifically: for a semantic feature w_a, if w_a is present in the annotated text description of an image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag. Each bag is first input into the image semantic feature detection model (as shown in Fig. 2), and the probability that the bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
P(w_a | b_I) = 1 − ∏_{r_i ∈ b_I} (1 − p_i^{w_a})   (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network. The activation of that last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so an x × x feature map is obtained for each bag. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model trained on the image description dataset computes, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from images.
3) Transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain the frame-stream semantic feature detection model, so as to extract frame-stream semantic features and realize deep fusion of the image-domain and frame-stream-domain semantic features. The frame-stream semantic feature detection model is shown in Fig. 3.
The domain formed by the image samples is called the source domain and the domain formed by the frame-stream samples is called the target domain. The final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic features y.
Specifically: during training, for each input x, besides the semantic features to be predicted, a domain label d is also predicted; d = 0 means x comes from the source domain and d = 1 means x comes from the target domain. The semantic feature detection model can be decomposed into three parts, which work as follows: first, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f; then a mapping G_y maps the feature vector f to the semantic features y, with parameter vector θ_y; finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
In the training stage, the semantic feature detection model satisfies the following three conditions on its parameters:
(1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain;
(2) find the feature-mapping parameters θ_f such that the features S_f extracted by G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible yields domain-invariant features, which amounts to maximizing the loss of the domain classifier;
(3) find the parameters θ_d of the domain classifier that minimize its loss. The three sets of parameters satisfying these conditions form a point (θ_f, θ_y, θ_d), called the saddle point, and the whole training process can be expressed as formula (2):
E(θ_f, θ_y, θ_d) = Σ_{i: d_i=0} L_y^i(θ_f, θ_y) − λ Σ_i L_d^i(θ_f, θ_d)   (2)
where L_y(·,·) is the semantic feature prediction loss, L_d(·,·) is the domain classification loss, L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample, and the parameter λ balances the feature vectors of the two domains formed during training. The saddle point (θ_f, θ_y, θ_d) of formula (2) is searched for with the updates of formulas (3), (4), (5):
θ_f ← θ_f − μ (∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)   (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y   (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d   (5)
where μ is the learning rate. During backpropagation, the gradient taken from the next layer in formula (3) is multiplied by −λ and passed to the previous layer; this part is the gradient reversal layer. The semantic feature detection model is composed mainly of a feature extractor, a gradient reversal layer and a domain classifier. After training is completed, the semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain. The semantic features obtained with the improved semantic feature detection model can be used directly as input to the video description framework, and this semantic feature is denoted A_iv.
4) Build the deep-transfer-learning video description framework and generate the natural language description of the video.
This comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional neural network video representation model, and input it into the first layer of the recurrent neural network only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the given video frames into individual images and input them sequentially into the frame-stream semantic feature detection model;
(4) regard the given video frames as a frame stream and input it in parallel into the frame-stream semantic feature detection model;
(5) obtain the fused semantic feature A_iv with the frame-stream semantic feature detection model, and input A_iv into the second layer of the recurrent neural network;
(6) input the English description of the given video word by word into the first layer of the recurrent neural network; combined with the inputs of the previous four steps, the input words of the current and earlier time steps are used to predict the output word of the next time step, and the video description framework is trained in this way. The structure of the video description framework is shown in Fig. 4.
The model structure represented by the whole framework is described by formulas (6) and (7):
E(v, A_iv, S) = −log P(S | v, A_iv)   (6)
log P(S | v, A_iv) = Σ_{t=0}^{N_s−1} log P(w_t | v, A_iv, w_0, …, w_{t−1})   (7)
where v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation, and N_s is the number of words in the sentence. The final goal is to minimize the energy loss function, preserving the contextual relations between the words of the sentence.
In the framework, the video v is input into the first-layer recurrent unit only at time t = −1; A_iv is then input into the second-layer recurrent unit as an additional input at every iteration, reinforcing the semantic information, as shown in formulas (8), (9), (10), iterating t from 0 to N_s − 1:
x_{−1} = f_1(T_v v) + A_iv   (8)
x_t = f_1(T_s w_t) + A_iv   (9)
h_t = f_2(x_t)   (10)
where T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively; D_e is the input dimension of the recurrent network, D_v is the dimension of the video v, and D_w is the dimension of w_t; x_t and h_t are the input and output of the second-layer recurrent unit; f_1 and f_2 are the mapping functions of the first-layer and second-layer recurrent units.
Experiments and analysis of results
Dataset:
To evaluate the video description model of the present invention, MSVD, the most popular YouTube video description dataset, is selected. MSVD contains 1970 video clips collected from YouTube, with about 40 available English descriptions per video. In the experiments, 1200 videos are used for training, 100 videos for validation and 670 videos for testing. In addition, the image dataset COCO is also used.
Evaluation metrics:
To quantitatively evaluate the proposed video description framework, three metrics common in video description tasks are adopted: BLEU@N (BiLingual Evaluation Understudy), METEOR and CIDEr-D (Consensus-based Image Description Evaluation). For BLEU@N, N is set to 3 and 4. All metrics are computed with the code released by the Microsoft COCO evaluation server. The results of the three metrics are percentages; higher scores indicate that the generated video description is closer to the reference descriptions.
Experimental setup:
The present invention uniformly samples 25 frames per video and represents each word of a sentence as a one-hot vector. For the video representation, VGG19 is pre-trained on the ImageNet ILSVRC12 dataset and the model of Fig. 1 is then fine-tuned on MSVD. To represent the fused semantic features extracted from the two domains, the 1000 most common words of the COCO image dataset and of the MSVD video dataset are selected as the annotated semantic features of the two domains[4], forming the training data of the two models of Fig. 2 and Fig. 3. The model of Fig. 2 is trained on the COCO training set first, and the model of Fig. 3 is then trained on the two training sets of COCO and MSVD, producing the final 1000-dimensional probability vector. In the LSTM, the dimensions of the input and of the hidden layer are both set to 1024. In the test phase, the beam search strategy is used with the model trained in Fig. 4 to generate new video sentence descriptions, with the beam size set to 4.
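The beam search decoding used at test time can be sketched on a toy next-word scorer. The scoring function here is a stand-in for the trained LSTM decoder (which would condition on v, A_iv and the prefix); only the search procedure itself is shown, and the beam size of 4 matches the setup above.

```python
def beam_search(step_logprobs, beam_size, length):
    """Keep the beam_size highest-scoring prefixes at each step and
    return the best complete sequence of the given length.

    step_logprobs(prefix) -> {word: logprob} is a stand-in for the
    decoder's next-word distribution given the prefix so far.
    """
    beams = [((), 0.0)]  # (prefix tuple, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for word, lp in step_logprobs(prefix).items():
                candidates.append((prefix + (word,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the beam width
    return beams[0][0]

# Toy decoder: "man" is always more likely than "dog".
toy_decoder = lambda prefix: {"man": -0.1, "dog": -2.0}
best = beam_search(toy_decoder, beam_size=4, length=2)
```

With beam size 1 this degenerates to greedy decoding; a wider beam lets a locally weaker word survive if it leads to a better-scoring full sentence.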
Quantitative analysis:
Table 1 compares, on the MSVD test set, the scores of the proposed video description model with those of seven existing models on each evaluation metric. Since simulation results differ across machine configurations, the data listed in the table were all obtained on the same machine.
Table 1 Score comparison of each model
Models 1-4 in the table use attention-based methods and introduce no semantic features; models 5 and 6 use the semantic features of a single domain only; model 7 uses the semantic features of the two domains with a simple linear fusion. Comparative analysis of the data in the table shows that the proposed video description model obtains higher scores on all four evaluation metrics. It follows that: 1) in a video description framework, using high-level semantic features enhances the visual representation and helps the model learn video description; 2) using only the semantic features of a single domain (image domain or frame-stream domain) brings no obvious improvement to video description performance; 3) a simple linear fusion of the semantic features of the two domains improves the individual metrics of video description but is still deficient and needs improvement; 4) the fused semantic features obtained with the deep domain adaptation method of transfer learning significantly improve video description performance, i.e. the present invention performs better at semantic feature fusion.
Qualitative analysis:
Fig. 5 shows partial results of the proposed video description model on the test set. The exemplary sampled frames in the figure are partial frames of each test video. These examples show that, compared with the well-performing LSTM-TSAIV model, the proposed video description framework generates more accurate English descriptions of the test videos.
Claims (7)
1. A video description method based on deep transfer learning, characterized by comprising the following steps:
1) representing the video as a vector by means of a convolutional neural network video representation model;
2) building an image semantic feature detection model using multiple-instance learning, so as to extract image-domain semantic features;
3) transferring the image semantic feature detection model of step 2) to the frame-stream domain to obtain a new semantic feature detection model, so as to extract frame-stream semantic features and realize deep fusion of the image-domain and frame-stream-domain semantic features;
4) building a deep-transfer-learning video description framework to generate a natural language description of the video.
2. The video description method based on deep transfer learning according to claim 1, characterized in that: in step 1), the video representation task is completed with a convolutional neural network model; for a group of sampled frames of a video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as a single n-dimensional vector.
3. The video description method based on deep transfer learning according to claim 2, characterized in that in step 2): for a semantic feature w_a, if w_a is present in the annotated text description of an image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag; each bag is first input into the image semantic feature detection model, and the probability that the bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
P(w_a | b_I) = 1 − ∏_{r_i ∈ b_I} (1 − p_i^{w_a})   (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network; the activation of that last convolutional layer has dimension x × x × h, h being the representation dimension of each region in the bag, so an x × x feature map is obtained for each bag; the model is then optimized with a cross-entropy loss layer; finally, the image semantic feature detection model trained on the image description dataset computes, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from images.
4. The video description method based on deep transfer learning according to claim 3, characterized in that in step 3): the domain formed by the image samples is called the source domain and the domain formed by the frame-stream samples is called the target domain, the final goal of the model being: for the distribution of the target domain, given an input x, predict its semantic features y;
specifically: during training, for each input x, besides the semantic features to be predicted, a domain label d is also predicted; d = 0 means x comes from the source domain and d = 1 means x comes from the target domain; the frame-stream semantic feature detection model can be decomposed into three parts, which work as follows: first, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f; then a mapping G_y maps the feature vector f to the semantic features y, with parameter vector θ_y; finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
5. The video description method based on deep transfer learning according to claim 4, characterized in that:
in the training process, the frame-stream semantic feature detection model satisfies the following three parameter conditions:
(1) parameters θy are found that minimize the loss of the semantic feature predictor on the source domain, ensuring that the frame-stream semantic feature detection model is undistorted on the source domain;
(2) feature-mapping parameters θf are found such that the features Sf extracted by the mapping Gf on the source domain are similar to the features Tf extracted on the target domain; the similarity of the distributions Sf and Tf is estimated through the loss of the domain classifier Gd, and domain-invariant features are obtained by making the two feature distributions as similar as possible, thereby maximizing the loss of the domain classifier;
(3) parameters θd of the domain classifier are found that minimize the loss of the domain classifier.
Three parameters meeting these requirements constitute a point (θf, θy, θd), referred to as a saddle point, and the entire training process can be expressed as formula (2):
E(θf, θy, θd) = Σ_{i=1,…,N; di=0} Ly^i(θf, θy) − λ Σ_{i=1,…,N} Ld^i(θf, θd) (2)
where Ly(·,·) is the loss of semantic feature prediction, Ld(·,·) is the loss of domain classification, Ly^i and Ld^i denote the corresponding loss functions evaluated on the i-th of the N training samples, and the parameter λ balances the feature vectors of the two domains formed during training; therefore, the saddle point (θf, θy, θd) can be solved from formula (2), and it is searched for using the updates shown in formulas (3), (4), (5):
θf ← θf − μ (∂Ly^i/∂θf − λ ∂Ld^i/∂θf) (3)
θy ← θy − μ ∂Ly^i/∂θy (4)
θd ← θd − μ ∂Ld^i/∂θd (5)
where μ is the learning rate; during backpropagation, the gradient taken from the next layer in formula (3) is multiplied by −λ and passed to the preceding layer, and this part is the gradient reversal layer; the semantic feature detection model comprises a feature extractor, a gradient reversal layer and a domain classifier, where the feature extractor extracts the semantic features of the frame-stream domain, and the domain classifier combined with the gradient reversal layer fuses the image-domain and frame-stream-domain semantic features; after training is completed, the frame-stream semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain, and the semantic features obtained by the frame-stream semantic feature detection model can be used directly as the input of the video description framework and are denoted Aiv.
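The update rules (3), (4), (5) and the gradient reversal layer can be illustrated with a minimal pure-Python sketch. The scalar parameters and gradient values below are illustrative assumptions, not part of the claimed model; in a real network the gradients come from backpropagation:

```python
mu, lam = 0.1, 0.5   # learning rate mu and balance parameter lambda

# Toy gradients of the two losses w.r.t. each parameter block for one sample i.
dLy_dtheta_f, dLd_dtheta_f = 0.8, 0.4
dLy_dtheta_y = 0.6
dLd_dtheta_d = 0.3

theta_f, theta_y, theta_d = 1.0, 1.0, 1.0

# Formula (3): the feature extractor descends on Ly but ASCENDS on Ld;
# the "- lam *" term is what the gradient reversal layer implements by
# flipping and scaling the domain-classifier gradient during backpropagation.
theta_f -= mu * (dLy_dtheta_f - lam * dLd_dtheta_f)
# Formula (4): the semantic predictor simply descends on its own loss Ly.
theta_y -= mu * dLy_dtheta_y
# Formula (5): the domain classifier descends on the domain loss Ld.
theta_d -= mu * dLd_dtheta_d

def gradient_reversal_backward(grad_from_next_layer, lam):
    """Backward pass of the gradient reversal layer: multiply by -lambda."""
    return -lam * grad_from_next_layer

print(round(theta_f, 4), round(theta_y, 4), round(theta_d, 4))
```

The forward pass of the gradient reversal layer is the identity; only its backward pass is non-trivial, which is why a single optimizer step can simultaneously minimize the domain loss in θd and maximize it through θf.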
6. The video description method based on deep transfer learning according to claim 5, characterized in that step 4) comprises the following steps:
(1) the vector representation v of the given video is obtained with the convolutional neural network video representation model and is input to the first layer of the recurrent neural network only at the initial time step;
(2) the image semantic feature detection model is trained on an image dataset;
(3) the given video frames are split into individual images and input sequentially into the frame-stream semantic feature detection model;
(4) the given video frames are treated as a frame stream and input in parallel into the frame-stream semantic feature detection model;
(5) the fused semantic feature Aiv is obtained with the frame-stream semantic feature detection model, and Aiv is input into the second layer of the recurrent neural network;
(6) the English description of the given video is input word by word into the first layer of the recurrent neural network; combining the inputs of the preceding four steps, the input words at the current and previous time steps are used to predict the output word at the next time step, and the video description framework is trained in this way.
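The input schedule implied by steps (1)–(6) can be summarized in a short Python sketch. The function and variable names are placeholders for the models described in the claims, and the stand-in layers below merely record the data flow; this is a sketch of the schedule, not the trained framework:

```python
def train_step(v, A_iv, words, rnn_layer1, rnn_layer2):
    """One pass through the description framework.

    v      -- CNN video representation, input to layer 1 only at t = -1
    A_iv   -- fused semantic feature, added to layer 2 at every step
    words  -- ground-truth English description, fed word by word
    """
    outputs = []
    rnn_layer1(v)                  # t = -1: the video vector only
    for w in words:
        h1 = rnn_layer1(w)         # current word into the first layer
        h2 = rnn_layer2(h1, A_iv)  # A_iv reinforced at every iteration
        outputs.append(h2)         # used to predict the next-time-step word
    return outputs

# Toy stand-in layers that just log their inputs to show the data flow.
log = []
layer1 = lambda x: log.append(("L1", x)) or x
layer2 = lambda h, a: log.append(("L2", h, a)) or (h, a)

out = train_step("v", "Aiv", ["a", "man", "runs"], layer1, layer2)
print(len(out), log[0])
```

The point of the schedule is that the video vector v enters only once, while the fused semantic feature Aiv is re-injected at every word step.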
7. The video description method based on deep transfer learning according to claim 6, characterized in that in step 4):
the model structure represented by the entire framework is described by formulas (6) and (7):
E(v, Aiv, S) = −log P(S | v, Aiv) (6)
log P(S | v, Aiv) = Σ_{t=0}^{Ns−1} log p(wt | v, Aiv, w0, …, wt−1) (7)
where v is the input video, Aiv is the fused semantic feature, S is the sentence description, E is the energy loss function, wt is the word representation, and Ns is the number of words in the sentence; the final objective is to minimize the energy loss function while preserving the contextual relationships between the words in the sentence;
in the framework, the video v is input to the first-layer recurrent neural network unit only at time t = −1, and Aiv is then input as an extra input to the second-layer recurrent neural network unit at each iteration so as to reinforce the semantic information, as shown in formulas (8), (9), (10), iterating t from 0 to Ns − 1:
x−1 = f1(Tv v) + Aiv (8)
xt = f1(Ts wt) + Aiv (9)
ht = f2(xt) (10)
where Tv ∈ R^(De×Dv) and Ts ∈ R^(De×Dw) are the transformation matrix of the video v and the transformation matrix of wt respectively, De is the dimension of the recurrent neural network input, Dv is the dimension of the video v, Dw is the dimension of wt, xt and ht are the input and output of the second-layer recurrent neural network unit respectively, and f1 and f2 are the mapping functions in the first-layer and the second-layer recurrent neural network units respectively.
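Formulas (8)–(10) can be checked dimensionally with a small numpy sketch. The concrete dimensions and the tanh stand-ins for the mapping functions f1 and f2 are assumptions for illustration, not the values used by the method:

```python
import numpy as np

rng = np.random.default_rng(1)
De, Dv, Dw = 512, 4096, 300  # assumed sizes of RNN input, video and word vectors

Tv = rng.standard_normal((De, Dv)) * 0.01  # Tv in R^(De x Dv)
Ts = rng.standard_normal((De, Dw)) * 0.01  # Ts in R^(De x Dw)
A_iv = rng.standard_normal(De)             # fused semantic feature

f1 = np.tanh  # stand-in for the first-layer mapping function
f2 = np.tanh  # stand-in for the second-layer mapping function

v = rng.standard_normal(Dv)  # video representation
w = rng.standard_normal(Dw)  # one word embedding w_t

x_init = f1(Tv @ v) + A_iv   # formula (8): t = -1, video plus semantics
x_t = f1(Ts @ w) + A_iv      # formula (9): word plus semantics at step t
h_t = f2(x_t)                # formula (10): second-layer output

print(x_init.shape, x_t.shape, h_t.shape)
```

The addition in (8) and (9) requires Aiv ∈ R^De, i.e. the fused semantic feature must match the recurrent input dimension, which is why Tv and Ts both project into R^De.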
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2018102507521 | 2018-03-22 | ||
CN201810250752 | 2018-03-22 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108765383A true CN108765383A (en) | 2018-11-06 |
CN108765383B CN108765383B (en) | 2022-03-18 |
Family
ID=64008024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810465849.4A Active CN108765383B (en) | 2018-03-22 | 2018-05-15 | Video description method based on deep migration learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108765383B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130282747A1 (en) * | 2012-04-23 | 2013-10-24 | Sri International | Classification, search, and retrieval of complex video events |
CN104915400A (en) * | 2015-05-29 | 2015-09-16 | 山西大学 | Fuzzy correlation synchronized image retrieval method based on color histogram and non-subsampled contourlet transform (NSCT) |
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN105976401A (en) * | 2016-05-20 | 2016-09-28 | 河北工业职业技术学院 | Target tracking method and system based on partitioned multi-example learning algorithm |
CN106202256A (en) * | 2016-06-29 | 2016-12-07 | 西安电子科技大学 | Propagate based on semanteme and mix the Web graph of multi-instance learning as search method |
CN107038221A (en) * | 2017-03-22 | 2017-08-11 | 杭州电子科技大学 | A kind of video content description method guided based on semantic information |
Non-Patent Citations (6)
Title |
---|
GANIN Y et al.: "Unsupervised Domain Adaptation by Backpropagation", ICML '15: Proceedings of the 32nd International Conference on Machine Learning *
HASSAN ALAM et al.: "Multi-lingual author identification and linguistic feature extraction — A machine learning approach", 2013 IEEE International Conference on Technologies for Homeland Security (HST) *
Q YOU et al.: "Image Captioning with Semantic Attention", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
LIU Yupeng et al.: "Deep fusion of large-scale features in statistical machine translation", Journal of Zhejiang University *
HUI Kaifa et al.: "Research on video multi-concept detection based on multi-kernel attribute learning", Software Guide *
YI Wensheng: "Research on image semantic retrieval and classification technology", China Doctoral Dissertations Full-text Database (Information Science and Technology) *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111435453A (en) * | 2019-01-14 | 2020-07-21 | 中国科学技术大学 | Fine-grained image zero sample identification method |
CN111435453B (en) * | 2019-01-14 | 2022-07-22 | 中国科学技术大学 | Fine-grained image zero sample identification method |
CN111464881A (en) * | 2019-01-18 | 2020-07-28 | 复旦大学 | Full-convolution video description generation method based on self-optimization mechanism |
CN109919114A (en) * | 2019-03-14 | 2019-06-21 | 浙江大学 | One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution |
CN110084296A (en) * | 2019-04-22 | 2019-08-02 | 中山大学 | A kind of figure expression learning framework and its multi-tag classification method based on certain semantic |
CN110166850A (en) * | 2019-05-30 | 2019-08-23 | 上海交通大学 | The method and system of multiple CNN neural network forecast panoramic video viewing location |
CN110166850B (en) * | 2019-05-30 | 2020-11-06 | 上海交通大学 | Method and system for predicting panoramic video watching position by multiple CNN networks |
CN110363164A (en) * | 2019-07-18 | 2019-10-22 | 南京工业大学 | A kind of unified approach based on LSTM time consistency video analysis |
CN110909736A (en) * | 2019-11-12 | 2020-03-24 | 北京工业大学 | Image description method based on long-short term memory model and target detection algorithm |
CN111988673A (en) * | 2020-07-31 | 2020-11-24 | 清华大学 | Video description statement generation method and related equipment |
CN111988673B (en) * | 2020-07-31 | 2023-05-23 | 清华大学 | Method and related equipment for generating video description sentences |
CN113177478A (en) * | 2021-04-29 | 2021-07-27 | 西华大学 | Short video semantic annotation method based on transfer learning |
Also Published As
Publication number | Publication date |
---|---|
CN108765383B (en) | 2022-03-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108765383A (en) | Video description method based on deep transfer learning | |
Huang et al. | Facial expression recognition with grid-wise attention and visual transformer | |
CN110750959B (en) | Text information processing method, model training method and related device | |
Karpathy et al. | Deep visual-semantic alignments for generating image descriptions | |
CN105183720B (en) | Machine translation method and device based on RNN model | |
CN111444343B (en) | Cross-border national culture text classification method based on knowledge representation | |
CN107480132A (en) | A kind of classic poetry generation method of image content-based | |
CN108416065A (en) | Image based on level neural network-sentence description generates system and method | |
CN106202044A (en) | A kind of entity relation extraction method based on deep neural network | |
Ding et al. | Progressive multimodal interaction network for referring video object segmentation | |
Liu et al. | Video captioning with listwise supervision | |
Shen et al. | Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. | |
CN109947923A (en) | A kind of elementary mathematics topic type extraction method and system based on term vector | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
CN107391565A (en) | A kind of across language hierarchy taxonomic hierarchies matching process based on topic model | |
Saleem et al. | Stateful human-centered visual captioning system to aid video surveillance | |
Yang et al. | Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system | |
Sheng et al. | Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos | |
Wang et al. | RETRACTED ARTICLE: Human behaviour recognition and monitoring based on deep convolutional neural networks | |
CN106709277A (en) | Text-mining-based vector generating method of G-protein coupled receptor drug target molecules | |
Chen et al. | Multi-modal feature fusion based on variational autoencoder for visual question answering | |
Xu et al. | MMT: Mixed-Mask Transformer for Remote Sensing Image Semantic Segmentation | |
Mi et al. | Multiple Domain-Adversarial Ensemble Learning for Domain Generalization | |
Vecchi et al. | Transferring multiple text styles using CycleGAN with supervised style latent space | |
CN116721176B (en) | Text-to-face image generation method and device based on CLIP supervision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||