CN108765383A - Video description method based on deep transfer learning - Google Patents

Video description method based on deep transfer learning

Info

Publication number
CN108765383A
Authority
CN
China
Prior art keywords
semantic feature
video
frame
input
image
Prior art date
Legal status
Granted
Application number
CN201810465849.4A
Other languages
Chinese (zh)
Other versions
CN108765383B (en)
Inventor
张丽红
曹刘彬
Current Assignee
Shanxi University
Original Assignee
Shanxi University
Priority date
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Publication of CN108765383A
Application granted
Publication of CN108765383B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20164 Salient point detection; Corner detection

Abstract

The invention belongs to the technical field of video processing, and specifically relates to a video description method based on deep transfer learning. The method comprises the following steps: 1) represent the video as a vector with a convolutional-neural-network video representation model; 2) build an image semantic feature detection model with multiple-instance learning to extract image-domain semantic features; 3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a frame-stream semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features; 4) build a deep-transfer-learning video description framework and generate a natural-language description of the video. The present invention deeply fuses the semantic features of different domains at the input end, thereby improving the accuracy of the generated video descriptions.

Description

Video description method based on deep transfer learning
Technical field
The invention belongs to the technical field of video processing, and specifically relates to a video description method based on deep transfer learning.
Background technology
Video description is the task of describing a video with natural language. It is a focus and a difficulty of the computer vision and natural language processing fields, and has broad application prospects in artificial intelligence.
Video description differs substantially from image description: a video description model must not only understand the objects in each frame but also understand how the objects move across frames. Existing video description methods fall into four main classes: 1) assign the words detected in the visual content to sentence fragments and then generate the video description with predefined language templates; these methods depend heavily on the sentence template, so the syntactic structure of the generated sentences is relatively fixed; 2) learn a probability distribution over a joint space of visual content and textual sentences, which yields sentences with more flexible syntax; 3) train attribute detectors with multiple-instance learning and then generate the video description with a maximum-entropy language model driven by the outputs of the attribute detectors; 4) centre the model on a convolutional neural network and a recurrent neural network and combine the semantic features mined from images and from the frame stream through a simple linear transfer unit. The first two classes do not use semantic features during video description; the latter two consider semantic features at the input end, but they do not deeply fuse the semantic features of the different domains.
The semantics of the descriptions produced by existing video description methods are not accurate enough. To improve the accuracy of the descriptions, a deep-transfer-learning video description model is therefore designed.
Summary of the invention
To solve the above problems, the present invention provides a video description method based on deep transfer learning.
The present invention adopts the following technical scheme. A video description method based on deep transfer learning comprises the following steps:
1) represent the video as a vector with a convolutional-neural-network video representation model;
2) build an image semantic feature detection model with multiple-instance learning to extract image-domain semantic features;
3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a frame-stream semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features;
4) build the deep-transfer-learning video description framework and generate the natural-language description of the video.
In step 1), video representation is carried out with a convolutional neural network model. For a group of sampled frames of the video, each frame is fed into the convolutional neural network, the output of the second fully connected layer is extracted, and mean pooling is then performed over all sampled frames, so that the video segment is represented as a single n-dimensional vector.
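As an illustration only, a minimal sketch of this frame-pooling step is given below, assuming a PyTorch/torchvision VGG19 backbone (the backbone used in the experiments later) and taking the output of its second fully connected layer; the layer index and input size are assumptions, not part of the patent.

```python
import torch
import torchvision.models as models

class VideoRepresentation(torch.nn.Module):
    """Sketch of step 1): mean-pool the second fully connected layer's output
    over the sampled frames to obtain a single n-dimensional video vector."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=None)   # pretrained / fine-tuned weights would be loaded here
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        # keep the classifier up to and including the second fully connected layer
        self.fc = torch.nn.Sequential(*list(vgg.classifier.children())[:4])

    def forward(self, frames):             # frames: (num_frames, 3, 224, 224)
        x = self.features(frames)
        x = self.avgpool(x).flatten(1)
        x = self.fc(x)                     # (num_frames, 4096) second-FC outputs
        return x.mean(dim=0)               # mean pooling -> one 4096-dimensional vector v
```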
In step 2), the image semantic feature detection model is built with multiple-instance learning on a standard image captioning dataset.
Specifically:
for a semantic feature w_a, if w_a appears in the annotated text description of image I, image I is treated as a positive bag; otherwise image I is treated as a negative bag. Each bag is first fed into the image semantic feature detection model, where a fully convolutional neural network divides the bag into multiple regions (instances). The probability that bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
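Formula (1) is reproduced only as an image in the original publication. A plausible reconstruction, assuming the standard noisy-OR multiple-instance pooling over the regions r_i of bag b_I, is:

```latex
P\!\left(w_a \mid b_I\right) = 1 - \prod_{r_i \in b_I}\left(1 - p_{w_a}^{\,r_i}\right) \qquad (1)
```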
Here p_{w_a}^{r_i} is the probability of feature w_a predicted from region r_i; it is computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network. The output activation of the last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so each bag yields an x × x feature map. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model trained on the image captioning dataset is used to compute, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the distributions of all sampled frames gives the final representation of the semantic features learned from images.
In step 3), the domain formed by the image samples is called the source domain and the domain formed by the frame-stream samples is called the target domain. The final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic feature y.
Specifically:
during training, for each input x, in addition to the semantic feature to be predicted, a domain label d must also be predicted; d = 0 means x comes from the source domain, and d = 1 means x comes from the target domain. The semantic feature detection model can be decomposed into three parts and works as follows. First, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f. Then a mapping G_y maps the feature vector f to the semantic feature y, with parameter vector θ_y. Finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
In the training stage, the frame-stream semantic feature detection model must satisfy three requirements on its parameters: (1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, which ensures that the semantic feature detection model is not distorted on the source domain; (2) find the feature-mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible (i.e. obtaining domain-invariant features) amounts to maximizing the domain classifier loss; (3) find the domain classifier parameters θ_d that minimize the domain classifier loss. This uses the idea of adversarial networks. Parameters satisfying these three requirements form a point (θ_f, θ_y, θ_d) called a saddle point. The whole training process can be expressed as formula (2):
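Formula (2) is likewise an image in the original publication. A plausible reconstruction, assuming the domain-adversarial objective of Ganin et al. cited below (with N training samples, of which the source-domain samples carry semantic labels), is:

```latex
E(\theta_f,\theta_y,\theta_d)
  = \sum_{i=1,\ldots,N;\; d_i = 0} L_y^{i}\!\left(G_y(G_f(x_i;\theta_f);\theta_y),\, y_i\right)
  - \lambda \sum_{i=1,\ldots,N} L_d^{i}\!\left(G_d(G_f(x_i;\theta_f);\theta_d),\, d_i\right) \qquad (2)
```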
Here L_y(·,·) is the loss of semantic feature prediction and L_d(·,·) is the loss of domain classification; L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample; the parameter λ balances the feature vectors of the two domains formed during training. The saddle point (θ_f, θ_y, θ_d) can thus be solved from formula (2); it is searched for with the updates shown in formulas (3), (4) and (5):
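Formulas (3)-(5) are also images in the original publication. A plausible reconstruction, assuming the usual stochastic gradient updates with learning rate μ for this saddle-point search, is:

```latex
\theta_f \leftarrow \theta_f - \mu\left(\frac{\partial L_y^{i}}{\partial \theta_f} - \lambda \frac{\partial L_d^{i}}{\partial \theta_f}\right) \qquad (3)\\
\theta_y \leftarrow \theta_y - \mu \frac{\partial L_y^{i}}{\partial \theta_y} \qquad (4)\\
\theta_d \leftarrow \theta_d - \mu \frac{\partial L_d^{i}}{\partial \theta_d} \qquad (5)
```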
Here μ is the learning rate. During back-propagation, the gradient taken from the next layer in formula (3) is multiplied by -λ and passed to the previous layer; this part is the gradient reversal layer. The frame-stream semantic feature detection model therefore consists mainly of a feature extractor, a gradient reversal layer and a domain classifier. The feature extractor extracts the semantic features of the frame-stream domain, while the domain classifier combined with the gradient reversal layer fuses the image-domain and frame-stream-domain semantic features. After training, the semantic feature predictor predicts the semantic features of samples from both the target domain and the source domain. Since S_f and T_f are domain-invariant feature vectors, the image-domain and frame-stream-domain semantic features obtained through them also keep the domain-invariant property, i.e. the semantic features extracted in the two domains are deeply fused. The semantic features obtained with the frame-stream semantic feature detection model can therefore be used directly as the input of the video description framework; this semantic feature is denoted A_iv.
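As an illustration, a minimal sketch of such a gradient reversal layer is given below in PyTorch autograd terms; the factor lam corresponds to the λ of formula (3), and its value is an assumed hyper-parameter.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by -lambda
    in the backward pass, as the gradient reversal layer described above."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # the gradient taken from the next layer is multiplied by -lambda and passed back
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```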
In step 4), the workflow of the whole framework comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional-neural-network video representation model, and feed it into the first layer of the recurrent neural network (a Long Short-Term Memory network, LSTM) only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the frames of the given video into individual images and feed them sequentially into the frame-stream semantic feature detection model;
(4) treat the frames of the given video as a frame stream and feed them into the frame-stream semantic feature detection model in parallel;
(5) obtain the fused semantic feature A_iv, e.g. the vector representations of words such as "Man" and "Person", with the frame-stream semantic feature detection model, and feed A_iv into the second layer of the LSTM;
(6) feed the English description of the given video word by word into the first layer of the LSTM and, combined with the inputs of the preceding steps, predict the output word of the next time step from the input words of the current and previous time steps; the video description framework is trained in this way.
The model structure represented by the whole framework is described by formulas (6) and (7):
E(v, A_iv, S) = -log P(S | v, A_iv)  (6)
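Formula (7) is not reproduced in this text. Assuming the usual factorisation of the sentence log-probability over its words, it would read:

```latex
\log P(S \mid v, A_{iv}) = \sum_{t=0}^{N_s-1} \log P\!\left(w_t \mid v, A_{iv}, w_0, \ldots, w_{t-1}\right) \qquad (7)
```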
Here v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation and N_s is the number of words in the sentence. The final objective is to minimize the energy loss function while preserving the contextual relations between the words of the sentence.
In the framework, the video v is fed into the first-layer LSTM unit only at time t = -1; A_iv is then fed into the second-layer LSTM unit as an additional input at every iteration, which reinforces the semantic information, as shown in formulas (8), (9) and (10), where t iterates from 0 to N_s - 1:
x_{-1} = f_1(T_v v) + A_iv  (8)
x_t = f_1(T_s w_t) + A_iv  (9)
h_t = f_2(x_t)  (10)
Here T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively, D_e is the dimension of the LSTM input, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second-layer LSTM unit respectively, and f_1 and f_2 are the mapping functions of the first-layer and second-layer LSTM units respectively.
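As an illustration, a minimal sketch of this two-layer decoder is given below with PyTorch LSTM cells; the projection of A_iv to the LSTM input size, the vocabulary size and the default dimension of 1024 (taken from the experimental setup) are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class TwoLayerDescriptionDecoder(nn.Module):
    """Sketch of formulas (8)-(10): the video v enters layer 1 only at t = -1,
    and the fused semantic feature A_iv is added to the layer-2 input at every step."""
    def __init__(self, d_video, d_word, d_sem, d_e=1024, vocab_size=1000):
        super().__init__()
        self.T_v = nn.Linear(d_video, d_e, bias=False)    # T_v of formula (8)
        self.T_s = nn.Linear(d_word, d_e, bias=False)     # T_s of formula (9)
        self.lstm1 = nn.LSTMCell(d_e, d_e)                # f_1: first-layer LSTM unit
        self.lstm2 = nn.LSTMCell(d_e, d_e)                # f_2: second-layer LSTM unit
        self.sem_proj = nn.Linear(d_sem, d_e, bias=False) # maps A_iv to the LSTM input size
        self.out = nn.Linear(d_e, vocab_size)             # scores the next word

    def forward(self, v, a_iv, word_embeddings):
        # v: (B, d_video); a_iv: (B, d_sem); word_embeddings: (B, N_s, d_word)
        B = v.size(0)
        a = self.sem_proj(a_iv)
        h1 = torch.zeros(B, self.lstm1.hidden_size, device=v.device)
        c1, h2, c2 = torch.zeros_like(h1), torch.zeros_like(h1), torch.zeros_like(h1)
        # t = -1: x_{-1} = f_1(T_v v) + A_iv  (formula (8))
        h1, c1 = self.lstm1(self.T_v(v), (h1, c1))
        h2, c2 = self.lstm2(h1 + a, (h2, c2))
        logits = []
        for t in range(word_embeddings.size(1)):                              # t = 0 .. N_s - 1
            h1, c1 = self.lstm1(self.T_s(word_embeddings[:, t]), (h1, c1))    # formula (9)
            h2, c2 = self.lstm2(h1 + a, (h2, c2))                             # formula (10)
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)  # used to evaluate -log P(S | v, A_iv) of formula (6)
```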
Compared with the prior art, the present invention constructs a new video description model. The model uses the deep domain adaptation method from transfer learning to deeply fuse the semantic features of different domains at the input end, thereby improving the accuracy of the generated video descriptions. Experiments on the MSVD video dataset demonstrate the feasibility and effectiveness of the invention, and show that the deep domain adaptation method achieves a better fusion of the semantic features of different domains, further improves the accuracy of the video descriptions, and improves the generalization ability of the network.
Description of the drawings
Fig. 1 shows the convolutional-neural-network video representation model;
Fig. 2 shows the image semantic feature detection model;
Fig. 3 shows the frame-stream semantic feature detection model of the present invention;
Fig. 4 shows the structure of the video description framework;
Fig. 5 shows some results of the present invention on the test dataset.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below.
A video description method based on deep transfer learning comprises the following steps.
1) Represent the video as a vector with the convolutional-neural-network video representation model; the model structure is shown in Fig. 1.
In step 1), video representation is carried out with a convolutional neural network model. For a group of sampled frames of the video, each frame is fed into the convolutional neural network, the output of the second fully connected layer is extracted, and mean pooling is then performed over all sampled frames, so that the video segment is represented as a single n-dimensional vector.
2) Build the image semantic feature detection model with multiple-instance learning to extract image-domain semantic features. The image semantic feature detection model is shown in Fig. 2.
Specifically:
for a semantic feature w_a, if w_a appears in the annotated text description of image I, image I is treated as a positive bag; otherwise image I is treated as a negative bag. Each bag is first fed into the image semantic feature detection model (shown in Fig. 2), and the probability that bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
Here p_{w_a}^{r_i} is the probability of feature w_a predicted from region r_i; it is computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network. The output activation of the last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so each bag yields an x × x feature map. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model trained on the image captioning dataset is used to compute, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the distributions of all sampled frames gives the final representation of the semantic features learned from images.
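As an illustration, a minimal sketch of this multiple-instance detector head is given below, assuming a fully convolutional backbone whose last feature map has h channels over an x by x grid and assuming the noisy-OR bag pooling reconstructed as formula (1) above; the channel count and feature-vocabulary size are illustrative.

```python
import torch
import torch.nn as nn

class MILSemanticDetectorHead(nn.Module):
    """Sketch of the image semantic feature detection head: a sigmoid layer after the
    last convolutional layer gives per-region probabilities, and pooling over the
    regions of a bag gives the bag-level probability of each semantic feature."""
    def __init__(self, h=512, num_semantic_features=1000):
        super().__init__()
        # 1x1 convolution acting on the final x*x*h feature map of the FCN backbone
        self.region_scores = nn.Conv2d(h, num_semantic_features, kernel_size=1)

    def forward(self, feature_map):
        # feature_map: (B, h, x, x) output of the fully convolutional backbone
        p_region = torch.sigmoid(self.region_scores(feature_map))   # per-region probabilities
        p_region = p_region.flatten(2)                              # regions r_i of each bag
        # noisy-OR pooling: P(w_a | b_I) = 1 - prod_i (1 - p_i)
        p_bag = 1.0 - torch.prod(1.0 - p_region, dim=2)             # (B, num_semantic_features)
        return p_bag

# the model is optimized with a cross-entropy loss on the bag-level probabilities, e.g.
# loss = nn.functional.binary_cross_entropy(p_bag, bag_labels)
```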
3) Transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain the frame-stream semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features. The frame-stream semantic feature detection model is shown in Fig. 3.
The domain formed by the image samples is called the source domain and the domain formed by the frame-stream samples is called the target domain. The final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic feature y.
Specifically:
during training, for each input x, in addition to the semantic feature to be predicted, a domain label d must also be predicted; d = 0 means x comes from the source domain, and d = 1 means x comes from the target domain. The semantic feature detection model can be decomposed into three parts and works as follows. First, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f. Then a mapping G_y maps the feature vector f to the semantic feature y, with parameter vector θ_y. Finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
In the training stage, the semantic feature detection model must satisfy three requirements on its parameters:
(1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain;
(2) find the feature-mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible (i.e. obtaining domain-invariant features) amounts to maximizing the domain classifier loss;
(3) find the domain classifier parameters θ_d that minimize the domain classifier loss. Parameters satisfying these three requirements form a point (θ_f, θ_y, θ_d), called a saddle point, and the whole training process can be expressed as formula (2):
Here L_y(·,·) is the loss of semantic feature prediction and L_d(·,·) is the loss of domain classification; L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample; the parameter λ balances the feature vectors of the two domains formed during training. The saddle point (θ_f, θ_y, θ_d) can thus be solved from formula (2); it is searched for with the updates shown in formulas (3), (4) and (5):
Here μ is the learning rate. During back-propagation, the gradient taken from the next layer in formula (3) is multiplied by -λ and passed to the previous layer; this part is the gradient reversal layer. The semantic feature detection model consists mainly of a feature extractor, a gradient reversal layer and a domain classifier. After training, the semantic feature predictor predicts the semantic features of samples from both the target domain and the source domain. The semantic features obtained with the improved semantic feature detection model can be used directly as the input of the video description framework; this semantic feature is denoted A_iv.
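As an illustration, a minimal sketch of this arrangement of feature extractor, gradient reversal layer and domain classifier is given below, reusing the grad_reverse function sketched in the summary above; the fully connected heads and their sizes are assumptions.

```python
import torch
import torch.nn as nn

class FrameStreamSemanticDetector(nn.Module):
    """Sketch of the frame-stream semantic feature detection model: feature extractor G_f,
    semantic feature predictor G_y, and domain classifier G_d behind the gradient
    reversal layer, so that training makes the features of the two domains similar."""
    def __init__(self, backbone, d_feature, num_semantic_features=1000, grl_lambda=1.0):
        super().__init__()
        self.backbone = backbone                                           # G_f: x -> f in R^D
        self.semantic_head = nn.Linear(d_feature, num_semantic_features)   # G_y
        self.domain_head = nn.Linear(d_feature, 2)                         # G_d: source vs target
        self.grl_lambda = grl_lambda

    def forward(self, x):
        f = self.backbone(x)                                    # shared feature vector f
        y = torch.sigmoid(self.semantic_head(f))                # semantic feature prediction
        # grad_reverse: the gradient reversal layer sketched earlier
        d = self.domain_head(grad_reverse(f, self.grl_lambda))  # domain label prediction
        return y, d
```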
4) Build the deep-transfer-learning video description framework and generate the natural-language description of the video.
This comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional-neural-network video representation model, and feed it into the first layer of the recurrent neural network only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the frames of the given video into individual images and feed them sequentially into the frame-stream semantic feature detection model;
(4) treat the frames of the given video as a frame stream and feed them into the frame-stream semantic feature detection model in parallel;
(5) obtain the fused semantic feature A_iv with the frame-stream semantic feature detection model and feed A_iv into the second layer of the recurrent neural network;
(6) feed the English description of the given video word by word into the first layer of the recurrent neural network and, combined with the inputs of the preceding steps, predict the output word of the next time step from the input words of the current and previous time steps; the video description framework is trained in this way. The structure of the video description framework is shown in Fig. 4.
The model structure represented by the whole framework is described by formulas (6) and (7):
E(v, A_iv, S) = -log P(S | v, A_iv)  (6)
Here v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation and N_s is the number of words in the sentence. The final objective is to minimize the energy loss function while preserving the contextual relations between the words of the sentence.
In the framework, the video v is fed into the first-layer recurrent neural network unit only at time t = -1; A_iv is then fed into the second-layer recurrent neural network unit as an additional input at every iteration, which reinforces the semantic information, as shown in formulas (8), (9) and (10), where t iterates from 0 to N_s - 1:
x_{-1} = f_1(T_v v) + A_iv  (8)
x_t = f_1(T_s w_t) + A_iv  (9)
h_t = f_2(x_t)  (10)
Here T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively, D_e is the dimension of the recurrent neural network input, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second-layer recurrent neural network unit respectively, and f_1 and f_2 are the mapping functions of the first-layer and second-layer recurrent neural network units respectively.
Experiments and analysis of results
Dataset:
To evaluate the video description model of the present invention, MSVD, the most popular video description dataset collected from YouTube, is selected. MSVD contains 1970 video clips collected from YouTube, each with about 40 available English descriptions. In the experiments, 1200 videos are used for training, 100 for validation and 670 for testing. In addition, the image dataset COCO is also used.
Evaluation metrics:
To quantitatively evaluate the proposed video description framework, three metrics commonly used in video description tasks are adopted: BLEU@N (BiLingual Evaluation Understudy), METEOR and CIDEr-D (Consensus-based Image Description Evaluation). For BLEU@N, N is set to 3 and 4. All metrics are computed with the code released by the Microsoft COCO evaluation server. The results of the three metrics are percentages; higher scores indicate that the generated video descriptions are closer to the reference descriptions.
Experimental setup:
In the present invention, 25 frames are uniformly sampled from each video, and each word in a sentence is represented as a one-hot vector. For the video representation, VGG19 is pre-trained on the ImageNet ILSVRC12 dataset and the model shown in Fig. 1 is then fine-tuned on MSVD. To represent the fused semantic features extracted from the two domains, the 1000 most common words are selected from the COCO image dataset and from the MSVD video dataset respectively as the annotated semantic features of the two domains [4], and serve as the training data of the two models of Fig. 2 and Fig. 3. The model of Fig. 2 is first trained on the COCO training set, and the model of Fig. 3 is then trained on the two training sets of COCO and MSVD, producing final 1000-dimensional probability vectors. In the LSTM, the dimensions of the input and of the hidden layer are both set to 1024. In the test phase, a beam search strategy is used with the model trained as in Fig. 4 to generate new video sentence descriptions, with the beam size set to 4.
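As an illustration only, a minimal sketch of this preprocessing is given below, assuming OpenCV for the uniform 25-frame sampling and a vocabulary built from the 1000 most common annotation words; function names and paths are illustrative.

```python
import cv2
import numpy as np
from collections import Counter

def sample_frames(video_path, num_frames=25):
    """Uniformly sample num_frames frames from one video, as in the experimental setup."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def build_vocab(sentences, size=1000):
    """Keep the most common words as the annotated semantic features / one-hot vocabulary."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(size))}

def one_hot(word, vocab):
    """Represent a word as a one-hot vector over the vocabulary."""
    v = np.zeros(len(vocab), dtype=np.float32)
    if word in vocab:
        v[vocab[word]] = 1.0
    return v
```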
Quantitative analysis:
Table 1 compares, on the MSVD test set, the scores of the video description model proposed herein with those of seven existing models on each evaluation metric. Simulation results differ between machines of different configurations; the data listed in the table were obtained on the same machine for reference.
Table 1. Score comparison of each model
Models 1-4 in the table use attention-based methods and do not introduce semantic features; models 5 and 6 use the semantic features of a single domain only; model 7 uses the semantic features of both domains but fuses them with a simple linear fusion. Comparative analysis of the data in the table shows that the video description model proposed herein obtains higher scores on the four evaluation metrics. It follows that: 1) using high-level semantic features in the video description framework enhances the visual representation and helps the model learn video descriptions; 2) using the semantic features of a single domain only (image domain or frame-stream domain) does not noticeably improve video description performance; 3) a simple linear fusion of the semantic features of the two domains improves the individual video description metrics but is still deficient and needs improvement; 4) the fused semantic features obtained with the deep domain adaptation method from transfer learning significantly improve video description performance, i.e. the present invention performs better at semantic feature fusion.
Qualitative analysis:
Fig. 5 shows some results of the proposed video description model on the test dataset.
The sample frames in the figure are partial frames of each test video. These examples show that, compared with the well-performing LSTM-TSA_IV model, the video description framework proposed herein generates more accurate English descriptions of the test videos.

Claims (7)

1. A video description method based on deep transfer learning, characterized in that it comprises the following steps:
1) represent the video as a vector with a convolutional-neural-network video representation model;
2) build an image semantic feature detection model with multiple-instance learning to extract image-domain semantic features;
3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a new semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features;
4) build a deep-transfer-learning video description framework and generate a natural-language description of the video.
2. The video description method based on deep transfer learning according to claim 1, characterized in that: in step 1), video representation is carried out with a convolutional neural network model; for a group of sampled frames of the video, each frame is fed into the convolutional neural network, the output of the second fully connected layer is extracted, and mean pooling is then performed over all sampled frames, so that the video segment is represented as a single n-dimensional vector.
3. The video description method based on deep transfer learning according to claim 2, characterized in that in step 2):
specifically:
for a semantic feature w_a, if w_a appears in the annotated text description of image I, image I is treated as a positive bag; otherwise image I is treated as a negative bag; each bag is first fed into the image semantic feature detection model, and the probability that bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
where p_{w_a}^{r_i} is the probability of feature w_a predicted from region r_i and is computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network; the output activation of the last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so each bag yields an x × x feature map; the model is then optimized with a cross-entropy loss layer; finally, the image semantic feature detection model trained on the image captioning dataset is used to compute, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the distributions of all sampled frames gives the final representation of the semantic features learned from images.
4. The video description method based on deep transfer learning according to claim 3, characterized in that: in step 3), the domain formed by the image samples is called the source domain, the domain formed by the frame-stream samples is called the target domain, and the final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic feature y;
specifically:
during training, for each input x, in addition to the semantic feature to be predicted, a domain label d must also be predicted; if d = 0, x comes from the source domain; if d = 1, x comes from the target domain; the frame-stream semantic feature detection model can be decomposed into three parts and works as follows: first, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f; then a mapping G_y maps the feature vector f to the semantic feature y, with parameter vector θ_y; finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
5. The video description method based on deep transfer learning according to claim 4, characterized in that:
during training, the frame-stream semantic feature detection model must satisfy three requirements on its parameters:
(1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the frame-stream semantic feature detection model is not distorted on the source domain;
(2) find the feature-mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible (obtaining domain-invariant features) amounts to maximizing the domain classifier loss;
(3) find the domain classifier parameters θ_d that minimize the domain classifier loss; parameters satisfying these three requirements form a point (θ_f, θ_y, θ_d), called a saddle point, and the whole training process can be expressed as formula (2):
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample, and the parameter λ balances the feature vectors of the two domains formed during training; the saddle point (θ_f, θ_y, θ_d) can thus be solved from formula (2), and it is searched for with the updates shown in formulas (3), (4) and (5);
where μ is the learning rate; during back-propagation, the gradient taken from the next layer in formula (3) is multiplied by -λ and passed to the previous layer, and this part is the gradient reversal layer; the semantic feature detection model comprises a feature extractor, a gradient reversal layer and a domain classifier; the feature extractor extracts the semantic features of the frame-stream domain, and the domain classifier combined with the gradient reversal layer fuses the image-domain and frame-stream-domain semantic features; after training, the frame-stream semantic feature predictor predicts the semantic features of samples from both the target domain and the source domain; the semantic features obtained with the frame-stream semantic feature detection model can be used directly as the input of the video description framework, and this semantic feature is denoted A_iv.
6. The video description method based on deep transfer learning according to claim 5, characterized in that step 4) comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional-neural-network video representation model, and feed it into the first layer of the recurrent neural network only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the frames of the given video into individual images and feed them sequentially into the frame-stream semantic feature detection model;
(4) treat the frames of the given video as a frame stream and feed them into the frame-stream semantic feature detection model in parallel;
(5) obtain the fused semantic feature A_iv with the frame-stream semantic feature detection model and feed A_iv into the second layer of the recurrent neural network;
(6) feed the English description of the given video word by word into the first layer of the recurrent neural network and, combined with the inputs of the preceding steps, predict the output word of the next time step from the input words of the current and previous time steps; the video description framework is trained in this way.
7. The video description method based on deep transfer learning according to claim 6, characterized in that in step 4):
the model structure represented by the whole framework is described by formulas (6) and (7),
E(v, A_iv, S) = -log P(S | v, A_iv)  (6)
where v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation and N_s is the number of words in the sentence; the final objective is to minimize the energy loss function while preserving the contextual relations between the words of the sentence;
in the framework, the video v is fed into the first-layer recurrent neural network unit only at time t = -1, and A_iv is then fed into the second-layer recurrent neural network unit as an additional input at every iteration, which reinforces the semantic information, as shown in formulas (8), (9) and (10), where t iterates from 0 to N_s - 1:
x_{-1} = f_1(T_v v) + A_iv  (8)
x_t = f_1(T_s w_t) + A_iv  (9)
h_t = f_2(x_t)  (10)
where T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively, D_e is the dimension of the recurrent neural network input, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second-layer recurrent neural network unit respectively, and f_1 and f_2 are the mapping functions of the first-layer and second-layer recurrent neural network units respectively.
CN201810465849.4A 2018-03-22 2018-05-15 Video description method based on deep migration learning Active CN108765383B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018102507521 2018-03-22
CN201810250752 2018-03-22

Publications (2)

Publication Number Publication Date
CN108765383A true CN108765383A (en) 2018-11-06
CN108765383B CN108765383B (en) 2022-03-18

Family

ID=64008024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465849.4A Active CN108765383B (en) 2018-03-22 2018-05-15 Video description method based on deep migration learning

Country Status (1)

Country Link
CN (1) CN108765383B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282747A1 (en) * 2012-04-23 2013-10-24 Sri International Classification, search, and retrieval of complex video events
CN104915400A (en) * 2015-05-29 2015-09-16 山西大学 Fuzzy correlation synchronized image retrieval method based on color histogram and non-subsampled contourlet transform (NSCT)
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105976401A (en) * 2016-05-20 2016-09-28 河北工业职业技术学院 Target tracking method and system based on partitioned multi-example learning algorithm
CN106202256A (en) * 2016-06-29 2016-12-07 西安电子科技大学 Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GANIN Y et al.: "Unsupervised Domain Adaptation by Backpropagation", ICML '15: Proceedings of the 32nd International Conference on Machine Learning *
HASSAN ALAM et al.: "Multi-lingual author identification and linguistic feature extraction - A machine learning approach", 2013 IEEE International Conference on Technologies for Homeland Security (HST) *
Q YOU et al.: "Image Captioning with Semantic Attention", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
刘宇鹏 et al.: "Deep fusion of large-scale features in statistical machine translation", Journal of Zhejiang University *
惠开发 et al.: "Research on video multi-concept detection based on multi-kernel attribute learning", Software Guide *
易文晟: "Research on image semantic retrieval and classification techniques", China Doctoral Dissertations Full-text Database (Information Science and Technology) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435453A (en) * 2019-01-14 2020-07-21 中国科学技术大学 Fine-grained image zero sample identification method
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN111464881A (en) * 2019-01-18 2020-07-28 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110166850A (en) * 2019-05-30 2019-08-23 上海交通大学 The method and system of multiple CNN neural network forecast panoramic video viewing location
CN110166850B (en) * 2019-05-30 2020-11-06 上海交通大学 Method and system for predicting panoramic video watching position by multiple CNN networks
CN110363164A (en) * 2019-07-18 2019-10-22 南京工业大学 A kind of unified approach based on LSTM time consistency video analysis
CN110909736A (en) * 2019-11-12 2020-03-24 北京工业大学 Image description method based on long-short term memory model and target detection algorithm
CN111988673A (en) * 2020-07-31 2020-11-24 清华大学 Video description statement generation method and related equipment
CN111988673B (en) * 2020-07-31 2023-05-23 清华大学 Method and related equipment for generating video description sentences
CN113177478A (en) * 2021-04-29 2021-07-27 西华大学 Short video semantic annotation method based on transfer learning

Also Published As

Publication number Publication date
CN108765383B (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN108765383A (en) Video presentation method based on depth migration study
Huang et al. Facial expression recognition with grid-wise attention and visual transformer
CN110750959B (en) Text information processing method, model training method and related device
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
CN105183720B (en) Machine translation method and device based on RNN model
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
CN107480132A (en) A kind of classic poetry generation method of image content-based
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN106202044A (en) A kind of entity relation extraction method based on deep neural network
Ding et al. Progressive multimodal interaction network for referring video object segmentation
Liu et al. Video captioning with listwise supervision
Shen et al. Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description.
CN109947923A (en) A kind of elementary mathematics topic type extraction method and system based on term vector
CN111582506A (en) Multi-label learning method based on global and local label relation
CN107391565A (en) A kind of across language hierarchy taxonomic hierarchies matching process based on topic model
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Yang et al. Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system
Sheng et al. Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos
Wang et al. RETRACTED ARTICLE: Human behaviour recognition and monitoring based on deep convolutional neural networks
CN106709277A (en) Text-mining-based vector generating method of G-protein coupled receptor drug target molecules
Chen et al. Multi-modal feature fusion based on variational autoencoder for visual question answering
Xu et al. MMT: Mixed-Mask Transformer for Remote Sensing Image Semantic Segmentation
Mi et al. Multiple Domain-Adversarial Ensemble Learning for Domain Generalization
Vecchi et al. Transferring multiple text styles using CycleGAN with supervised style latent space
CN116721176B (en) Text-to-face image generation method and device based on CLIP supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant