CN108765383B - Video description method based on deep migration learning - Google Patents

Video description method based on deep migration learning Download PDF

Info

Publication number
CN108765383B
CN108765383B CN201810465849.4A CN201810465849A
Authority
CN
China
Prior art keywords
domain
video
semantic
image
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810465849.4A
Other languages
Chinese (zh)
Other versions
CN108765383A (en)
Inventor
张丽红
曹刘彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Publication of CN108765383A publication Critical patent/CN108765383A/en
Application granted granted Critical
Publication of CN108765383B publication Critical patent/CN108765383B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20164 Salient point detection; Corner detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of video processing, and particularly relates to a video description method based on deep migration learning. The method comprises the following steps: 1) representing a video as a vector through a convolutional neural network video representation model; 2) constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features; 3) transferring the image semantic feature detection model of step 2) into the frame stream domain to obtain a frame stream semantic feature detection model, so as to extract frame stream semantic features and realize deep fusion of the image domain and frame stream domain semantic features; 4) constructing a deep migration learning video description framework to generate a natural language description of the video. The invention deeply fuses the semantic features of different domains at the input end so as to improve the accuracy of the generated video descriptions.

Description

Video description method based on deep migration learning
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a video description method based on deep migration learning.
Background
Video description is the task of describing a video in natural language. It is a key and difficult problem in the fields of computer vision and natural language processing, and has broad application prospects in artificial intelligence.
Video description differs greatly from image description: not only the objects in each frame but also the motion of objects between frames must be understood. Existing video description methods fall mainly into four types: 1) words detected in the visual content are assigned to sentence fragments and a video description is then generated with a predefined language template; this approach depends heavily on the sentence templates, so the syntactic structure of the generated sentences is relatively fixed; 2) the probability distribution of a joint space formed by the visual content and text sentences is learned, and the generated sentences have more flexible syntactic structures; 3) an attribute detector is trained with multi-instance learning, and a video description is then generated by a maximum entropy language model based on the output of the attribute detector; 4) with a convolutional neural network and a recurrent neural network at the core, the semantic features mined from images and from the frame stream are integrated through a simple linear transfer unit to generate the video description. The first two types do not use semantic features in the video description process; the latter two consider semantic features at the input end, but do not deeply fuse the semantic features of different domains.
Conventional video description methods are not accurate enough in the semantics of the generated descriptions. To improve the description accuracy, a deep migration learning video description model is designed.
Disclosure of Invention
In order to solve the above problems, the present invention provides a video description method based on deep migration learning.
The invention adopts the following technical scheme: a video description method based on deep migration learning comprises the following steps,
1) representing the video into a vector form through a convolutional neural network video representation model;
2) constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features;
3) transferring the image semantic feature detection model in the step 2) into a frame stream domain to obtain a frame stream semantic feature detection model so as to extract frame stream semantic features and realize deep fusion of the image domain and the frame stream domain semantic features;
4) and constructing a deep migration learning video description framework to generate a video natural language description.
In the step 1), a convolutional neural network model is adopted to perform the video representation task: for a group of frames sampled from the video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as an n-dimensional vector.
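As an illustration of step 1), the following is a minimal sketch of the mean-pooled video representation described above. The choice of a VGG19 backbone and of its second fully connected layer as the frame feature are assumptions made for the sketch; the patent only specifies "a convolutional neural network model".

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNVideoRepresentation(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=None)          # pretrained weights could be loaded here
        self.features = vgg.features              # convolutional backbone
        self.avgpool = vgg.avgpool
        # keep the classifier only up to the second fully connected layer (fc2)
        self.fc_head = nn.Sequential(*list(vgg.classifier.children())[:5])

    def forward(self, frames):                    # frames: (num_frames, 3, 224, 224)
        x = self.features(frames)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        fc2 = self.fc_head(x)                     # (num_frames, 4096) fc2 activations
        return fc2.mean(dim=0)                    # mean pooling -> one n-dimensional video vector

# Usage sketch: v = CNNVideoRepresentation()(sampled_frames)  # frames sampled from one video
```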
In the step 2), multi-instance learning is adopted on a standard image description database to construct the image semantic feature detection model.
The method comprises the following specific steps:
For a semantic feature w_a, if w_a appears in the annotated text description of image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag. Each bag is first input into the image semantic feature detection model, where a fully convolutional neural network divides it into several regions; the probability that the semantic feature w_a occurs in bag b_I is then computed from the probabilities of all the regions (instances) in the bag, as shown in formula (1):
Pr(w_a | b_I) = 1 − ∏_{r_i∈b_I} (1 − p_i^{w_a})    (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional neural network. In addition, the activation of the last convolutional layer of the fully convolutional neural network has dimension x × x × h, where h is the representation dimension of each region in a bag, so an x × x feature map is obtained for each bag. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model is obtained by training on an image description dataset; a probability distribution over all semantic features is computed for each individual sampled frame, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from the images.
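For illustration, the following is a minimal sketch of the multi-instance learning bag probability of formula (1) and its cross-entropy training loss, assuming the noisy-OR combination of the per-region probabilities; the tensor shapes and function names are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

def bag_probabilities(region_logits):
    """region_logits: (num_regions, num_semantic_features) from the last conv layer.
    Returns the bag-level probability of each semantic feature, as in formula (1)."""
    p_region = torch.sigmoid(region_logits)             # p_i^{w_a} for every region and feature
    p_bag = 1.0 - torch.prod(1.0 - p_region, dim=0)     # noisy-OR over the regions in the bag
    return p_bag                                         # (num_semantic_features,)

def mil_loss(region_logits, bag_labels):
    """bag_labels: float tensor of 0/1 - 1 if the word appears in the image's
    annotation (positive bag), 0 otherwise (negative bag)."""
    p_bag = bag_probabilities(region_logits)
    return nn.functional.binary_cross_entropy(p_bag, bag_labels)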
In step 3), the domain formed by the image samples is called a source domain, the domain formed by the frame stream samples is called a target domain, and the final target of the model is: for the distribution of the target domain, given the input x, the semantic feature y can be predicted.
The method comprises the following specific steps:
in the training process, for each input x, a domain label d needs to be predicted besides the semantic features; if d is 0, x is from the source domain; if d is 1, x is from the target domain, and the semantic feature detection model can be decomposed into three parts, wherein the specific working process is as follows: first, by mapping GfMapping input x into a D-dimensional feature vector f ∈ RDThe mapped parameter vector is thetaf(ii) a Then, by mapping GyMapping the feature vector f into semantic feature y, and mapping the parameter vector thetay(ii) a Finally, by a mapping GdMapping the same feature vector f into a domain label d, wherein the mapped parameter vector is thetad
In the training stage, the frame stream semantic feature detection model must satisfy the following three conditions: (1) find parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain; (2) find feature mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are distributed similarly to the features T_f extracted on the target domain, where the similarity of the distributions S_f and T_f is estimated by computing the loss of the domain classifier G_d; domain-invariant features are obtained by making the two feature distributions as similar as possible, thereby maximizing the loss of the domain classifier; (3) find domain classifier parameters θ_d that minimize the loss of the domain classifier. The idea of an adversarial network is used here. The parameters satisfying these three requirements constitute a point (θ_f, θ_y, θ_d) called a saddle point. The whole training process can be expressed as formula (2):
E(θ_f, θ_y, θ_d) = Σ_{i=1,…,N; d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1,…,N} L_d^i(θ_f, θ_d)    (2)
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, and L_y^i and L_d^i denote the corresponding loss functions evaluated on the i-th training sample; the parameter λ is used to balance the feature vectors of the two domains formed during training. Thus, the saddle point (θ_f, θ_y, θ_d) can be solved through formula (2), and it is searched for with the update rules shown in formulas (3), (4) and (5):
θ_f ← θ_f − μ(∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)    (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y    (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d    (5)
where μ is the learning rate. During back propagation, the gradient in formula (3) taken from the next layer is multiplied by −λ before being passed to the previous layer; this part is the gradient reversal layer. The frame stream semantic feature detection model mainly comprises a feature extractor, a gradient reversal layer and a domain classifier. The feature extractor extracts the semantic features in the frame stream domain, while the domain classifier combined with the gradient reversal layer fuses the image domain and frame stream domain semantic features. After training is completed, the semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain. Since S_f and T_f are domain-invariant feature vectors, the semantic features of the image domain and the frame stream domain mapped from them also keep the domain-invariant property; that is, the semantic features extracted from the two domains are deeply fused. Therefore, the semantic features obtained by the frame stream semantic feature detection model can be used directly as the input of the video description framework, and are denoted A_iv.
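The gradient reversal layer described above can be sketched as follows, in the spirit of the cited "Unsupervised Domain Adaptation by Backpropagation": the forward pass is the identity and the backward pass multiplies the incoming gradient by −λ, so the feature extractor is pushed to produce domain-invariant features. The surrounding domain classifier module and its layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                      # identity in the forward direction

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # multiply the gradient by -lambda

class DomainClassifier(nn.Module):
    """Predicts the domain label d (0 = image/source domain, 1 = frame stream/target domain)."""
    def __init__(self, feat_dim, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, f):
        f = GradientReversal.apply(f, self.lam)  # reverse gradients flowing back into G_f
        return self.net(f)
```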
In the step 4), the workflow of the whole framework includes the following steps:
(1) obtaining a vector representation v of a given video by using a convolutional neural network video representation model, and inputting the vector representation v into a first layer of a recurrent neural network (LSTM) only at an initial moment;
(2) training an image semantic feature detection model on an image data set;
(3) splitting a given video frame into separate images, and sequentially inputting the images into a frame stream semantic feature detection model;
(4) taking a given video frame as a frame stream, and inputting the frame stream into a frame stream semantic feature detection model in parallel;
(5) obtaining the fused semantic features A_iv (vector representations of words such as "Man" and "Person") with the frame stream semantic feature detection model, and inputting A_iv into the second layer of the LSTM;
(6) inputting the English description of the given video into the first layer of the LSTM word by word, and predicting the output word at the next time step from the input words at the current and previous time steps, in combination with the inputs of the above four steps, so as to train the video description framework.
The model structure represented by the whole framework is described by equations (6) and (7),
E(v, A_iv, S) = −log P(S | v, A_iv)    (6)
log P(S | v, A_iv) = Σ_{t=0}^{N_s−1} log P(w_t | v, A_iv, w_0, …, w_{t−1})    (7)
where v is the input video, A_iv the fused semantic features, S the sentence description, E the energy loss function, w_t the representation of the t-th word, and N_s the number of words in the sentence; the ultimate goal is to minimize the energy loss function while preserving the contextual relationship between the words in the sentence.
In the framework, the video v is input into the first layer LSTM unit only at time t = −1; A_iv is then taken as an additional input and fed into the second layer LSTM unit at every iteration to reinforce the semantic information, as shown in formulas (8), (9) and (10), iterating for t from 0 to N_s − 1:
x_{−1} = f_1(T_v v) + A_iv    (8)
x_t = f_1(T_s w_t) + A_iv    (9)
h_t = f_2(x_t)    (10)
where T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of the word w_t respectively, D_e is the input dimension of the LSTM, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second layer LSTM unit respectively, and f_1 and f_2 are the mapping functions within the first and second layer LSTM units respectively.
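For illustration, a minimal sketch of the two-layer LSTM framework of formulas (8)-(10) follows. The hidden size of 1024 matches the experimental setup reported later; the vocabulary size, the word-prediction layer and the linear projection of A_iv to the LSTM input size are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size, d_v, d_w, d_e=1024, d_sem=1000):
        super().__init__()
        self.T_v = nn.Linear(d_v, d_e)            # transformation matrix T_v
        self.T_s = nn.Linear(d_w, d_e)            # transformation matrix T_s
        self.sem = nn.Linear(d_sem, d_e)          # project A_iv to the LSTM input size (assumption)
        self.lstm1 = nn.LSTMCell(d_e, d_e)        # first layer LSTM (f_1)
        self.lstm2 = nn.LSTMCell(d_e, d_e)        # second layer LSTM (f_2)
        self.out = nn.Linear(d_e, vocab_size)     # next-word prediction from h_t (assumption)

    def forward(self, v, a_iv, word_embeds):
        # v: (B, d_v); a_iv: (B, d_sem); word_embeds: (B, N_s, d_w), teacher forcing
        B = v.size(0)
        h1 = c1 = h2 = c2 = torch.zeros(B, self.lstm1.hidden_size, device=v.device)
        a = self.sem(a_iv)
        # t = -1: the video passes through the first layer; A_iv is added to its
        # output to form the second layer's input, as in formula (8)
        h1, c1 = self.lstm1(self.T_v(v), (h1, c1))
        h2, c2 = self.lstm2(h1 + a, (h2, c2))
        logits = []
        for t in range(word_embeds.size(1)):      # t = 0 .. N_s - 1, formulas (9)-(10)
            h1, c1 = self.lstm1(self.T_s(word_embeds[:, t]), (h1, c1))
            h2, c2 = self.lstm2(h1 + a, (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)         # (B, N_s, vocab_size) next-word scores
```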
Compared with the prior art, the invention constructs a new video description model. The model deeply fuses the semantic features of different domains at the input end by means of the deep domain adaptation method from transfer learning, so as to improve the accuracy of the generated video descriptions. Experiments on the MSVD video dataset verify the feasibility and effectiveness of the method, and show that the deep domain adaptation method achieves a better fusion of semantic features from different domains, further improving the accuracy of the video descriptions and the generalization ability of the network.
Drawings
FIG. 1 is a convolutional neural network video representation model;
FIG. 2 is an image semantic feature detection model;
FIG. 3 is a frame stream semantic feature detection model of the present invention;
FIG. 4 is a video description framework;
FIG. 5 is a partial result of the present invention on a test data set.
Detailed Description
The following describes in detail embodiments of the present invention.
A video description method based on deep migration learning comprises the following steps,
1) representing the video into a vector form through a convolutional neural network video representation model; the specific model structure is shown in fig. 1.
In the step 1), a convolutional neural network model is adopted to perform the video representation task: for a group of frames sampled from the video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as an n-dimensional vector.
2) Constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features. The image semantic feature detection model is shown in fig. 2.
The method comprises the following specific steps:
For a semantic feature w_a, if w_a appears in the annotated text description of image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag. Each bag is first input into the image semantic feature detection model (as shown in fig. 2), and the probability that the semantic feature w_a occurs in bag b_I is then computed from the probabilities of all the regions in the bag, as shown in formula (1):
Pr(w_a | b_I) = 1 − ∏_{r_i∈b_I} (1 − p_i^{w_a})    (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional neural network. In addition, the activation of the last convolutional layer of the fully convolutional neural network has dimension x × x × h, where h is the representation dimension of each region in a bag, so an x × x feature map is obtained for each bag. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model is obtained by training on an image description dataset; a probability distribution over all semantic features is computed for each individual sampled frame, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from the images.
3) Transferring the image semantic feature detection model in the step 2) into the frame stream domain to obtain a frame stream semantic feature detection model, so as to extract the frame stream semantic features and realize the deep fusion of the image domain and frame stream domain semantic features. The frame stream semantic feature detection model is shown in fig. 3.
The domain formed by the image samples is called the source domain, the domain formed by the frame stream samples is called the target domain, and the final target of the model is: for the distribution of the target domain, given an input x, the semantic features y can be predicted;
the method comprises the following specific steps:
in the training process, for each input x, a domain label d needs to be predicted besides the semantic features; if d is 0, x is from the source domain; if d is 1, x is from the target domain, and the semantic feature detection model can be decomposed into three parts, wherein the specific working process is as follows: first, by mapping GfMapping input x into a D-dimensional feature vector f ∈ RDThe mapped parameter vector is thetaf(ii) a Then, by mapping GyMapping the feature vector f into semantic feature y, and mapping the parameter vector thetay(ii) a Finally, by a mapping GdWill be identicalThe eigenvector f is mapped into a domain label d, and the mapped parameter vector is thetad
In the training phase, the semantic feature detection model must satisfy the following three conditions:
(1) find parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain;
(2) find feature mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are distributed similarly to the features T_f extracted on the target domain; the similarity of the distributions S_f and T_f is estimated by computing the loss of the domain classifier G_d, and domain-invariant features are obtained by making the two feature distributions as similar as possible, thereby maximizing the loss of the domain classifier;
(3) find domain classifier parameters θ_d that minimize the loss of the domain classifier; the parameters satisfying these three requirements constitute a point (θ_f, θ_y, θ_d) called a saddle point, and the whole training process can be expressed as formula (2):
E(θ_f, θ_y, θ_d) = Σ_{i=1,…,N; d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1,…,N} L_d^i(θ_f, θ_d)    (2)
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, and L_y^i and L_d^i denote the corresponding loss functions evaluated on the i-th training sample; the parameter λ is used to balance the feature vectors of the two domains formed during training. Thus, the saddle point (θ_f, θ_y, θ_d) can be solved through formula (2), and it is searched for with the update rules shown in formulas (3), (4) and (5):
θ_f ← θ_f − μ(∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)    (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y    (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d    (5)
where μ is the learning rate. During back propagation, the gradient in formula (3) taken from the next layer is multiplied by −λ before being passed to the previous layer; this part is the gradient reversal layer. The semantic feature detection model mainly comprises a feature extractor, a gradient reversal layer and a domain classifier. After training is completed, the semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain; the semantic features obtained by the resulting frame stream semantic feature detection model can be used directly as the input of the video description framework, and are denoted A_iv.
4) Constructing a deep migration learning video description framework to generate a natural language description of the video.
The method comprises the following steps:
(1) obtaining a vector representation v of a given video by using a convolutional neural network video representation model, and inputting the vector representation v into a first layer of a recurrent neural network only at an initial moment;
(2) training an image semantic feature detection model on an image data set;
(3) splitting a given video frame into separate images, and sequentially inputting the images into a frame stream semantic feature detection model;
(4) taking a given video frame as a frame stream, and inputting the frame stream into a frame stream semantic feature detection model in parallel;
(5) obtaining the fused semantic features A_iv with the frame stream semantic feature detection model, and inputting A_iv into the second layer of the recurrent neural network;
(6) inputting the English description of the given video into the first layer of the recurrent neural network word by word, and predicting the output word at the next time step from the input words at the current and previous time steps, in combination with the inputs of the above four steps, so as to train the video description framework. The video description framework structure is shown in fig. 4.
The model structure represented by the whole framework is described by equations (6) and (7),
E(v, A_iv, S) = −log P(S | v, A_iv)    (6)
log P(S | v, A_iv) = Σ_{t=0}^{N_s−1} log P(w_t | v, A_iv, w_0, …, w_{t−1})    (7)
where v is the input video, A_iv the fused semantic features, S the sentence description, E the energy loss function, w_t the representation of the t-th word, and N_s the number of words in the sentence; the ultimate goal is to minimize the energy loss function while preserving the contextual relationship between the words in the sentence.
in the framework, the video v is input into the first layer recurrent neural network unit only at the time t-1, and then A is inputivAs an additional input, the semantic information is enhanced by inputting the semantic information into the second layer recurrent neural network unit in each iteration, as shown in formulas (8), (9) and (10), t is from 0 to Ns-1 performing an iteration:
x-1=f1(Tvv)+Aiv (8)
xt=f1(Tswt)+Aiv (9)
ht=f2(xt) (10)
wherein the content of the first and second substances,
Figure RE-GDA0001738132390000072
and
Figure RE-GDA0001738132390000073
respectively, the transformation matrix and w of the video vtOf the transformation matrix, DeIs the dimension of the recurrent neural network input, DvIs the dimension of the video v, DwIs wtDimension of (2), xtAnd htInput and output of the second layer recurrent neural network unit, respectively, f1And f2Are the mapping functions within the first and second layers of recurrent neural network elements, respectively.
Experiment and analysis of results
Data set:
in order to evaluate the video description model of the present invention, the most popular video description data set MSVD on YouTube was selected. MSVD contains 1970 video clips collected from YouTube. There are about 40 english descriptions available per video. In the experiment, 1200 videos were used for training, 100 videos were used for validation, and 670 videos were used for testing. Furthermore, an image data set COCO is used.
Evaluation indexes are as follows:
in order to quantitatively evaluate the proposed video description framework, three indexes commonly used in the video description task are adopted in the text: BLEU @ N (BiLingulal Evaluation understudy), METEOR, and CIDER-D (Consensus-based Image Description Evaluation). For the BLEU @ N index, N is 3 and 4. All metrics were calculated using the code published by the microsoft Coco rating service. The calculation results of the three indexes are all percentages, and the higher the score is, the closer the generated video description is to the reference description is.
Experimental setup:
the present invention uniformly samples 25 frames per video and represents each word in the sentence as a "one-hot" vector; for video representation, VGG19 was pre-trained on the Imagenet ILSVRC12 dataset, and then the model shown in fig. 1 was fine-tuned on MSVD; in order to represent the fused semantic features extracted from the two domains, 1000 most commonly used words are selected on a COCO image data set and an MSVD video data set respectively as the labeled semantic features of the two domains[4]I.e. training data sets for both models of fig. 2 and 3. Firstly, training the model of FIG. 2 on a COCO training set, and then training the model of FIG. 3 on two training sets of COCO and MSVD to generate a final 1000-dimensional probability vector; in LSTM, the dimensions of the input and hidden layers are set to 1024. In the testing phase, the Beam Search strategy is adopted, and the training in FIG. 4 is utilizedThe refined model generates a new video sentence description and sets the beam size to 4.
Quantitative analysis:
table 1 shows the comparison of the scores of the video description model proposed herein and the existing seven models on each evaluation index on the MSVD test data set. The simulation results obtained by machines with different configurations are different, and the data listed in the table are all referred to the same machine.
Table 1  Score comparison of each model
Models 1-4 in the table use attention-based methods and do not introduce semantic features; models 5 and 6 use the semantic features of only a single domain; model 7 uses the semantic features of both domains but fuses them only through simple linear fusion. Analysing the data in the table shows that the video description model proposed herein obtains higher scores on all four evaluation indexes. It follows that: 1) in a video description framework, using high-level semantic features enhances the visual representation and helps the model learn video description; 2) using the semantic features of a single domain (image domain or frame stream domain) alone does not obviously improve video description performance; 3) simple linear fusion of the semantic features of the two domains improves the various indexes of video description, but still has shortcomings and needs improvement; 4) the fused semantic features obtained by the deep domain adaptation method from transfer learning markedly improve video description performance, i.e., the method is more effective at fusing semantic features.
Qualitative analysis:
fig. 5 shows partial results of the video description model proposed herein on a test data set.
The sample frames shown in the figure are partial frames of each test video. These examples show that, compared with the better-performing LSTM-TSAIV model, the video description framework proposed herein generates more accurate English descriptions of the test videos.

Claims (6)

1. A video description method based on deep transfer learning, characterized in that it comprises the following steps:
1) representing the video into a vector form through a convolutional neural network video representation model;
2) constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features;
3) transferring the image semantic feature detection model in the step 2) into a frame stream domain to obtain a new semantic feature detection model so as to extract the frame stream semantic features and realize the depth fusion of the image domain and the frame stream domain semantic features;
4) constructing a deep migration learning video description framework and generating a video natural language description, and specifically comprising the following steps:
(1) obtaining a vector representation v of a given video by using a convolutional neural network video representation model, and inputting the vector representation v into a first layer of a recurrent neural network only at an initial moment;
(2) training an image semantic feature detection model on an image data set;
(3) splitting a given video frame into separate images, and sequentially inputting the images into a frame stream semantic feature detection model;
(4) taking a given video frame as a frame stream, and inputting the frame stream into a frame stream semantic feature detection model in parallel;
(5) obtaining the fused semantic features A_iv with the frame stream semantic feature detection model, and inputting A_iv into the second layer of the recurrent neural network;
(6) inputting the English description of the given video into the first layer of the recurrent neural network word by word, and predicting the output word at the next time step from the input words at the current and previous time steps, in combination with the inputs of the above four steps, so as to train the video description framework.
2. The video description method based on deep migration learning of claim 1, characterized in that: in the step 1), a convolutional neural network model is adopted to perform the video representation task: for a group of frames sampled from the video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as an n-dimensional vector.
3. The video description method based on deep migration learning of claim 2, characterized in that: in the step 2):
the method comprises the following specific steps:
For a semantic feature w_a, if w_a appears in the annotated text description of image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag; each bag is first input into the image semantic feature detection model, and the probability that the semantic feature w_a occurs in bag b_I is then computed from the probabilities of all the regions in the bag, as shown in formula (1):
Pr(w_a | b_I) = 1 − ∏_{r_i∈b_I} (1 − p_i^{w_a})    (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional neural network; in addition, the activation of the last convolutional layer of the fully convolutional neural network has dimension x × x × h, where h is the representation dimension of each region in a bag, so an x × x feature map is obtained for each bag; the model is then optimized with a cross-entropy loss layer; finally, the image semantic feature detection model is obtained by training on an image description dataset, a probability distribution over all semantic features is computed for each individual sampled frame, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from the images.
4. The video description method based on deep migration learning of claim 3, characterized in that: step 3) described above, the domain formed by the image samples is called the source domain, the domain formed by the frame stream samples is called the target domain, and the final target of the model is: for the distribution of the target domain, given an input x, the semantic features y can be predicted;
the method comprises the following specific steps:
in the training process, for each input x, a domain label d needs to be predicted besides the semantic features; if d is 0, x is from the source domain; if d is 1, x is from the target domain, and the frame stream semantic feature detection model can be decomposed into three parts, wherein the specific working process is as follows: first, by mapping GfMapping input x into a D-dimensional feature vector f ∈ RDThe mapped parameter vector is thetaf(ii) a Then, by mapping GyMapping the feature vector f into semantic feature y, and mapping the parameter vector thetay(ii) a Finally, by a mapping GdMapping the same feature vector f into a domain label d, wherein the mapped parameter vector is thetad
5. The video description method based on deep migration learning of claim 4, wherein:
in the training process, the frame flow semantic feature detection model meets the following three parameters:
(1) finding a parameter θyMinimizing the loss of the semantic feature predictor in the source domain, and ensuring that the frame stream semantic feature detection model is not distorted in the source domain;
(2) finding a feature mapping parameter θfSo as to pass the mapping G on the source domainfExtracted feature SfAnd the feature T extracted from the target domainfSimilarly, distribution SfAnd TfBy computing a domain classifier GdObtaining domain invariant features to make the two feature distributions as similar as possible, thereby maximizing the loss of the domain classifier;
(3) finding a parameter θ of a domain classifierdMinimizing the loss of domain classifiers; three parameters that satisfy the requirement constitute one point (theta)fyd) Called saddle point, the whole training process can be expressed as equation (2):
E(θ_f, θ_y, θ_d) = Σ_{i=1,…,N; d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1,…,N} L_d^i(θ_f, θ_d)    (2)
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, and L_y^i and L_d^i denote the corresponding loss functions evaluated on the i-th training sample; the parameter λ is used to balance the feature vectors of the two domains formed during training; thus, the saddle point (θ_f, θ_y, θ_d) can be solved through formula (2), and it is searched for with the update rules shown in formulas (3), (4) and (5);
θ_f ← θ_f − μ(∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)    (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y    (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d    (5)
where μ is the learning rate; during back propagation, the gradient in formula (3) taken from the next layer is multiplied by −λ before being passed to the previous layer, and this part is the gradient reversal layer; the semantic feature detection model includes a feature extractor, a gradient reversal layer and a domain classifier, the feature extractor extracting the semantic features in the frame stream domain, and the domain classifier combined with the gradient reversal layer fusing the semantic features of the image domain and the frame stream domain; after training is finished, the frame stream semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain, and the semantic features obtained by the frame stream semantic feature detection model can be used directly as the input of the video description framework and are denoted A_iv.
6. The video description method based on deep migration learning of claim 5, wherein, in the step 4):
the model structure represented by the whole framework is described by equations (6) and (7),
E(v,Aiv,S)=-logP(S|v,Aiv) (6)
Figure FDA0003361771010000033
where v is the input video, AivTo fuse semantic features, S is a sentence description, E is an energy loss function, wtFor word representation, NsFor the number of words in the sentence, the final goal is to minimize the energy loss function and preserve the context between the words in the sentence;
in the framework, the video v is input into the first layer recurrent neural network unit only at the time t-1, and then A is inputivAs an additional input, the semantic information is enhanced by inputting the semantic information into the second layer recurrent neural network unit in each iteration, as shown in formulas (8), (9) and (10), t is from 0 to Ns-1 performing an iteration:
x-1=f1(Tvv)+Aiv (8)
xt=f1(Tswt)+Aiv (9)
ht=f2(xt) (10)
wherein the content of the first and second substances,
Figure FDA0003361771010000034
and
Figure FDA0003361771010000035
respectively, the transformation matrix and w of the video vtOf the transformation matrix, DeIs the dimension of the recurrent neural network input, DvIs the dimension of the video v, DwIs wtDimension of (2), xtAnd htInput and output of the second layer recurrent neural network unit, respectively, f1And f2Are the mapping functions within the first and second layers of recurrent neural network elements, respectively.
CN201810465849.4A 2018-03-22 2018-05-15 Video description method based on deep migration learning Active CN108765383B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810250752 2018-03-22
CN2018102507521 2018-03-22

Publications (2)

Publication Number Publication Date
CN108765383A CN108765383A (en) 2018-11-06
CN108765383B true CN108765383B (en) 2022-03-18

Family

ID=64008024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465849.4A Active CN108765383B (en) 2018-03-22 2018-05-15 Video description method based on deep migration learning

Country Status (1)

Country Link
CN (1) CN108765383B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN111464881B (en) * 2019-01-18 2021-08-13 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
CN110084296B (en) * 2019-04-22 2023-07-21 中山大学 Graph representation learning framework based on specific semantics and multi-label classification method thereof
CN110166850B (en) * 2019-05-30 2020-11-06 上海交通大学 Method and system for predicting panoramic video watching position by multiple CNN networks
CN110363164A (en) * 2019-07-18 2019-10-22 南京工业大学 A kind of unified approach based on LSTM time consistency video analysis
CN112446239A (en) * 2019-08-29 2021-03-05 株式会社理光 Neural network training and target detection method, device and storage medium
CN110909736A (en) * 2019-11-12 2020-03-24 北京工业大学 Image description method based on long-short term memory model and target detection algorithm
CN111988673B (en) * 2020-07-31 2023-05-23 清华大学 Method and related equipment for generating video description sentences
CN113177478B (en) * 2021-04-29 2022-08-05 西华大学 Short video semantic annotation method based on transfer learning
CN115098727A (en) * 2022-06-16 2022-09-23 电子科技大学 Video description generation method based on visual common sense knowledge representation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915400A (en) * 2015-05-29 2015-09-16 山西大学 Fuzzy correlation synchronized image retrieval method based on color histogram and non-subsampled contourlet transform (NSCT)
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105976401A (en) * 2016-05-20 2016-09-28 河北工业职业技术学院 Target tracking method and system based on partitioned multi-example learning algorithm
CN106202256A (en) * 2016-06-29 2016-12-07 西安电子科技大学 Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9244924B2 (en) * 2012-04-23 2016-01-26 Sri International Classification, search, and retrieval of complex video events

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915400A (en) * 2015-05-29 2015-09-16 山西大学 Fuzzy correlation synchronized image retrieval method based on color histogram and non-subsampled contourlet transform (NSCT)
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105976401A (en) * 2016-05-20 2016-09-28 河北工业职业技术学院 Target tracking method and system based on partitioned multi-example learning algorithm
CN106202256A (en) * 2016-06-29 2016-12-07 西安电子科技大学 Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Image Captioning with Semantic Attention; Q. You et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016-12-12; 4651-4659 *
Multi-lingual author identification and linguistic feature extraction — A machine learning approach; Hassan Alam et al.; 2013 IEEE International Conference on Technologies for Homeland Security (HST); 2014-01-06; 386-389 *
Unsupervised Domain Adaptation by Backpropagation; Ganin Y. et al.; ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning; 2015-02-27; vol. 50, no. 6; pp. 2-4 of the text *
Research on Image Semantic Retrieval and Classification Technology; Yi Wensheng; China Doctoral Dissertations Full-text Database (Information Science and Technology); 2007-02-15 (no. 2); pp. 76-77 of the text, section 5.4.1 *
Research on Video Multi-concept Detection Based on Multi-kernel Attribute Learning; Hui Kaifa et al.; Software Guide; 2017-06-30; vol. 16, no. 6; 1-5 *
Deep Fusion of Large-scale Features in Statistical Machine Translation; Liu Yupeng et al.; Journal of Zhejiang University; 2017-01-31; vol. 51, no. 1; 1-11 *

Also Published As

Publication number Publication date
CN108765383A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108765383B (en) Video description method based on deep migration learning
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
Yang et al. Learning transferred weights from co-occurrence data for heterogeneous transfer learning
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
Wu et al. GINet: Graph interaction network for scene parsing
Geng et al. Gated path selection network for semantic segmentation
Zhu et al. Not all features matter: Enhancing few-shot clip with adaptive prior refinement
CN111046275A (en) User label determining method and device based on artificial intelligence and storage medium
CN111985538A (en) Small sample picture classification model and method based on semantic auxiliary attention mechanism
CN112364791B (en) Pedestrian re-identification method and system based on generation of confrontation network
Zhang et al. Quantifying the knowledge in a DNN to explain knowledge distillation for classification
Shen et al. Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description.
CN109947923A (en) A kind of elementary mathematics topic type extraction method and system based on term vector
Jandial et al. Trace: Transform aggregate and compose visiolinguistic representations for image search with text feedback
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
Lai et al. Improving graph-based sentence ordering with iteratively predicted pairwise orderings
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
CN110795410A (en) Multi-field text classification method
Zhou et al. Multi-objective evolutionary generative adversarial network compression for image translation
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Xu et al. Cross-domain few-shot classification via inter-source stylization
CN116452688A (en) Image description generation method based on common attention mechanism
Sheng et al. Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos
CN113591731A (en) Knowledge distillation-based weak surveillance video time sequence behavior positioning method
Jiang et al. Learning from noisy labels with noise modeling network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant