CN108765383B - Video description method based on deep migration learning - Google Patents

Video description method based on deep migration learning Download PDF

Info

Publication number
CN108765383B
CN108765383B CN201810465849.4A CN201810465849A
Authority
CN
China
Prior art keywords
domain
video
semantic
image
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810465849.4A
Other languages
Chinese (zh)
Other versions
CN108765383A (en)
Inventor
张丽红
曹刘彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Publication of CN108765383A publication Critical patent/CN108765383A/en
Application granted granted Critical
Publication of CN108765383B publication Critical patent/CN108765383B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20164 Salient point detection; Corner detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of video processing, and particularly relates to a video description method based on deep migration learning. The method comprises the following steps: 1) representing a video as a vector through a convolutional neural network video representation model; 2) constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features; 3) transferring the image semantic feature detection model of step 2) into the frame stream domain to obtain a frame stream semantic feature detection model, so as to extract frame stream semantic features and realize deep fusion of the image domain and frame stream domain semantic features; 4) constructing a deep migration learning video description framework to generate a natural language description of the video. The invention deeply fuses the semantic features of different domains at the input end so as to improve the accuracy of the generated video descriptions.

Description

Video description method based on deep migration learning
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a video description method based on deep migration learning.
Background
Video description is the task of describing a video in natural language. It is a key and difficult problem in the fields of computer vision and natural language processing, and has broad application prospects in artificial intelligence.
Video description differs greatly from image description: not only the objects in each frame but also the motion of objects between frames must be understood. Existing video description methods fall mainly into four types: 1) words detected in the visual content are assigned to sentence fragments and a video description is then generated with a predefined language template; this approach depends heavily on the sentence templates, so the syntactic structure of the generated sentences is relatively fixed; 2) the probability distribution of a joint space formed by the visual content and text sentences is learned, and the generated sentences have more flexible syntactic structures; 3) an attribute detector is trained with multi-instance learning, and a video description is then generated by a maximum entropy language model based on the output of the attribute detector; 4) with a convolutional neural network and a recurrent neural network at the core, the semantic features mined from images and from the frame stream are integrated through a simple linear transfer unit to generate the video description. The first two types do not use semantic features in the video description process; the latter two consider semantic features at the input end, but do not deeply fuse the semantic features of different domains.
Conventional video description methods are not accurate enough in the semantics of the generated descriptions. To improve the description accuracy, a deep migration learning video description model is designed.
Disclosure of Invention
In order to solve the above problems, the present invention provides a video description method based on deep migration learning.
The invention adopts the following technical scheme: a video description method based on deep migration learning comprises the following steps,
1) representing the video into a vector form through a convolutional neural network video representation model;
2) constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features;
3) transferring the image semantic feature detection model in the step 2) into a frame stream domain to obtain a frame stream semantic feature detection model so as to extract frame stream semantic features and realize deep fusion of the image domain and the frame stream domain semantic features;
4) and constructing a deep migration learning video description framework to generate a video natural language description.
In the step 1), a convolutional neural network model is adopted to perform the video representation task: for a group of frames sampled from the video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as an n-dimensional vector.
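As an illustration of step 1), the following is a minimal sketch of the mean-pooled video representation described above. The choice of a VGG19 backbone and of its second fully connected layer as the frame feature are assumptions made for the sketch; the patent only specifies "a convolutional neural network model".

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNVideoRepresentation(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=None)          # pretrained weights could be loaded here
        self.features = vgg.features              # convolutional backbone
        self.avgpool = vgg.avgpool
        # keep the classifier only up to the second fully connected layer (fc2)
        self.fc_head = nn.Sequential(*list(vgg.classifier.children())[:5])

    def forward(self, frames):                    # frames: (num_frames, 3, 224, 224)
        x = self.features(frames)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        fc2 = self.fc_head(x)                     # (num_frames, 4096) fc2 activations
        return fc2.mean(dim=0)                    # mean pooling -> one n-dimensional video vector

# Usage sketch: v = CNNVideoRepresentation()(sampled_frames)  # frames sampled from one video
```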
In the step 2), multi-instance learning is adopted on a standard image description database to construct the image semantic feature detection model.
The method comprises the following specific steps:
For a semantic feature w_a, if w_a appears in the annotated text description of image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag. Each bag is first input into the image semantic feature detection model, where a fully convolutional neural network divides it into several regions; the probability that the semantic feature w_a occurs in bag b_I is then computed from the probabilities of all the regions (instances) in the bag, as shown in formula (1):
Pr(w_a | b_I) = 1 − ∏_{r_i∈b_I} (1 − p_i^{w_a})    (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional neural network. In addition, the activation of the last convolutional layer of the fully convolutional neural network has dimension x × x × h, where h is the representation dimension of each region in a bag, so an x × x feature map is obtained for each bag. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model is obtained by training on an image description dataset; a probability distribution over all semantic features is computed for each individual sampled frame, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from the images.
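For illustration, the following is a minimal sketch of the multi-instance learning bag probability of formula (1) and its cross-entropy training loss, assuming the noisy-OR combination of the per-region probabilities; the tensor shapes and function names are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

def bag_probabilities(region_logits):
    """region_logits: (num_regions, num_semantic_features) from the last conv layer.
    Returns the bag-level probability of each semantic feature, as in formula (1)."""
    p_region = torch.sigmoid(region_logits)             # p_i^{w_a} for every region and feature
    p_bag = 1.0 - torch.prod(1.0 - p_region, dim=0)     # noisy-OR over the regions in the bag
    return p_bag                                         # (num_semantic_features,)

def mil_loss(region_logits, bag_labels):
    """bag_labels: float tensor of 0/1 - 1 if the word appears in the image's
    annotation (positive bag), 0 otherwise (negative bag)."""
    p_bag = bag_probabilities(region_logits)
    return nn.functional.binary_cross_entropy(p_bag, bag_labels)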
In step 3), the domain formed by the image samples is called a source domain, the domain formed by the frame stream samples is called a target domain, and the final target of the model is: for the distribution of the target domain, given the input x, the semantic feature y can be predicted.
The method comprises the following specific steps:
in the training process, for each input x, a domain label d needs to be predicted besides the semantic features; if d is 0, x is from the source domain; if d is 1, x is from the target domain, and the semantic feature detection model can be decomposed into three parts, wherein the specific working process is as follows: first, by mapping GfMapping input x into a D-dimensional feature vector f ∈ RDThe mapped parameter vector is thetaf(ii) a Then, by mapping GyMapping the feature vector f into semantic feature y, and mapping the parameter vector thetay(ii) a Finally, by a mapping GdMapping the same feature vector f into a domain label d, wherein the mapped parameter vector is thetad
In the training stage, the frame stream semantic feature detection model must satisfy the following three conditions: (1) find parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain; (2) find feature mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are distributed similarly to the features T_f extracted on the target domain, where the similarity of the distributions S_f and T_f is estimated by computing the loss of the domain classifier G_d; domain-invariant features are obtained by making the two feature distributions as similar as possible, thereby maximizing the loss of the domain classifier; (3) find domain classifier parameters θ_d that minimize the loss of the domain classifier. The idea of an adversarial network is used here. The parameters satisfying these three requirements constitute a point (θ_f, θ_y, θ_d) called a saddle point. The whole training process can be expressed as formula (2):
E(θ_f, θ_y, θ_d) = Σ_{i=1,…,N; d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1,…,N} L_d^i(θ_f, θ_d)    (2)
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, and L_y^i and L_d^i denote the corresponding loss functions evaluated on the i-th training sample; the parameter λ is used to balance the feature vectors of the two domains formed during training. Thus, the saddle point (θ_f, θ_y, θ_d) can be solved through formula (2), and it is searched for with the update rules shown in formulas (3), (4) and (5):
θ_f ← θ_f − μ(∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)    (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y    (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d    (5)
where μ is the learning rate. During back propagation, the gradient in formula (3) taken from the next layer is multiplied by −λ before being passed to the previous layer; this part is the gradient reversal layer. The frame stream semantic feature detection model mainly comprises a feature extractor, a gradient reversal layer and a domain classifier. The feature extractor extracts the semantic features in the frame stream domain, while the domain classifier combined with the gradient reversal layer fuses the image domain and frame stream domain semantic features. After training is completed, the semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain. Since S_f and T_f are domain-invariant feature vectors, the semantic features of the image domain and the frame stream domain mapped from them also keep the domain-invariant property; that is, the semantic features extracted from the two domains are deeply fused. Therefore, the semantic features obtained by the frame stream semantic feature detection model can be used directly as the input of the video description framework, and are denoted A_iv.
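The gradient reversal layer described above can be sketched as follows, in the spirit of the cited "Unsupervised Domain Adaptation by Backpropagation": the forward pass is the identity and the backward pass multiplies the incoming gradient by −λ, so the feature extractor is pushed to produce domain-invariant features. The surrounding domain classifier module and its layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                      # identity in the forward direction

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # multiply the gradient by -lambda

class DomainClassifier(nn.Module):
    """Predicts the domain label d (0 = image/source domain, 1 = frame stream/target domain)."""
    def __init__(self, feat_dim, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, f):
        f = GradientReversal.apply(f, self.lam)  # reverse gradients flowing back into G_f
        return self.net(f)
```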
In the step 4), the workflow of the whole framework includes the following steps:
(1) obtaining a vector representation v of a given video by using a convolutional neural network video representation model, and inputting the vector representation v into a first layer of a recurrent neural network (LSTM) only at an initial moment;
(2) training an image semantic feature detection model on an image data set;
(3) splitting a given video frame into separate images, and sequentially inputting the images into a frame stream semantic feature detection model;
(4) taking a given video frame as a frame stream, and inputting the frame stream into a frame stream semantic feature detection model in parallel;
(5) obtaining the fused semantic features A_iv (vector representations of words such as "Man" and "Person") with the frame stream semantic feature detection model, and inputting A_iv into the second layer of the LSTM;
(6) inputting the English description of the given video into the first layer of the LSTM word by word, and predicting the output word at the next time step from the input words at the current and previous time steps, in combination with the inputs of the above four steps, so as to train the video description framework.
The model structure represented by the whole framework is described by equations (6) and (7),
E(v, A_iv, S) = −log P(S | v, A_iv)    (6)
log P(S | v, A_iv) = Σ_{t=0}^{N_s−1} log P(w_t | v, A_iv, w_0, …, w_{t−1})    (7)
where v is the input video, A_iv the fused semantic features, S the sentence description, E the energy loss function, w_t the representation of the t-th word, and N_s the number of words in the sentence; the ultimate goal is to minimize the energy loss function while preserving the contextual relationship between the words in the sentence.
In the framework, the video v is input into the first layer LSTM unit only at time t = −1; A_iv is then taken as an additional input and fed into the second layer LSTM unit at every iteration to reinforce the semantic information, as shown in formulas (8), (9) and (10), iterating for t from 0 to N_s − 1:
x_{−1} = f_1(T_v v) + A_iv    (8)
x_t = f_1(T_s w_t) + A_iv    (9)
h_t = f_2(x_t)    (10)
where T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of the word w_t respectively, D_e is the input dimension of the LSTM, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second layer LSTM unit respectively, and f_1 and f_2 are the mapping functions within the first and second layer LSTM units respectively.
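For illustration, a minimal sketch of the two-layer LSTM framework of formulas (8)-(10) follows. The hidden size of 1024 matches the experimental setup reported later; the vocabulary size, the word-prediction layer and the linear projection of A_iv to the LSTM input size are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size, d_v, d_w, d_e=1024, d_sem=1000):
        super().__init__()
        self.T_v = nn.Linear(d_v, d_e)            # transformation matrix T_v
        self.T_s = nn.Linear(d_w, d_e)            # transformation matrix T_s
        self.sem = nn.Linear(d_sem, d_e)          # project A_iv to the LSTM input size (assumption)
        self.lstm1 = nn.LSTMCell(d_e, d_e)        # first layer LSTM (f_1)
        self.lstm2 = nn.LSTMCell(d_e, d_e)        # second layer LSTM (f_2)
        self.out = nn.Linear(d_e, vocab_size)     # next-word prediction from h_t (assumption)

    def forward(self, v, a_iv, word_embeds):
        # v: (B, d_v); a_iv: (B, d_sem); word_embeds: (B, N_s, d_w), teacher forcing
        B = v.size(0)
        h1 = c1 = h2 = c2 = torch.zeros(B, self.lstm1.hidden_size, device=v.device)
        a = self.sem(a_iv)
        # t = -1: the video passes through the first layer; A_iv is added to its
        # output to form the second layer's input, as in formula (8)
        h1, c1 = self.lstm1(self.T_v(v), (h1, c1))
        h2, c2 = self.lstm2(h1 + a, (h2, c2))
        logits = []
        for t in range(word_embeds.size(1)):      # t = 0 .. N_s - 1, formulas (9)-(10)
            h1, c1 = self.lstm1(self.T_s(word_embeds[:, t]), (h1, c1))
            h2, c2 = self.lstm2(h1 + a, (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)         # (B, N_s, vocab_size) next-word scores
```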
Compared with the prior art, the invention constructs a new video description model. The model deeply fuses the semantic features of different domains at the input end by means of the deep domain adaptation method from transfer learning, so as to improve the accuracy of the generated video descriptions. Experiments on the MSVD video dataset verify the feasibility and effectiveness of the method, and show that the deep domain adaptation method achieves a better fusion of semantic features from different domains, further improving the accuracy of the video descriptions and the generalization ability of the network.
Drawings
FIG. 1 is a convolutional neural network video representation model;
FIG. 2 is an image semantic feature detection model;
FIG. 3 is a frame stream semantic feature detection model of the present invention;
FIG. 4 is a video description framework;
FIG. 5 is a partial result of the present invention on a test data set.
Detailed Description
The following describes in detail embodiments of the present invention.
A video description method based on deep migration learning comprises the following steps,
1) representing the video into a vector form through a convolutional neural network video representation model; the specific model structure is shown in fig. 1.
In the step 1), a convolutional neural network model is adopted to perform the video representation task: for a group of frames sampled from the video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as an n-dimensional vector.
2) Constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features. The image semantic feature detection model is shown in fig. 2.
The method comprises the following specific steps:
For a semantic feature w_a, if w_a appears in the annotated text description of image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag. Each bag is first input into the image semantic feature detection model (as shown in fig. 2), and the probability that the semantic feature w_a occurs in bag b_I is then computed from the probabilities of all the regions in the bag, as shown in formula (1):
Pr(w_a | b_I) = 1 − ∏_{r_i∈b_I} (1 − p_i^{w_a})    (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional neural network. In addition, the activation of the last convolutional layer of the fully convolutional neural network has dimension x × x × h, where h is the representation dimension of each region in a bag, so an x × x feature map is obtained for each bag. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model is obtained by training on an image description dataset; a probability distribution over all semantic features is computed for each individual sampled frame, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from the images.
3) Transferring the image semantic feature detection model in the step 2) into the frame stream domain to obtain a frame stream semantic feature detection model, so as to extract the frame stream semantic features and realize the deep fusion of the image domain and frame stream domain semantic features. The frame stream semantic feature detection model is shown in fig. 3.
The domain formed by the image samples is called the source domain, the domain formed by the frame stream samples is called the target domain, and the final target of the model is: for the distribution of the target domain, given an input x, the semantic features y can be predicted;
the method comprises the following specific steps:
in the training process, for each input x, a domain label d needs to be predicted besides the semantic features; if d is 0, x is from the source domain; if d is 1, x is from the target domain, and the semantic feature detection model can be decomposed into three parts, wherein the specific working process is as follows: first, by mapping GfMapping input x into a D-dimensional feature vector f ∈ RDThe mapped parameter vector is thetaf(ii) a Then, by mapping GyMapping the feature vector f into semantic feature y, and mapping the parameter vector thetay(ii) a Finally, by a mapping GdWill be identicalThe eigenvector f is mapped into a domain label d, and the mapped parameter vector is thetad
In the training phase, the semantic feature detection model must satisfy the following three conditions:
(1) find parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain;
(2) find feature mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are distributed similarly to the features T_f extracted on the target domain; the similarity of the distributions S_f and T_f is estimated by computing the loss of the domain classifier G_d, and domain-invariant features are obtained by making the two feature distributions as similar as possible, thereby maximizing the loss of the domain classifier;
(3) find domain classifier parameters θ_d that minimize the loss of the domain classifier; the parameters satisfying these three requirements constitute a point (θ_f, θ_y, θ_d) called a saddle point, and the whole training process can be expressed as formula (2):
E(θ_f, θ_y, θ_d) = Σ_{i=1,…,N; d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1,…,N} L_d^i(θ_f, θ_d)    (2)
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, and L_y^i and L_d^i denote the corresponding loss functions evaluated on the i-th training sample; the parameter λ is used to balance the feature vectors of the two domains formed during training. Thus, the saddle point (θ_f, θ_y, θ_d) can be solved through formula (2), and it is searched for with the update rules shown in formulas (3), (4) and (5):
θ_f ← θ_f − μ(∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)    (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y    (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d    (5)
where μ is the learning rate. During back propagation, the gradient in formula (3) taken from the next layer is multiplied by −λ before being passed to the previous layer; this part is the gradient reversal layer. The semantic feature detection model mainly comprises a feature extractor, a gradient reversal layer and a domain classifier. After training is completed, the semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain; the semantic features obtained by the resulting frame stream semantic feature detection model can be used directly as the input of the video description framework, and are denoted A_iv.
4) Constructing a deep migration learning video description framework to generate a natural language description of the video.
The method comprises the following steps:
(1) obtaining a vector representation v of a given video by using a convolutional neural network video representation model, and inputting the vector representation v into a first layer of a recurrent neural network only at an initial moment;
(2) training an image semantic feature detection model on an image data set;
(3) splitting a given video frame into separate images, and sequentially inputting the images into a frame stream semantic feature detection model;
(4) taking a given video frame as a frame stream, and inputting the frame stream into a frame stream semantic feature detection model in parallel;
(5) obtaining the fused semantic features A_iv with the frame stream semantic feature detection model, and inputting A_iv into the second layer of the recurrent neural network;
(6) inputting the English description of the given video into the first layer of the recurrent neural network word by word, and predicting the output word at the next time step from the input words at the current and previous time steps, in combination with the inputs of the above four steps, so as to train the video description framework. The video description framework structure is shown in fig. 4.
The model structure represented by the whole framework is described by equations (6) and (7),
E(v, A_iv, S) = −log P(S | v, A_iv)    (6)
log P(S | v, A_iv) = Σ_{t=0}^{N_s−1} log P(w_t | v, A_iv, w_0, …, w_{t−1})    (7)
where v is the input video, A_iv the fused semantic features, S the sentence description, E the energy loss function, w_t the representation of the t-th word, and N_s the number of words in the sentence; the ultimate goal is to minimize the energy loss function while preserving the contextual relationship between the words in the sentence.
in the framework, the video v is input into the first layer recurrent neural network unit only at the time t-1, and then A is inputivAs an additional input, the semantic information is enhanced by inputting the semantic information into the second layer recurrent neural network unit in each iteration, as shown in formulas (8), (9) and (10), t is from 0 to Ns-1 performing an iteration:
x-1=f1(Tvv)+Aiv (8)
xt=f1(Tswt)+Aiv (9)
ht=f2(xt) (10)
wherein the content of the first and second substances,
Figure RE-GDA0001738132390000072
and
Figure RE-GDA0001738132390000073
respectively, the transformation matrix and w of the video vtOf the transformation matrix, DeIs the dimension of the recurrent neural network input, DvIs the dimension of the video v, DwIs wtDimension of (2), xtAnd htInput and output of the second layer recurrent neural network unit, respectively, f1And f2Are the mapping functions within the first and second layers of recurrent neural network elements, respectively.
Experiment and analysis of results
Data set:
in order to evaluate the video description model of the present invention, the most popular video description data set MSVD on YouTube was selected. MSVD contains 1970 video clips collected from YouTube. There are about 40 english descriptions available per video. In the experiment, 1200 videos were used for training, 100 videos were used for validation, and 670 videos were used for testing. Furthermore, an image data set COCO is used.
Evaluation indexes are as follows:
in order to quantitatively evaluate the proposed video description framework, three indexes commonly used in the video description task are adopted in the text: BLEU @ N (BiLingulal Evaluation understudy), METEOR, and CIDER-D (Consensus-based Image Description Evaluation). For the BLEU @ N index, N is 3 and 4. All metrics were calculated using the code published by the microsoft Coco rating service. The calculation results of the three indexes are all percentages, and the higher the score is, the closer the generated video description is to the reference description is.
Experimental setup:
the present invention uniformly samples 25 frames per video and represents each word in the sentence as a "one-hot" vector; for video representation, VGG19 was pre-trained on the Imagenet ILSVRC12 dataset, and then the model shown in fig. 1 was fine-tuned on MSVD; in order to represent the fused semantic features extracted from the two domains, 1000 most commonly used words are selected on a COCO image data set and an MSVD video data set respectively as the labeled semantic features of the two domains[4]I.e. training data sets for both models of fig. 2 and 3. Firstly, training the model of FIG. 2 on a COCO training set, and then training the model of FIG. 3 on two training sets of COCO and MSVD to generate a final 1000-dimensional probability vector; in LSTM, the dimensions of the input and hidden layers are set to 1024. In the testing phase, the Beam Search strategy is adopted, and the training in FIG. 4 is utilizedThe refined model generates a new video sentence description and sets the beam size to 4.
Quantitative analysis:
table 1 shows the comparison of the scores of the video description model proposed herein and the existing seven models on each evaluation index on the MSVD test data set. The simulation results obtained by machines with different configurations are different, and the data listed in the table are all referred to the same machine.
Table 1  Score comparison of each model
Models 1-4 in the table use attention-based methods and do not introduce semantic features; models 5 and 6 use the semantic features of only a single domain; model 7 uses the semantic features of both domains but fuses them only through simple linear fusion. Analysing the data in the table shows that the video description model proposed herein obtains higher scores on all four evaluation indexes. It follows that: 1) in a video description framework, using high-level semantic features enhances the visual representation and helps the model learn video description; 2) using the semantic features of a single domain (image domain or frame stream domain) alone does not obviously improve video description performance; 3) simple linear fusion of the semantic features of the two domains improves the various indexes of video description, but still has shortcomings and needs improvement; 4) the fused semantic features obtained by the deep domain adaptation method from transfer learning markedly improve video description performance, i.e., the method is more effective at fusing semantic features.
Qualitative analysis:
fig. 5 shows partial results of the video description model proposed herein on a test data set.
The sample frames shown in the figure are partial frames of each test video. These examples show that, compared with the better-performing LSTM-TSAIV model, the video description framework proposed herein generates more accurate English descriptions of the test videos.

Claims (6)

1. A video description method based on deep transfer learning, characterized in that it comprises the following steps:
1) representing the video into a vector form through a convolutional neural network video representation model;
2) constructing an image semantic feature detection model by utilizing multi-instance learning so as to extract image domain semantic features;
3) transferring the image semantic feature detection model in the step 2) into a frame stream domain to obtain a new semantic feature detection model so as to extract the frame stream semantic features and realize the depth fusion of the image domain and the frame stream domain semantic features;
4) constructing a deep migration learning video description framework and generating a video natural language description, and specifically comprising the following steps:
(1) obtaining a vector representation v of a given video by using a convolutional neural network video representation model, and inputting the vector representation v into a first layer of a recurrent neural network only at an initial moment;
(2) training an image semantic feature detection model on an image data set;
(3) splitting a given video frame into separate images, and sequentially inputting the images into a frame stream semantic feature detection model;
(4) taking a given video frame as a frame stream, and inputting the frame stream into a frame stream semantic feature detection model in parallel;
(5) obtaining the fused semantic features A_iv with the frame stream semantic feature detection model, and inputting A_iv into the second layer of the recurrent neural network;
(6) inputting the English description of the given video into the first layer of the recurrent neural network word by word, and predicting the output word at the next time step from the input words at the current and previous time steps, in combination with the inputs of the above four steps, so as to train the video description framework.
2. The video description method based on deep migration learning of claim 1, characterized in that: in the step 1), a convolutional neural network model is adopted to perform the video representation task: for a group of frames sampled from the video, each frame is input into the convolutional neural network model and the output of the second fully connected layer is extracted; mean pooling is then performed over all sampled frames, so that a video segment is represented as an n-dimensional vector.
3. The video description method based on deep migration learning of claim 2, characterized in that: in the step 2):
the method comprises the following specific steps:
For a semantic feature w_a, if w_a appears in the annotated text description of image I, then image I is regarded as a positive bag; otherwise, image I is regarded as a negative bag; each bag is first input into the image semantic feature detection model, and the probability that the semantic feature w_a occurs in bag b_I is then computed from the probabilities of all the regions in the bag, as shown in formula (1):
Pr(w_a | b_I) = 1 − ∏_{r_i∈b_I} (1 − p_i^{w_a})    (1)
where p_i^{w_a} is the probability of feature w_a predicted from region r_i, computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional neural network; in addition, the activation of the last convolutional layer of the fully convolutional neural network has dimension x × x × h, where h is the representation dimension of each region in a bag, so an x × x feature map is obtained for each bag; the model is then optimized with a cross-entropy loss layer; finally, the image semantic feature detection model is obtained by training on an image description dataset, a probability distribution over all semantic features is computed for each individual sampled frame, and mean pooling over the feature distributions of all sampled frames yields the final representation of the semantic features learned from the images.
4. The video description method based on deep migration learning of claim 3, characterized in that: step 3) described above, the domain formed by the image samples is called the source domain, the domain formed by the frame stream samples is called the target domain, and the final target of the model is: for the distribution of the target domain, given an input x, the semantic features y can be predicted;
the method comprises the following specific steps:
in the training process, for each input x, a domain label d needs to be predicted besides the semantic features; if d is 0, x is from the source domain; if d is 1, x is from the target domain, and the frame stream semantic feature detection model can be decomposed into three parts, wherein the specific working process is as follows: first, by mapping GfMapping input x into a D-dimensional feature vector f ∈ RDThe mapped parameter vector is thetaf(ii) a Then, by mapping GyMapping the feature vector f into semantic feature y, and mapping the parameter vector thetay(ii) a Finally, by a mapping GdMapping the same feature vector f into a domain label d, wherein the mapped parameter vector is thetad
5. The video description method based on deep migration learning of claim 4, wherein:
in the training process, the frame flow semantic feature detection model meets the following three parameters:
(1) finding a parameter θyMinimizing the loss of the semantic feature predictor in the source domain, and ensuring that the frame stream semantic feature detection model is not distorted in the source domain;
(2) finding a feature mapping parameter θfSo as to pass the mapping G on the source domainfExtracted feature SfAnd the feature T extracted from the target domainfSimilarly, distribution SfAnd TfBy computing a domain classifier GdObtaining domain invariant features to make the two feature distributions as similar as possible, thereby maximizing the loss of the domain classifier;
(3) finding a parameter θ of a domain classifierdMinimizing the loss of domain classifiers; three parameters that satisfy the requirement constitute one point (theta)fyd) Called saddle point, the whole training process can be expressed as equation (2):
E(θ_f, θ_y, θ_d) = Σ_{i=1,…,N; d_i=0} L_y^i(θ_f, θ_y) − λ Σ_{i=1,…,N} L_d^i(θ_f, θ_d)    (2)
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, and L_y^i and L_d^i denote the corresponding loss functions evaluated on the i-th training sample; the parameter λ is used to balance the feature vectors of the two domains formed during training; thus, the saddle point (θ_f, θ_y, θ_d) can be solved through formula (2), and it is searched for with the update rules shown in formulas (3), (4) and (5);
θ_f ← θ_f − μ(∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f)    (3)
θ_y ← θ_y − μ ∂L_y^i/∂θ_y    (4)
θ_d ← θ_d − μ ∂L_d^i/∂θ_d    (5)
where μ is the learning rate; during back propagation, the gradient in formula (3) taken from the next layer is multiplied by −λ before being passed to the previous layer, and this part is the gradient reversal layer; the semantic feature detection model includes a feature extractor, a gradient reversal layer and a domain classifier, the feature extractor extracting the semantic features in the frame stream domain, and the domain classifier combined with the gradient reversal layer fusing the semantic features of the image domain and the frame stream domain; after training is finished, the frame stream semantic feature predictor is used to predict the semantic features of samples from the target domain and the source domain, and the semantic features obtained by the frame stream semantic feature detection model can be used directly as the input of the video description framework and are denoted A_iv.
6. The video description method based on deep migration learning of claim 5, wherein, in the step 4):
the model structure represented by the whole framework is described by equations (6) and (7),
E(v,Aiv,S)=-logP(S|v,Aiv) (6)
Figure FDA0003361771010000033
where v is the input video, AivTo fuse semantic features, S is a sentence description, E is an energy loss function, wtFor word representation, NsFor the number of words in the sentence, the final goal is to minimize the energy loss function and preserve the context between the words in the sentence;
in the framework, the video v is input into the first layer recurrent neural network unit only at the time t-1, and then A is inputivAs an additional input, the semantic information is enhanced by inputting the semantic information into the second layer recurrent neural network unit in each iteration, as shown in formulas (8), (9) and (10), t is from 0 to Ns-1 performing an iteration:
x-1=f1(Tvv)+Aiv (8)
xt=f1(Tswt)+Aiv (9)
ht=f2(xt) (10)
wherein the content of the first and second substances,
Figure FDA0003361771010000034
and
Figure FDA0003361771010000035
respectively, the transformation matrix and w of the video vtOf the transformation matrix, DeIs the dimension of the recurrent neural network input, DvIs the dimension of the video v, DwIs wtDimension of (2), xtAnd htInput and output of the second layer recurrent neural network unit, respectively, f1And f2Are the mapping functions within the first and second layers of recurrent neural network elements, respectively.
CN201810465849.4A 2018-03-22 2018-05-15 Video description method based on deep migration learning Active CN108765383B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810250752 2018-03-22
CN2018102507521 2018-03-22

Publications (2)

Publication Number Publication Date
CN108765383A CN108765383A (en) 2018-11-06
CN108765383B true CN108765383B (en) 2022-03-18

Family

ID=64008024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465849.4A Active CN108765383B (en) 2018-03-22 2018-05-15 Video description method based on deep migration learning

Country Status (1)

Country Link
CN (1) CN108765383B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN111464881B (en) * 2019-01-18 2021-08-13 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
CN110084296B (en) * 2019-04-22 2023-07-21 中山大学 Graph representation learning framework based on specific semantics and multi-label classification method thereof
CN110166850B (en) * 2019-05-30 2020-11-06 上海交通大学 Method and system for predicting panoramic video watching position by multiple CNN networks
CN110363164A (en) * 2019-07-18 2019-10-22 南京工业大学 A kind of unified approach based on LSTM time consistency video analysis
CN112446239A (en) * 2019-08-29 2021-03-05 株式会社理光 Neural network training and target detection method, device and storage medium
CN110909736A (en) * 2019-11-12 2020-03-24 北京工业大学 Image description method based on long-short term memory model and target detection algorithm
CN111988673B (en) * 2020-07-31 2023-05-23 清华大学 Method and related equipment for generating video description sentences
CN113177478B (en) * 2021-04-29 2022-08-05 西华大学 Short video semantic annotation method based on transfer learning
CN115098727A (en) * 2022-06-16 2022-09-23 电子科技大学 Video description generation method based on visual common sense knowledge representation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915400A (en) * 2015-05-29 2015-09-16 山西大学 Fuzzy correlation synchronized image retrieval method based on color histogram and non-subsampled contourlet transform (NSCT)
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105976401A (en) * 2016-05-20 2016-09-28 河北工业职业技术学院 Target tracking method and system based on partitioned multi-example learning algorithm
CN106202256A (en) * 2016-06-29 2016-12-07 西安电子科技大学 Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9244924B2 (en) * 2012-04-23 2016-01-26 Sri International Classification, search, and retrieval of complex video events

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915400A (en) * 2015-05-29 2015-09-16 山西大学 Fuzzy correlation synchronized image retrieval method based on color histogram and non-subsampled contourlet transform (NSCT)
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105976401A (en) * 2016-05-20 2016-09-28 河北工业职业技术学院 Target tracking method and system based on partitioned multi-example learning algorithm
CN106202256A (en) * 2016-06-29 2016-12-07 西安电子科技大学 Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Image Captioning with Semantic Attention; Q. You et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016-12-12; 4651-4659 *
Multi-lingual author identification and linguistic feature extraction — A machine learning approach; Hassan Alam et al.; 2013 IEEE International Conference on Technologies for Homeland Security (HST); 2014-01-06; 386-389 *
Unsupervised Domain Adaptation by Backpropagation; Ganin Y. et al.; ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning; 2015-02-27; vol. 50, no. 6; pp. 2-4 of the text *
Research on Image Semantic Retrieval and Classification Technology; Yi Wensheng; China Doctoral Dissertations Full-text Database (Information Science and Technology); 2007-02-15 (no. 2); pp. 76-77 of the text, section 5.4.1 *
Research on Video Multi-concept Detection Based on Multi-kernel Attribute Learning; Hui Kaifa et al.; Software Guide; 2017-06-30; vol. 16, no. 6; 1-5 *
Deep Fusion of Large-scale Features in Statistical Machine Translation; Liu Yupeng et al.; Journal of Zhejiang University; 2017-01-31; vol. 51, no. 1; 1-11 *

Also Published As

Publication number Publication date
CN108765383A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108765383B (en) Video description method based on deep migration learning
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
Yang et al. Learning transferred weights from co-occurrence data for heterogeneous transfer learning
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
Wu et al. GINet: Graph interaction network for scene parsing
Geng et al. Gated path selection network for semantic segmentation
Zhu et al. Not all features matter: Enhancing few-shot clip with adaptive prior refinement
CN111046275A (en) User label determining method and device based on artificial intelligence and storage medium
CN111985538A (en) Small sample picture classification model and method based on semantic auxiliary attention mechanism
CN112364791B (en) Pedestrian re-identification method and system based on generation of confrontation network
Zhang et al. Quantifying the knowledge in a DNN to explain knowledge distillation for classification
Shen et al. Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description.
CN109947923A (en) A kind of elementary mathematics topic type extraction method and system based on term vector
Jandial et al. Trace: Transform aggregate and compose visiolinguistic representations for image search with text feedback
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
Lai et al. Improving graph-based sentence ordering with iteratively predicted pairwise orderings
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
CN110795410A (en) Multi-field text classification method
Zhou et al. Multi-objective evolutionary generative adversarial network compression for image translation
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Xu et al. Cross-domain few-shot classification via inter-source stylization
CN116452688A (en) Image description generation method based on common attention mechanism
Sheng et al. Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos
CN113591731A (en) Knowledge distillation-based weak surveillance video time sequence behavior positioning method
Jiang et al. Learning from noisy labels with noise modeling network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant