CN108765383A - Video description method based on deep transfer learning - Google Patents

Video description method based on deep transfer learning

Info

Publication number
CN108765383A
Authority
CN
China
Prior art keywords
semantic feature
video
frame
input
image
Prior art date
Legal status
Granted
Application number
CN201810465849.4A
Other languages
Chinese (zh)
Other versions
CN108765383B (en)
Inventor
张丽红
曹刘彬
Current Assignee
Shanxi University
Original Assignee
Shanxi University
Priority date
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Publication of CN108765383A
Application granted
Publication of CN108765383B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20164 Salient point detection; Corner detection

Abstract

The invention belongs to the technical field of video processing, and specifically relates to a video description method based on deep transfer learning. The method comprises the following steps: 1) represent the video as a vector with a convolutional-neural-network video representation model; 2) build an image semantic feature detection model with multiple-instance learning to extract image-domain semantic features; 3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a frame-stream semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features; 4) build a deep-transfer-learning video description framework and generate a natural-language description of the video. The present invention deeply fuses the semantic features of different domains at the input end, thereby improving the accuracy of the generated video descriptions.

Description

Video description method based on deep transfer learning
Technical field
The invention belongs to the technical field of video processing, and specifically relates to a video description method based on deep transfer learning.
Background technology
Video description is the task of describing a video with natural language. It is a focus and a difficulty of the computer vision and natural language processing fields, and has broad application prospects in artificial intelligence.
Video description differs substantially from image description: a video description model must not only understand the objects in each frame but also understand how the objects move across frames. Existing video description methods fall into four main classes: 1) assign the words detected in the visual content to sentence fragments and then generate the video description with predefined language templates; these methods depend heavily on the sentence template, so the syntactic structure of the generated sentences is relatively fixed; 2) learn a probability distribution over a joint space of visual content and textual sentences, which yields sentences with more flexible syntax; 3) train attribute detectors with multiple-instance learning and then generate the video description with a maximum-entropy language model driven by the outputs of the attribute detectors; 4) centre the model on a convolutional neural network and a recurrent neural network and combine the semantic features mined from images and from the frame stream through a simple linear transfer unit. The first two classes do not use semantic features during video description; the latter two consider semantic features at the input end, but they do not deeply fuse the semantic features of the different domains.
The semantics of the descriptions produced by existing video description methods are not accurate enough. To improve the accuracy of the descriptions, a deep-transfer-learning video description model is therefore designed.
Summary of the invention
To solve the above problems, the present invention provides a video description method based on deep transfer learning.
The present invention adopts the following technical scheme. A video description method based on deep transfer learning comprises the following steps:
1) represent the video as a vector with a convolutional-neural-network video representation model;
2) build an image semantic feature detection model with multiple-instance learning to extract image-domain semantic features;
3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a frame-stream semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features;
4) build the deep-transfer-learning video description framework and generate the natural-language description of the video.
In step 1), video representation is carried out with a convolutional neural network model. For a group of sampled frames of the video, each frame is fed into the convolutional neural network, the output of the second fully connected layer is extracted, and mean pooling is then performed over all sampled frames, so that the video segment is represented as a single n-dimensional vector.
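As an illustration only, a minimal sketch of this frame-pooling step is given below, assuming a PyTorch/torchvision VGG19 backbone (the backbone used in the experiments later) and taking the output of its second fully connected layer; the layer index and input size are assumptions, not part of the patent.

```python
import torch
import torchvision.models as models

class VideoRepresentation(torch.nn.Module):
    """Sketch of step 1): mean-pool the second fully connected layer's output
    over the sampled frames to obtain a single n-dimensional video vector."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=None)   # pretrained / fine-tuned weights would be loaded here
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        # keep the classifier up to and including the second fully connected layer
        self.fc = torch.nn.Sequential(*list(vgg.classifier.children())[:4])

    def forward(self, frames):             # frames: (num_frames, 3, 224, 224)
        x = self.features(frames)
        x = self.avgpool(x).flatten(1)
        x = self.fc(x)                     # (num_frames, 4096) second-FC outputs
        return x.mean(dim=0)               # mean pooling -> one 4096-dimensional vector v
```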
In step 2), the image semantic feature detection model is built with multiple-instance learning on a standard image captioning dataset.
Specifically:
for a semantic feature w_a, if w_a appears in the annotated text description of image I, image I is treated as a positive bag; otherwise image I is treated as a negative bag. Each bag is first fed into the image semantic feature detection model, where a fully convolutional neural network divides the bag into multiple regions (instances). The probability that bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
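Formula (1) is reproduced only as an image in the original publication. A plausible reconstruction, assuming the standard noisy-OR multiple-instance pooling over the regions r_i of bag b_I, is:

```latex
P\!\left(w_a \mid b_I\right) = 1 - \prod_{r_i \in b_I}\left(1 - p_{w_a}^{\,r_i}\right) \qquad (1)
```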
Here p_{w_a}^{r_i} is the probability of feature w_a predicted from region r_i; it is computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network. The output activation of the last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so each bag yields an x × x feature map. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model trained on the image captioning dataset is used to compute, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the distributions of all sampled frames gives the final representation of the semantic features learned from images.
In step 3), the domain formed by the image samples is called the source domain and the domain formed by the frame-stream samples is called the target domain. The final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic feature y.
Specifically:
during training, for each input x, in addition to the semantic feature to be predicted, a domain label d must also be predicted; d = 0 means x comes from the source domain, and d = 1 means x comes from the target domain. The semantic feature detection model can be decomposed into three parts and works as follows. First, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f. Then a mapping G_y maps the feature vector f to the semantic feature y, with parameter vector θ_y. Finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
In the training stage, the frame-stream semantic feature detection model must satisfy three requirements on its parameters: (1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, which ensures that the semantic feature detection model is not distorted on the source domain; (2) find the feature-mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible (i.e. obtaining domain-invariant features) amounts to maximizing the domain classifier loss; (3) find the domain classifier parameters θ_d that minimize the domain classifier loss. This uses the idea of adversarial networks. Parameters satisfying these three requirements form a point (θ_f, θ_y, θ_d) called a saddle point. The whole training process can be expressed as formula (2):
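Formula (2) is likewise an image in the original publication. A plausible reconstruction, assuming the domain-adversarial objective of Ganin et al. cited below (with N training samples, of which the source-domain samples carry semantic labels), is:

```latex
E(\theta_f,\theta_y,\theta_d)
  = \sum_{i=1,\ldots,N;\; d_i = 0} L_y^{i}\!\left(G_y(G_f(x_i;\theta_f);\theta_y),\, y_i\right)
  - \lambda \sum_{i=1,\ldots,N} L_d^{i}\!\left(G_d(G_f(x_i;\theta_f);\theta_d),\, d_i\right) \qquad (2)
```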
Here L_y(·,·) is the loss of semantic feature prediction and L_d(·,·) is the loss of domain classification; L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample; the parameter λ balances the feature vectors of the two domains formed during training. The saddle point (θ_f, θ_y, θ_d) can thus be solved from formula (2); it is searched for with the updates shown in formulas (3), (4) and (5):
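Formulas (3)-(5) are also images in the original publication. A plausible reconstruction, assuming the usual stochastic gradient updates with learning rate μ for this saddle-point search, is:

```latex
\theta_f \leftarrow \theta_f - \mu\left(\frac{\partial L_y^{i}}{\partial \theta_f} - \lambda \frac{\partial L_d^{i}}{\partial \theta_f}\right) \qquad (3)\\
\theta_y \leftarrow \theta_y - \mu \frac{\partial L_y^{i}}{\partial \theta_y} \qquad (4)\\
\theta_d \leftarrow \theta_d - \mu \frac{\partial L_d^{i}}{\partial \theta_d} \qquad (5)
```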
Here μ is the learning rate. During back-propagation, the gradient taken from the next layer in formula (3) is multiplied by -λ and passed to the previous layer; this part is the gradient reversal layer. The frame-stream semantic feature detection model therefore consists mainly of a feature extractor, a gradient reversal layer and a domain classifier. The feature extractor extracts the semantic features of the frame-stream domain, while the domain classifier combined with the gradient reversal layer fuses the image-domain and frame-stream-domain semantic features. After training, the semantic feature predictor predicts the semantic features of samples from both the target domain and the source domain. Since S_f and T_f are domain-invariant feature vectors, the image-domain and frame-stream-domain semantic features obtained through them also keep the domain-invariant property, i.e. the semantic features extracted in the two domains are deeply fused. The semantic features obtained with the frame-stream semantic feature detection model can therefore be used directly as the input of the video description framework; this semantic feature is denoted A_iv.
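As an illustration, a minimal sketch of such a gradient reversal layer is given below in PyTorch autograd terms; the factor lam corresponds to the λ of formula (3), and its value is an assumed hyper-parameter.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by -lambda
    in the backward pass, as the gradient reversal layer described above."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # the gradient taken from the next layer is multiplied by -lambda and passed back
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```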
In step 4), the workflow of the whole framework comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional-neural-network video representation model, and feed it into the first layer of the recurrent neural network (a Long Short-Term Memory network, LSTM) only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the frames of the given video into individual images and feed them sequentially into the frame-stream semantic feature detection model;
(4) treat the frames of the given video as a frame stream and feed them into the frame-stream semantic feature detection model in parallel;
(5) obtain the fused semantic feature A_iv, e.g. the vector representations of words such as "Man" and "Person", with the frame-stream semantic feature detection model, and feed A_iv into the second layer of the LSTM;
(6) feed the English description of the given video word by word into the first layer of the LSTM and, combined with the inputs of the preceding steps, predict the output word of the next time step from the input words of the current and previous time steps; the video description framework is trained in this way.
The model structure represented by the whole framework is described by formulas (6) and (7):
E(v, A_iv, S) = -log P(S | v, A_iv)  (6)
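Formula (7) is not reproduced in this text. Assuming the usual factorisation of the sentence log-probability over its words, it would read:

```latex
\log P(S \mid v, A_{iv}) = \sum_{t=0}^{N_s-1} \log P\!\left(w_t \mid v, A_{iv}, w_0, \ldots, w_{t-1}\right) \qquad (7)
```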
Here v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation and N_s is the number of words in the sentence. The final objective is to minimize the energy loss function while preserving the contextual relations between the words of the sentence.
In the framework, the video v is fed into the first-layer LSTM unit only at time t = -1; A_iv is then fed into the second-layer LSTM unit as an additional input at every iteration, which reinforces the semantic information, as shown in formulas (8), (9) and (10), where t iterates from 0 to N_s - 1:
x_{-1} = f_1(T_v v) + A_iv  (8)
x_t = f_1(T_s w_t) + A_iv  (9)
h_t = f_2(x_t)  (10)
Here T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively, D_e is the dimension of the LSTM input, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second-layer LSTM unit respectively, and f_1 and f_2 are the mapping functions of the first-layer and second-layer LSTM units respectively.
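As an illustration, a minimal sketch of this two-layer decoder is given below with PyTorch LSTM cells; the projection of A_iv to the LSTM input size, the vocabulary size and the default dimension of 1024 (taken from the experimental setup) are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class TwoLayerDescriptionDecoder(nn.Module):
    """Sketch of formulas (8)-(10): the video v enters layer 1 only at t = -1,
    and the fused semantic feature A_iv is added to the layer-2 input at every step."""
    def __init__(self, d_video, d_word, d_sem, d_e=1024, vocab_size=1000):
        super().__init__()
        self.T_v = nn.Linear(d_video, d_e, bias=False)    # T_v of formula (8)
        self.T_s = nn.Linear(d_word, d_e, bias=False)     # T_s of formula (9)
        self.lstm1 = nn.LSTMCell(d_e, d_e)                # f_1: first-layer LSTM unit
        self.lstm2 = nn.LSTMCell(d_e, d_e)                # f_2: second-layer LSTM unit
        self.sem_proj = nn.Linear(d_sem, d_e, bias=False) # maps A_iv to the LSTM input size
        self.out = nn.Linear(d_e, vocab_size)             # scores the next word

    def forward(self, v, a_iv, word_embeddings):
        # v: (B, d_video); a_iv: (B, d_sem); word_embeddings: (B, N_s, d_word)
        B = v.size(0)
        a = self.sem_proj(a_iv)
        h1 = torch.zeros(B, self.lstm1.hidden_size, device=v.device)
        c1, h2, c2 = torch.zeros_like(h1), torch.zeros_like(h1), torch.zeros_like(h1)
        # t = -1: x_{-1} = f_1(T_v v) + A_iv  (formula (8))
        h1, c1 = self.lstm1(self.T_v(v), (h1, c1))
        h2, c2 = self.lstm2(h1 + a, (h2, c2))
        logits = []
        for t in range(word_embeddings.size(1)):                              # t = 0 .. N_s - 1
            h1, c1 = self.lstm1(self.T_s(word_embeddings[:, t]), (h1, c1))    # formula (9)
            h2, c2 = self.lstm2(h1 + a, (h2, c2))                             # formula (10)
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)  # used to evaluate -log P(S | v, A_iv) of formula (6)
```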
Compared with the prior art, the present invention constructs a new video description model. The model uses the deep domain adaptation method from transfer learning to deeply fuse the semantic features of different domains at the input end, thereby improving the accuracy of the generated video descriptions. Experiments on the MSVD video dataset demonstrate the feasibility and effectiveness of the invention, and show that the deep domain adaptation method achieves a better fusion of the semantic features of different domains, further improves the accuracy of the video descriptions, and improves the generalization ability of the network.
Description of the drawings
Fig. 1 shows the convolutional-neural-network video representation model;
Fig. 2 shows the image semantic feature detection model;
Fig. 3 shows the frame-stream semantic feature detection model of the present invention;
Fig. 4 shows the structure of the video description framework;
Fig. 5 shows some results of the present invention on the test dataset.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below.
A video description method based on deep transfer learning comprises the following steps.
1) Represent the video as a vector with the convolutional-neural-network video representation model; the model structure is shown in Fig. 1.
In step 1), video representation is carried out with a convolutional neural network model. For a group of sampled frames of the video, each frame is fed into the convolutional neural network, the output of the second fully connected layer is extracted, and mean pooling is then performed over all sampled frames, so that the video segment is represented as a single n-dimensional vector.
2) Build the image semantic feature detection model with multiple-instance learning to extract image-domain semantic features. The image semantic feature detection model is shown in Fig. 2.
Specifically:
for a semantic feature w_a, if w_a appears in the annotated text description of image I, image I is treated as a positive bag; otherwise image I is treated as a negative bag. Each bag is first fed into the image semantic feature detection model (shown in Fig. 2), and the probability that bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
Here p_{w_a}^{r_i} is the probability of feature w_a predicted from region r_i; it is computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network. The output activation of the last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so each bag yields an x × x feature map. The model is then optimized with a cross-entropy loss layer. Finally, the image semantic feature detection model trained on the image captioning dataset is used to compute, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the distributions of all sampled frames gives the final representation of the semantic features learned from images.
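As an illustration, a minimal sketch of this multiple-instance detector head is given below, assuming a fully convolutional backbone whose last feature map has h channels over an x by x grid and assuming the noisy-OR bag pooling reconstructed as formula (1) above; the channel count and feature-vocabulary size are illustrative.

```python
import torch
import torch.nn as nn

class MILSemanticDetectorHead(nn.Module):
    """Sketch of the image semantic feature detection head: a sigmoid layer after the
    last convolutional layer gives per-region probabilities, and pooling over the
    regions of a bag gives the bag-level probability of each semantic feature."""
    def __init__(self, h=512, num_semantic_features=1000):
        super().__init__()
        # 1x1 convolution acting on the final x*x*h feature map of the FCN backbone
        self.region_scores = nn.Conv2d(h, num_semantic_features, kernel_size=1)

    def forward(self, feature_map):
        # feature_map: (B, h, x, x) output of the fully convolutional backbone
        p_region = torch.sigmoid(self.region_scores(feature_map))   # per-region probabilities
        p_region = p_region.flatten(2)                              # regions r_i of each bag
        # noisy-OR pooling: P(w_a | b_I) = 1 - prod_i (1 - p_i)
        p_bag = 1.0 - torch.prod(1.0 - p_region, dim=2)             # (B, num_semantic_features)
        return p_bag

# the model is optimized with a cross-entropy loss on the bag-level probabilities, e.g.
# loss = nn.functional.binary_cross_entropy(p_bag, bag_labels)
```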
3) Transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain the frame-stream semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features. The frame-stream semantic feature detection model is shown in Fig. 3.
The domain formed by the image samples is called the source domain and the domain formed by the frame-stream samples is called the target domain. The final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic feature y.
Specifically:
during training, for each input x, in addition to the semantic feature to be predicted, a domain label d must also be predicted; d = 0 means x comes from the source domain, and d = 1 means x comes from the target domain. The semantic feature detection model can be decomposed into three parts and works as follows. First, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f. Then a mapping G_y maps the feature vector f to the semantic feature y, with parameter vector θ_y. Finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
In the training stage, the semantic feature detection model must satisfy three requirements on its parameters:
(1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the semantic feature detection model is not distorted on the source domain;
(2) find the feature-mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible (i.e. obtaining domain-invariant features) amounts to maximizing the domain classifier loss;
(3) find the domain classifier parameters θ_d that minimize the domain classifier loss. Parameters satisfying these three requirements form a point (θ_f, θ_y, θ_d), called a saddle point, and the whole training process can be expressed as formula (2):
Here L_y(·,·) is the loss of semantic feature prediction and L_d(·,·) is the loss of domain classification; L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample; the parameter λ balances the feature vectors of the two domains formed during training. The saddle point (θ_f, θ_y, θ_d) can thus be solved from formula (2); it is searched for with the updates shown in formulas (3), (4) and (5):
Here μ is the learning rate. During back-propagation, the gradient taken from the next layer in formula (3) is multiplied by -λ and passed to the previous layer; this part is the gradient reversal layer. The semantic feature detection model consists mainly of a feature extractor, a gradient reversal layer and a domain classifier. After training, the semantic feature predictor predicts the semantic features of samples from both the target domain and the source domain. The semantic features obtained with the improved semantic feature detection model can be used directly as the input of the video description framework; this semantic feature is denoted A_iv.
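As an illustration, a minimal sketch of this arrangement of feature extractor, gradient reversal layer and domain classifier is given below, reusing the grad_reverse function sketched in the summary above; the fully connected heads and their sizes are assumptions.

```python
import torch
import torch.nn as nn

class FrameStreamSemanticDetector(nn.Module):
    """Sketch of the frame-stream semantic feature detection model: feature extractor G_f,
    semantic feature predictor G_y, and domain classifier G_d behind the gradient
    reversal layer, so that training makes the features of the two domains similar."""
    def __init__(self, backbone, d_feature, num_semantic_features=1000, grl_lambda=1.0):
        super().__init__()
        self.backbone = backbone                                           # G_f: x -> f in R^D
        self.semantic_head = nn.Linear(d_feature, num_semantic_features)   # G_y
        self.domain_head = nn.Linear(d_feature, 2)                         # G_d: source vs target
        self.grl_lambda = grl_lambda

    def forward(self, x):
        f = self.backbone(x)                                    # shared feature vector f
        y = torch.sigmoid(self.semantic_head(f))                # semantic feature prediction
        # grad_reverse: the gradient reversal layer sketched earlier
        d = self.domain_head(grad_reverse(f, self.grl_lambda))  # domain label prediction
        return y, d
```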
4) Build the deep-transfer-learning video description framework and generate the natural-language description of the video.
This comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional-neural-network video representation model, and feed it into the first layer of the recurrent neural network only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the frames of the given video into individual images and feed them sequentially into the frame-stream semantic feature detection model;
(4) treat the frames of the given video as a frame stream and feed them into the frame-stream semantic feature detection model in parallel;
(5) obtain the fused semantic feature A_iv with the frame-stream semantic feature detection model and feed A_iv into the second layer of the recurrent neural network;
(6) feed the English description of the given video word by word into the first layer of the recurrent neural network and, combined with the inputs of the preceding steps, predict the output word of the next time step from the input words of the current and previous time steps; the video description framework is trained in this way. The structure of the video description framework is shown in Fig. 4.
The model structure represented by the whole framework is described by formulas (6) and (7):
E(v, A_iv, S) = -log P(S | v, A_iv)  (6)
Here v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation and N_s is the number of words in the sentence. The final objective is to minimize the energy loss function while preserving the contextual relations between the words of the sentence.
In the framework, the video v is fed into the first-layer recurrent neural network unit only at time t = -1; A_iv is then fed into the second-layer recurrent neural network unit as an additional input at every iteration, which reinforces the semantic information, as shown in formulas (8), (9) and (10), where t iterates from 0 to N_s - 1:
x_{-1} = f_1(T_v v) + A_iv  (8)
x_t = f_1(T_s w_t) + A_iv  (9)
h_t = f_2(x_t)  (10)
Here T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively, D_e is the dimension of the recurrent neural network input, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second-layer recurrent neural network unit respectively, and f_1 and f_2 are the mapping functions of the first-layer and second-layer recurrent neural network units respectively.
Experiments and analysis of results
Dataset:
To evaluate the video description model of the present invention, MSVD, the most popular video description dataset collected from YouTube, is selected. MSVD contains 1970 video clips collected from YouTube, each with about 40 available English descriptions. In the experiments, 1200 videos are used for training, 100 for validation and 670 for testing. In addition, the image dataset COCO is also used.
Evaluation metrics:
To quantitatively evaluate the proposed video description framework, three metrics commonly used in video description tasks are adopted: BLEU@N (BiLingual Evaluation Understudy), METEOR and CIDEr-D (Consensus-based Image Description Evaluation). For BLEU@N, N is set to 3 and 4. All metrics are computed with the code released by the Microsoft COCO evaluation server. The results of the three metrics are percentages; higher scores indicate that the generated video descriptions are closer to the reference descriptions.
Experimental setup:
In the present invention, 25 frames are uniformly sampled from each video, and each word in a sentence is represented as a one-hot vector. For the video representation, VGG19 is pre-trained on the ImageNet ILSVRC12 dataset and the model shown in Fig. 1 is then fine-tuned on MSVD. To represent the fused semantic features extracted from the two domains, the 1000 most common words are selected from the COCO image dataset and from the MSVD video dataset respectively as the annotated semantic features of the two domains [4], and serve as the training data of the two models of Fig. 2 and Fig. 3. The model of Fig. 2 is first trained on the COCO training set, and the model of Fig. 3 is then trained on the two training sets of COCO and MSVD, producing final 1000-dimensional probability vectors. In the LSTM, the dimensions of the input and of the hidden layer are both set to 1024. In the test phase, a beam search strategy is used with the model trained as in Fig. 4 to generate new video sentence descriptions, with the beam size set to 4.
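As an illustration only, a minimal sketch of this preprocessing is given below, assuming OpenCV for the uniform 25-frame sampling and a vocabulary built from the 1000 most common annotation words; function names and paths are illustrative.

```python
import cv2
import numpy as np
from collections import Counter

def sample_frames(video_path, num_frames=25):
    """Uniformly sample num_frames frames from one video, as in the experimental setup."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def build_vocab(sentences, size=1000):
    """Keep the most common words as the annotated semantic features / one-hot vocabulary."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(size))}

def one_hot(word, vocab):
    """Represent a word as a one-hot vector over the vocabulary."""
    v = np.zeros(len(vocab), dtype=np.float32)
    if word in vocab:
        v[vocab[word]] = 1.0
    return v
```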
Quantitative analysis:
Table 1 compares, on the MSVD test set, the scores of the video description model proposed herein with those of seven existing models on each evaluation metric. Simulation results differ between machines of different configurations; the data listed in the table were obtained on the same machine for reference.
Table 1. Score comparison of each model
Models 1-4 in the table use attention-based methods and do not introduce semantic features; models 5 and 6 use the semantic features of a single domain only; model 7 uses the semantic features of both domains but fuses them with a simple linear fusion. Comparative analysis of the data in the table shows that the video description model proposed herein obtains higher scores on the four evaluation metrics. It follows that: 1) using high-level semantic features in the video description framework enhances the visual representation and helps the model learn video descriptions; 2) using the semantic features of a single domain only (image domain or frame-stream domain) does not noticeably improve video description performance; 3) a simple linear fusion of the semantic features of the two domains improves the individual video description metrics but is still deficient and needs improvement; 4) the fused semantic features obtained with the deep domain adaptation method from transfer learning significantly improve video description performance, i.e. the present invention performs better at semantic feature fusion.
Qualitative analysis:
Fig. 5 shows some results of the proposed video description model on the test dataset.
The sample frames in the figure are partial frames of each test video. These examples show that, compared with the well-performing LSTM-TSA_IV model, the video description framework proposed herein generates more accurate English descriptions of the test videos.

Claims (7)

1. A video description method based on deep transfer learning, characterized in that it comprises the following steps:
1) represent the video as a vector with a convolutional-neural-network video representation model;
2) build an image semantic feature detection model with multiple-instance learning to extract image-domain semantic features;
3) transfer the image semantic feature detection model of step 2) to the frame-stream domain to obtain a new semantic feature detection model, extract frame-stream semantic features, and achieve deep fusion of image-domain and frame-stream-domain semantic features;
4) build a deep-transfer-learning video description framework and generate a natural-language description of the video.
2. The video description method based on deep transfer learning according to claim 1, characterized in that: in step 1), video representation is carried out with a convolutional neural network model; for a group of sampled frames of the video, each frame is fed into the convolutional neural network, the output of the second fully connected layer is extracted, and mean pooling is then performed over all sampled frames, so that the video segment is represented as a single n-dimensional vector.
3. The video description method based on deep transfer learning according to claim 2, characterized in that in step 2):
specifically:
for a semantic feature w_a, if w_a appears in the annotated text description of image I, image I is treated as a positive bag; otherwise image I is treated as a negative bag; each bag is first fed into the image semantic feature detection model, and the probability that bag b_I contains the semantic feature w_a is then computed from the probabilities of all regions in the bag, as shown in formula (1):
where p_{w_a}^{r_i} is the probability of feature w_a predicted from region r_i and is computed by a sigmoid layer placed after the last convolutional layer of the fully convolutional network; the output activation of the last convolutional layer has dimension x × x × h, where h is the representation dimension of each region in the bag, so each bag yields an x × x feature map; the model is then optimized with a cross-entropy loss layer; finally, the image semantic feature detection model trained on the image captioning dataset is used to compute, for each individual sampled frame, the probability distribution over all semantic features, and mean pooling over the distributions of all sampled frames gives the final representation of the semantic features learned from images.
4. The video description method based on deep transfer learning according to claim 3, characterized in that: in step 3), the domain formed by the image samples is called the source domain, the domain formed by the frame-stream samples is called the target domain, and the final goal of the model is: for the distribution of the target domain, given an input x, predict its semantic feature y;
specifically:
during training, for each input x, in addition to the semantic feature to be predicted, a domain label d must also be predicted; if d = 0, x comes from the source domain; if d = 1, x comes from the target domain; the frame-stream semantic feature detection model can be decomposed into three parts and works as follows: first, a mapping G_f maps the input x to a D-dimensional feature vector f ∈ R^D, with parameter vector θ_f; then a mapping G_y maps the feature vector f to the semantic feature y, with parameter vector θ_y; finally, a mapping G_d maps the same feature vector f to the domain label d, with parameter vector θ_d.
5. The video description method based on deep transfer learning according to claim 4, characterized in that:
during training, the frame-stream semantic feature detection model must satisfy three requirements on its parameters:
(1) find the parameters θ_y that minimize the loss of the semantic feature predictor on the source domain, ensuring that the frame-stream semantic feature detection model is not distorted on the source domain;
(2) find the feature-mapping parameters θ_f such that the features S_f extracted by the mapping G_f on the source domain are similar to the features T_f extracted on the target domain; the similarity of the distributions of S_f and T_f is estimated through the loss of the domain classifier G_d, and making the two feature distributions as similar as possible (obtaining domain-invariant features) amounts to maximizing the domain classifier loss;
(3) find the domain classifier parameters θ_d that minimize the domain classifier loss; parameters satisfying these three requirements form a point (θ_f, θ_y, θ_d), called a saddle point, and the whole training process can be expressed as formula (2):
where L_y(·,·) is the loss of semantic feature prediction, L_d(·,·) is the loss of domain classification, L_y^i and L_d^i denote the corresponding losses evaluated on the i-th training sample, and the parameter λ balances the feature vectors of the two domains formed during training; the saddle point (θ_f, θ_y, θ_d) can thus be solved from formula (2), and it is searched for with the updates shown in formulas (3), (4) and (5);
where μ is the learning rate; during back-propagation, the gradient taken from the next layer in formula (3) is multiplied by -λ and passed to the previous layer, and this part is the gradient reversal layer; the semantic feature detection model comprises a feature extractor, a gradient reversal layer and a domain classifier; the feature extractor extracts the semantic features of the frame-stream domain, and the domain classifier combined with the gradient reversal layer fuses the image-domain and frame-stream-domain semantic features; after training, the frame-stream semantic feature predictor predicts the semantic features of samples from both the target domain and the source domain; the semantic features obtained with the frame-stream semantic feature detection model can be used directly as the input of the video description framework, and this semantic feature is denoted A_iv.
6. The video description method based on deep transfer learning according to claim 5, characterized in that step 4) comprises the following steps:
(1) obtain the vector representation v of the given video with the convolutional-neural-network video representation model, and feed it into the first layer of the recurrent neural network only at the initial time step;
(2) train the image semantic feature detection model on the image dataset;
(3) split the frames of the given video into individual images and feed them sequentially into the frame-stream semantic feature detection model;
(4) treat the frames of the given video as a frame stream and feed them into the frame-stream semantic feature detection model in parallel;
(5) obtain the fused semantic feature A_iv with the frame-stream semantic feature detection model and feed A_iv into the second layer of the recurrent neural network;
(6) feed the English description of the given video word by word into the first layer of the recurrent neural network and, combined with the inputs of the preceding steps, predict the output word of the next time step from the input words of the current and previous time steps; the video description framework is trained in this way.
7. The video description method based on deep transfer learning according to claim 6, characterized in that in step 4):
the model structure represented by the whole framework is described by formulas (6) and (7),
E(v, A_iv, S) = -log P(S | v, A_iv)  (6)
where v is the input video, A_iv is the fused semantic feature, S is the sentence description, E is the energy loss function, w_t is the word representation and N_s is the number of words in the sentence; the final objective is to minimize the energy loss function while preserving the contextual relations between the words of the sentence;
in the framework, the video v is fed into the first-layer recurrent neural network unit only at time t = -1, and A_iv is then fed into the second-layer recurrent neural network unit as an additional input at every iteration, which reinforces the semantic information, as shown in formulas (8), (9) and (10), where t iterates from 0 to N_s - 1:
x_{-1} = f_1(T_v v) + A_iv  (8)
x_t = f_1(T_s w_t) + A_iv  (9)
h_t = f_2(x_t)  (10)
where T_v ∈ R^{D_e×D_v} and T_s ∈ R^{D_e×D_w} are the transformation matrices of the video v and of w_t respectively, D_e is the dimension of the recurrent neural network input, D_v is the dimension of the video v, D_w is the dimension of w_t, x_t and h_t are the input and output of the second-layer recurrent neural network unit respectively, and f_1 and f_2 are the mapping functions of the first-layer and second-layer recurrent neural network units respectively.
CN201810465849.4A 2018-03-22 2018-05-15 Video description method based on deep migration learning Active CN108765383B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018102507521 2018-03-22
CN201810250752 2018-03-22

Publications (2)

Publication Number Publication Date
CN108765383A true CN108765383A (en) 2018-11-06
CN108765383B CN108765383B (en) 2022-03-18

Family

ID=64008024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465849.4A Active CN108765383B (en) 2018-03-22 2018-05-15 Video description method based on deep migration learning

Country Status (1)

Country Link
CN (1) CN108765383B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282747A1 (en) * 2012-04-23 2013-10-24 Sri International Classification, search, and retrieval of complex video events
CN104915400A (en) * 2015-05-29 2015-09-16 山西大学 Fuzzy correlation synchronized image retrieval method based on color histogram and non-subsampled contourlet transform (NSCT)
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105976401A (en) * 2016-05-20 2016-09-28 河北工业职业技术学院 Target tracking method and system based on partitioned multi-example learning algorithm
CN106202256A (en) * 2016-06-29 2016-12-07 西安电子科技大学 Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GANIN Y et al.: "Unsupervised Domain Adaptation by Backpropagation", ICML '15: Proceedings of the 32nd International Conference on Machine Learning *
HASSAN ALAM et al.: "Multi-lingual author identification and linguistic feature extraction - A machine learning approach", 2013 IEEE International Conference on Technologies for Homeland Security (HST) *
Q YOU et al.: "Image Captioning with Semantic Attention", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
刘宇鹏 et al.: "Deep fusion of large-scale features in statistical machine translation", Journal of Zhejiang University *
惠开发 et al.: "Research on video multi-concept detection based on multi-kernel attribute learning", Software Guide *
易文晟: "Research on image semantic retrieval and classification techniques", China Doctoral Dissertations Full-text Database (Information Science and Technology) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435453A (en) * 2019-01-14 2020-07-21 中国科学技术大学 Fine-grained image zero sample identification method
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN111464881A (en) * 2019-01-18 2020-07-28 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110166850A (en) * 2019-05-30 2019-08-23 上海交通大学 The method and system of multiple CNN neural network forecast panoramic video viewing location
CN110166850B (en) * 2019-05-30 2020-11-06 上海交通大学 Method and system for predicting panoramic video watching position by multiple CNN networks
CN110363164A (en) * 2019-07-18 2019-10-22 南京工业大学 A kind of unified approach based on LSTM time consistency video analysis
CN110909736A (en) * 2019-11-12 2020-03-24 北京工业大学 Image description method based on long-short term memory model and target detection algorithm
CN111988673A (en) * 2020-07-31 2020-11-24 清华大学 Video description statement generation method and related equipment
CN111988673B (en) * 2020-07-31 2023-05-23 清华大学 Method and related equipment for generating video description sentences
CN113177478A (en) * 2021-04-29 2021-07-27 西华大学 Short video semantic annotation method based on transfer learning

Also Published As

Publication number Publication date
CN108765383B (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN108765383A (en) Video presentation method based on depth migration study
Huang et al. Facial expression recognition with grid-wise attention and visual transformer
CN110750959B (en) Text information processing method, model training method and related device
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
CN105183720B (en) Machine translation method and device based on RNN model
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
CN107480132A (en) A kind of classic poetry generation method of image content-based
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN106202044A (en) A kind of entity relation extraction method based on deep neural network
Ding et al. Progressive multimodal interaction network for referring video object segmentation
Liu et al. Video captioning with listwise supervision
Shen et al. Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description.
CN109947923A (en) A kind of elementary mathematics topic type extraction method and system based on term vector
CN111582506A (en) Multi-label learning method based on global and local label relation
CN107391565A (en) A kind of across language hierarchy taxonomic hierarchies matching process based on topic model
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Yang et al. Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system
Sheng et al. Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos
Wang et al. RETRACTED ARTICLE: Human behaviour recognition and monitoring based on deep convolutional neural networks
CN106709277A (en) Text-mining-based vector generating method of G-protein coupled receptor drug target molecules
Chen et al. Multi-modal feature fusion based on variational autoencoder for visual question answering
Xu et al. MMT: Mixed-Mask Transformer for Remote Sensing Image Semantic Segmentation
Mi et al. Multiple Domain-Adversarial Ensemble Learning for Domain Generalization
Vecchi et al. Transferring multiple text styles using CycleGAN with supervised style latent space
CN116721176B (en) Text-to-face image generation method and device based on CLIP supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant