CN111985612B - Encoder network model design method for improving video text description accuracy - Google Patents

Encoder network model design method for improving video text description accuracy

Info

Publication number
CN111985612B
CN111985612B (application CN202010706821.2A)
Authority
CN
China
Prior art keywords
semantic
video
training
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010706821.2A
Other languages
Chinese (zh)
Other versions
CN111985612A (en)
Inventor
朱虹
熊鸽
潘晓容
杨恺庆
刘晶晶
杜森
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202010706821.2A priority Critical patent/CN111985612B/en
Publication of CN111985612A publication Critical patent/CN111985612A/en
Application granted granted Critical
Publication of CN111985612B publication Critical patent/CN111985612B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7844 - Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7847 - Retrieval using low-level visual features of the video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for designing an encoder network model that improves the accuracy of video text description, comprising the following steps: step 1, constructing a visual feature extraction encoder for the video; step 2, constructing a semantic feature extraction encoder for the video; step 3, obtaining semantic features; and step 4, training an S-LSTM network model. In the encoding network model, the characteristics of the video are exploited to extract more accurate semantic information, and the differences between semantic words are amplified, so that more accurate semantic features are obtained. After the semantic features are fed into the decoding network, a text description of the video is generated. Compared with the algorithm indices published in the mainstream papers retrieved to date, the accuracy of the video text description is significantly improved.

Description

Encoder network model design method for improving video text description accuracy
Technical Field
The invention belongs to the technical field of video text description algorithms, and relates to an encoder network model design method for improving video text description accuracy.
Background
A video text description algorithm automatically outputs a natural-language description of the content of a given video. Such algorithms have important significance and wide practical application. For example, faced with massive amounts of video data, video text description can be used to rapidly analyze the videos a user clicks on, so that personalized services can be provided for the user; the text descriptions generated by a video text description algorithm can also be used to intelligently review short videos uploaded by users. In addition, video text description has important applications in early-childhood education assistance, video retrieval, helping people with low vision obtain information more easily, and so on.
In the process of video text description, the video needs to be converted into text for output, so accurately extracting the semantic information contained in the video plays an important role. Accurate semantic information is a precondition for outputting a video text description, and this part of the work is completed in the encoder of the model. However, the prior art suffers from inaccurate output information and low output speed in this respect.
Disclosure of Invention
The invention aims to provide a method for designing an encoder network model that improves the accuracy of video text description, solving the problem in the prior art that inaccurate semantic information extraction during video text description leads to inaccurate output text descriptions.
The technical scheme adopted by the invention is an encoder network model design method for improving video text description accuracy, implemented according to the following steps:
step 1, constructing a visual feature extraction encoder for the video;
step 2, constructing a semantic feature extraction encoder for the video;
step 3, obtaining the semantic features;
step 4, training an S-LSTM network model.
The method has the advantage that, in the encoding network model, the characteristics of the video are exploited to extract more accurate semantic information, and the differences between semantic words are amplified, so that more accurate semantic features are obtained; after the semantic features are fed into the decoding network, a text description of the video is generated. Compared with the algorithm indices published in the mainstream papers retrieved to date, the accuracy of the video text description is significantly improved.
Drawings
FIG. 1 is a block diagram of the overall architecture of a video semantic extraction encoding network model of the method of the present invention;
FIG. 2 is a structural flow chart of the Highway Layer module in the video semantic extraction encoding network model of the method of the present invention;
FIG. 3 is a structural flow chart of the amplified word difference module in the video semantic extraction encoding network model of the method of the present invention;
FIG. 4 is a structural flow chart of the decoding network model employed by the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention relates to a design method of an encoder network model for improving the accuracy of video text description, which is implemented according to the following steps:
step 1, constructing a visual feature extraction encoder for the video, the specific process being as follows,
1.1) establishing a training data set,
training a deep-learning network requires a data set consisting of a large number of labeled samples; since labeling the data oneself has certain limitations and involves an enormous workload, published data sets are selected as the training-set samples in this step;
in this embodiment, the video samples of the published MSVD data set and the visual features of a number of videos from the MSRVTT data set are selected as the samples of the training set for the semantic feature extraction network; within the MSRVTT data set, after randomly selecting data samples for the training set, the remaining video samples are used as the samples of the validation set;
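As a minimal illustration of the data split described above (a sketch only: the patent does not specify the split ratio or how samples are identified, so the 80/20 fraction and the integer ids below are assumptions), the training/validation partition of the MSRVTT samples could be formed as follows:

```python
import random

def split_msrvtt(video_ids, train_fraction=0.8, seed=0):
    """Randomly split MSRVTT video ids into training and validation subsets.

    The 80/20 ratio and integer ids are illustrative assumptions; the patent
    only states that training samples are drawn at random and the remaining
    videos form the validation set.
    """
    rng = random.Random(seed)
    ids = list(video_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Example: 10,000 MSRVTT clips indexed 0..9999
train_ids, val_ids = split_msrvtt(range(10000))
```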
1.2) establishing a semantic dictionary,
the K most commonly used words are selected from the labels of the training-set samples and the validation-set samples to form a semantic concept set (K is an empirical value chosen according to the text description domain; the preferred range of K is [250, 400]);
assuming the total number of training-set samples is N, for the i-th video sample, i = 1, 2, ..., N, semantic dictionary labeling is performed on the training-set samples with the selected K words, the semantic attribute label being computed as:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,K}],  i = 1, 2, ..., N    (1)
where y_{i,k} = 1 if the k-th word of the semantic dictionary appears in the labels of the i-th video sample, and y_{i,k} = 0 otherwise    (2)
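A minimal sketch of the dictionary construction and the multi-hot labeling of equations (1) and (2), assuming each sample's annotations are available as lists of tokenized caption words (the function and variable names are illustrative, not taken from the patent):

```python
from collections import Counter

def build_semantic_dictionary(all_caption_tokens, K=300):
    """Select the K most frequently used words over all training/validation captions."""
    counts = Counter(w for tokens in all_caption_tokens for w in tokens)
    return [w for w, _ in counts.most_common(K)]

def semantic_attribute_label(sample_tokens, dictionary):
    """Multi-hot label Y_i: y_{i,k} = 1 if the k-th dictionary word occurs in the sample's captions."""
    present = set(sample_tokens)
    return [1 if w in present else 0 for w in dictionary]

# dictionary = build_semantic_dictionary(train_and_val_tokens, K=300)
# Y_i = semantic_attribute_label(tokens_of_sample_i, dictionary)
```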
1.3) constructing a visual feature extraction encoder,
the basic model structure of the visual feature extraction encoder constructed in this step consists of the two-dimensional convolutional network ResNeXt and the three-dimensional convolutional network ECO (both convolutional networks are prior art and can be found in published papers and professional books); the M_1-dimensional feature vector output by the conv5/block3 pooling layer of the two-dimensional convolutional network ResNeXt is taken to describe the two-dimensional visual characteristics of the video, where, according to the ResNeXt network structure, M_1 is preferably 2048; the M_2-dimensional feature vector output by the global pooling layer of the three-dimensional convolutional network ECO is taken to describe the three-dimensional visual characteristics of the video, where, according to the ECO network structure, M_2 is preferably 1536; the two feature vectors are concatenated and used as the visual features of the video, so the visual feature vector corresponding to each video has M = M_1 + M_2 dimensions, and the visual feature expression of the N training-set samples is as shown in equation (3):
X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}],  i = 1, 2, ..., N    (3)
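A sketch of the concatenation in equation (3), assuming the ResNeXt conv5/block3 pooled feature (M_1 = 2048) and the ECO global-pooled feature (M_2 = 1536) have already been extracted for a video; the extractor calls themselves are omitted, since the patent relies on the standard published networks:

```python
import torch

def video_visual_feature(resnext_feat: torch.Tensor, eco_feat: torch.Tensor) -> torch.Tensor:
    """Concatenate the 2-D (ResNeXt, 2048-d) and 3-D (ECO, 1536-d) features into X_i (M = 3584)."""
    assert resnext_feat.shape[-1] == 2048 and eco_feat.shape[-1] == 1536
    return torch.cat([resnext_feat, eco_feat], dim=-1)

# x_i = video_visual_feature(torch.randn(2048), torch.randn(1536))  # x_i.shape == (3584,)
```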
step 2, constructing a semantic feature extraction encoder for the video, the specific process being as follows,
2.1) designing the structure of the video semantic feature extraction network,
the structure of the video semantic feature extraction network is shown in FIG. 1; the visual features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are taken as input and passed through two FC layers (namely FC1 and FC2) to reduce the dimension, yielding an M_s-dimensional feature vector (M_s is an empirical value, preferably M_s = 512) denoted H_i; H_i is then passed through a Highway Layer module and an FC3 layer, and an M_c-dimensional feature vector (M_c is an empirical value, preferably M_c = 300), denoted R_i, is obtained through a Sigmoid activation operation; finally, an M_c-dimensional semantic feature vector, denoted Z_i, is obtained through the amplified word difference module;
2.1.1) constructing the Highway Layer module,
the visual feature X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}], i = 1, 2, ..., N, obtained in step 1 generally contains a large amount of redundant information; therefore, in the structure shown in FIG. 1, in addition to the dimension reduction of the two fully connected layers, this step strengthens the encoding of the visual features through a Highway Layer module, so that the obtained visual features of the video are more accurate; the Highway Layer module is described in detail below;
as shown in FIG. 1, the input of the Highway Layer module is the M_s-dimensional feature vector H_i obtained after the visual feature X_i passes through the FC1 and FC2 operations; as shown in FIG. 2, the Highway Layer module consists of two parts, a Transform gate and a Carry gate:
first, the Transform gate applies the FC4 fully connected operation to the input information H_i followed by a Sigmoid nonlinear transformation, yielding a feature vector denoted L_i, computed as:
L_i = σ(W_t · H_i + b_t),  i = 1, 2, ..., N    (4)
where σ denotes the Sigmoid activation function, and the parameters W_t and b_t are obtained by network training;
second, the output of the Carry gate is J_i = 1 - L_i, i = 1, 2, ..., N;
the input information H_i is also passed through an FC5 nonlinear transformation, and the resulting feature vector, denoted D_i, is computed as:
D_i = f_act(W_g · H_i + b_g),  i = 1, 2, ..., N    (5)
where f_act denotes the ReLU activation function, and the parameters W_g and b_g are obtained by network training;
using the Transform gate and the Carry gate, the Highway Layer module fuses the input feature H_i and the nonlinearly transformed feature D_i to obtain the output vector of the Highway Layer module, denoted U_i, computed as:
U_i = L_i · D_i + J_i · H_i,  i = 1, 2, ..., N    (6)
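A minimal PyTorch sketch of the Highway Layer of equations (4) to (6); the assumption that FC4 and FC5 keep the M_s dimension unchanged, and the class and variable names, are illustrative rather than taken from the patent:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway layer: U_i = L_i * D_i + (1 - L_i) * H_i, equations (4)-(6)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fc4 = nn.Linear(dim, dim)  # Transform gate, equation (4)
        self.fc5 = nn.Linear(dim, dim)  # nonlinear transform of the input, equation (5)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        l = torch.sigmoid(self.fc4(h))   # L_i, Transform gate output
        d = torch.relu(self.fc5(h))      # D_i, ReLU-transformed input
        return l * d + (1.0 - l) * h     # U_i, with Carry gate weight J_i = 1 - L_i

# u = HighwayLayer(512)(torch.randn(8, 512))  # a batch of 8 samples with M_s = 512
```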
2.1.2) constructing the amplified word difference module,
in order to accurately predict the semantic information contained in the video features, this step uses the semantic attribute labels as guidance to predict the words contained in the visual features of the video, obtaining video semantic features rich in semantic information. However, in the process of predicting words, if the differences between the probabilities of the predicted words are not large, the words used to represent the video features cannot express the semantic information contained in the video well. Therefore, this step adds an amplified word difference module to the video semantic extraction network to amplify the differences between semantic words, so as to obtain the words that are more important for expressing the video semantic information.
The structure of the amplified word difference module is shown in FIG. 3. Its input is the feature vector U_i output by the Highway Layer module in step 2.1.1); U_i is passed through a fully connected layer FC3 and activated by the Sigmoid function to obtain an M_c-dimensional feature vector denoted R_i; R_i is then passed through the two fully connected layers FC6 and FC7 to obtain two feature vectors with different expressions, denoted A_i and B_i respectively, computed as:
A_i = W_a · R_i + b_a,  i = 1, 2, ..., N    (7)
B_i = W_b · R_i + b_b,  i = 1, 2, ..., N    (8)
where the parameters W_a, W_b, b_a and b_b are obtained by network training;
the feature vector A_i and the feature vector B_i are multiplied element-wise to obtain the feature vector T_i, and the score of the word in each dimension, denoted Q_i, is obtained through the Sigmoid activation operation, computed as:
t_{i,ks} = a_{i,ks} · b_{i,ks},  i = 1, 2, ..., N,  ks = 1, 2, ..., M_c    (9)
q_{i,ks} = σ(t_{i,ks}),  i = 1, 2, ..., N,  ks = 1, 2, ..., M_c    (10)
where σ denotes the Sigmoid activation function;
the element-wise multiplication makes the dimensions with larger values in the feature even larger and those with smaller values even smaller, thereby amplifying the differences between words; the result after the Sigmoid function represents the score of each dimension of the word feature;
the word scores Q_i are multiplied element-wise with the input vector R_i to obtain the semantic feature vector Z_i, computed as:
z_{i,ks} = q_{i,ks} · r_{i,ks},  i = 1, 2, ..., N,  ks = 1, 2, ..., M_c    (11)
multiplying the scores Q_i with the corresponding elements of the word semantic features R_i plays the role of feature selection over the word feature dimensions and amplifies the differences between different word features;
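A sketch of the amplified word difference module of equations (7) to (11), together with the overall encoder path of FIG. 1 (FC1/FC2, Highway Layer, FC3 with Sigmoid, then the word difference module). It reuses the HighwayLayer sketch above; the width of FC1/FC2 and the absence of activations between them are assumptions, since the patent only states that the two layers reduce the dimension to M_s:

```python
import torch
import torch.nn as nn

class AmplifiedWordDifference(nn.Module):
    """Equations (7)-(11): amplify differences between word scores and select word features."""
    def __init__(self, mc: int = 300):
        super().__init__()
        self.fc6 = nn.Linear(mc, mc)
        self.fc7 = nn.Linear(mc, mc)

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        a = self.fc6(r)             # A_i, equation (7)
        b = self.fc7(r)             # B_i, equation (8)
        q = torch.sigmoid(a * b)    # word scores Q_i, equations (9)-(10)
        return q * r                # semantic features Z_i, equation (11)

class SemanticEncoder(nn.Module):
    """FIG. 1 pipeline: X_i (3584-d) -> H_i (512-d) -> U_i -> R_i (300-d) -> Z_i (300-d)."""
    def __init__(self, m: int = 3584, ms: int = 512, mc: int = 300):
        super().__init__()
        self.fc1 = nn.Linear(m, ms)
        self.fc2 = nn.Linear(ms, ms)
        self.highway = HighwayLayer(ms)   # defined in the previous sketch
        self.fc3 = nn.Linear(ms, mc)
        self.word_diff = AmplifiedWordDifference(mc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc2(self.fc1(x))         # H_i after FC1 and FC2
        u = self.highway(h)               # U_i
        r = torch.sigmoid(self.fc3(u))    # R_i after FC3 and Sigmoid
        return self.word_diff(r)          # Z_i

# z = SemanticEncoder()(torch.randn(8, 3584))  # z.shape == (8, 300)
```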
2.2) setting the parameters for training the network model,
the training process parameters are empirical values; the preferred values in this step are Epoch ∈ [500, 2000] and learning rate ∈ [0.0001, 0.0005]; the Batch size is selected according to the hardware used for the computation, with preferred values Batch size ∈ {64, 128, 256, 512};
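A training-loop sketch consistent with these parameter ranges; the patent does not specify the loss function or optimizer, so the multi-label binary cross-entropy against the semantic attribute labels Y_i of equation (1) (which assumes K = M_c) and the Adam optimizer used below are assumptions:

```python
import torch
import torch.nn as nn

def train_semantic_encoder(model, loader, epochs=1000, lr=3e-4, device="cpu"):
    """Fit the encoder to multi-hot semantic labels (the BCE objective is an assumption)."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()  # encoder output is already Sigmoid-scaled to [0, 1]
    for epoch in range(epochs):
        for x, y in loader:  # x: (B, M) visual features, y: (B, K) semantic labels
            x, y = x.to(device), y.to(device).float()
            loss = bce(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```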
step 3, obtaining the semantic features, the specific process being as follows,
3.1) selecting a plurality of semantic feature models,
when the semantic feature extraction network model of step 2 is trained, this step saves the models at several points of the iterative process, i.e., the models at iterations N_1, N_2, ..., N_d are saved; N_1, N_2, ..., N_d are all empirical values, preferably d = 5, N_1 = 100, N_2 = 200, N_3 = 400, N_4 = 800, N_5 = 1000;
3.2) obtaining the semantic features,
the video features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are fed into each of the d models obtained in step 3.1), yielding d semantic features, denoted Z_i(k_d) = [z_{i,1}(k_d), z_{i,2}(k_d), ..., z_{i,M_c}(k_d)], i = 1, 2, ..., N, k_d = 1, 2, ..., d; the mean average precision (mAP) of the semantic features Z_i(k_d) generated by the different models is computed to measure the accuracy of semantic feature generation, and finally the semantic feature with the highest mAP is selected as the semantic feature attribute corresponding to the video sample (i.e., one of the d = 5 results is selected for each training sample), denoted Z_i*;
the mAP calculation process can be found in the relevant professional literature; for convenience of explanation, it is listed below:
mAP = (1/K) Σ_{k=1}^{K} AP_k
where K is the number of words in the constructed semantic dictionary, i.e., the number of semantic labels, and AP_k is the index measuring the label prediction accuracy of the k-th word, computed as:
AP_k = (1/N_G) Σ_{j=1}^{N_G} P_{k,j}
where N_G is the number of decision-boundary settings under which the corresponding label is predicted correctly, and P_{k,j} is the label prediction precision for the k-th word under the j-th decision boundary, computed as:
P_{k,j} = TP_{k,j} / (TP_{k,j} + FP_{k,j})
where TP_{k,j} is the number of samples whose label for the k-th word is predicted correctly under the j-th decision boundary, and FP_{k,j} is the number of samples whose label for the k-th word is predicted incorrectly under the j-th decision boundary;
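A sketch of the checkpoint-selection step: per-word average precision is computed over the samples and averaged into mAP, and the checkpoint whose semantic features score highest is kept. The standard average-precision routine is used here as a stand-in for AP_k, and the helper names are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """mAP over the K semantic words.

    y_true:  (N, K) multi-hot ground-truth semantic labels Y_i.
    y_score: (N, K) predicted semantic features Z_i(k_d).
    """
    aps = []
    for k in range(y_true.shape[1]):
        if y_true[:, k].any():  # skip words that never occur, for which AP is undefined
            aps.append(average_precision_score(y_true[:, k], y_score[:, k]))
    return float(np.mean(aps))

def select_best_semantic_features(y_true, feature_sets):
    """Among the d saved checkpoints, keep the semantic features with the highest mAP."""
    maps = [mean_average_precision(y_true, z) for z in feature_sets]
    best = int(np.argmax(maps))
    return feature_sets[best], maps[best]
```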
step 4, training the S-LSTM network model, the specific process being as follows,
4.1) feature-encoding the words,
all preset labels of the training samples are taken as a corpus, and the pre-labeled captions in the corpus are segmented into words, yielding K_s words; the words are sorted by frequency of use from high to low, and the top K_w words are selected as the lexicon, K_w being an empirical value; the remaining K_s - K_w words are represented by <unk>, and the terminator of a sentence is represented by <eos>; each word in the dictionary is given an integer index starting from 0;
in this step, word feature encoding is implemented with the word-embedding pre-training model GloVe released by Stanford University (the model can be retrieved from published papers and professional books), i.e., the M_c-dimensional coding features of the K_w words are obtained, denoted s(k_c), k_c = 1, 2, ..., K_w;
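A sketch of the lexicon construction and the GloVe lookup; the default K_w, the GloVe file name glove.6B.300d.txt and the 300-dimensional variant are assumptions chosen to match M_c = 300 (pre-trained GloVe vectors are distributed as plain-text files with one word and its vector per line):

```python
import numpy as np
from collections import Counter

def build_lexicon(captions, K_w=10000):
    """Top K_w most frequent words, indexed from 0, plus the <unk> and <eos> tokens."""
    counts = Counter(w for caption in captions for w in caption.split())
    words = [w for w, _ in counts.most_common(K_w)]
    return {w: i for i, w in enumerate(words + ["<unk>", "<eos>"])}

def load_glove_embeddings(word2idx, glove_path="glove.6B.300d.txt", dim=300):
    """Look up each lexicon word in a GloVe text file; words not found stay zero-initialized."""
    emb = np.zeros((len(word2idx), dim), dtype=np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in word2idx:
                emb[word2idx[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return emb
```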
4.2) training the S-LSTM network model,
the S-LSTM network model is shown in FIG. 4; the training sample set is {X_i, Z_i*, S_i}, i = 1, 2, ..., N, where X_i is the visual feature of the video extracted in step 1, Z_i* is the video semantic feature obtained in step 3, and S_i = {s_{i,0}, s_{i,1}, ..., s_{i,t}} is a descriptive sentence of the training video, i.e., the word sequence composed of the word features obtained in step 4.1);
the training samples are input into the S-LSTM network model shown in FIG. 4, and the model is trained;
at this point, the trained encoding network is used to obtain the semantic features and visual features of the video, which are input into the decoding network S-LSTM to complete the text description of the video.
Compared with the algorithm indices published in the mainstream papers retrieved to date, the accuracy of the text description of the video is significantly improved: over multiple tests, the CIDEr score reaches 106.5 on the international standard test set MSVD and 54.0 on the standard test set MSRVTT.

Claims (1)

1. An encoder network model design method for improving video text description accuracy, characterized by comprising the following steps:
step 1, constructing a visual feature extraction encoder for the video, the specific process being as follows,
1.1) establishing a training data set,
the video samples of the published MSVD data set and the visual features of a number of videos from the MSRVTT data set are selected as the samples of the training set for the semantic feature extraction network; within the MSRVTT data set, after randomly selecting data samples for the training set, the remaining video samples are used as the samples of the validation set;
1.2) establishing a semantic dictionary,
the K most commonly used words are selected from the labels of the training-set samples and the validation-set samples to form a semantic concept set;
assuming the total number of training-set samples is N, for the i-th video sample, i = 1, 2, ..., N, semantic dictionary labeling is performed on the training-set samples with the selected K words, the semantic attribute label being computed as:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,K}],  i = 1, 2, ..., N    (1)
where y_{i,k} = 1 if the k-th word of the semantic dictionary appears in the labels of the i-th video sample, and y_{i,k} = 0 otherwise    (2)
1.3) constructing a visual feature extraction encoder,
the basic model structure of the visual feature extraction encoder constructed in this step consists of the two-dimensional convolutional network ResNeXt and the three-dimensional convolutional network ECO; the M_1-dimensional feature vector output by the conv5/block3 pooling layer of the two-dimensional convolutional network ResNeXt is taken to describe the two-dimensional visual characteristics of the video; the M_2-dimensional feature vector output by the global pooling layer of the three-dimensional convolutional network ECO is taken to describe the three-dimensional visual characteristics of the video; the two feature vectors are concatenated and used as the visual features of the video, so the visual feature vector corresponding to each video has M = M_1 + M_2 dimensions, and the visual feature expression of the N training-set samples is as shown in equation (3):
X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}],  i = 1, 2, ..., N;    (3)
step 2, constructing a semantic feature extraction encoder for the video, the specific process being as follows,
2.1) designing the structure of the video semantic feature extraction network,
the visual features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are taken as input and passed through the fully connected layers FC1 and FC2 to reduce the dimension, yielding an M_s-dimensional fully connected feature denoted H_i; H_i is then passed through a Highway Layer module and an FC3 layer, and an M_c-dimensional feature vector R_i is obtained through a Sigmoid activation operation; finally, an M_c-dimensional semantic feature vector, denoted Z_i, is obtained through the amplified word difference module;
2.1.1) constructing the Highway Layer module,
the input of the Highway Layer module is the M_s-dimensional feature vector H_i obtained after the visual feature X_i passes through the FC1 and FC2 operations; the Highway Layer module consists of two parts, a Transform gate and a Carry gate:
first, the Transform gate applies the FC4 fully connected operation to the input information H_i followed by a Sigmoid nonlinear transformation, yielding a feature vector denoted L_i, computed as:
L_i = σ(W_t · H_i + b_t),  i = 1, 2, ..., N    (4)
where σ denotes the Sigmoid activation function, and the parameters W_t and b_t are obtained by network training;
second, the output of the Carry gate is J_i = 1 - L_i, i = 1, 2, ..., N;
the input information H_i is also passed through an FC5 nonlinear transformation, and the resulting feature vector, denoted D_i, is computed as:
D_i = f_act(W_g · H_i + b_g),  i = 1, 2, ..., N    (5)
where f_act denotes the ReLU activation function, and the parameters W_g and b_g are obtained by network training;
using the Transform gate and the Carry gate, the Highway Layer module fuses the input feature H_i and the nonlinearly transformed feature D_i to obtain the output vector of the Highway Layer module, denoted U_i, computed as:
U_i = L_i · D_i + J_i · H_i,  i = 1, 2, ..., N;    (6)
2.1.2) constructing the amplified word difference module,
an amplified word difference module is added to the video semantic extraction network to amplify the differences between semantic words, so as to obtain the words that are more important for expressing the video semantic information;
step 3, obtaining the semantic features, the specific process being as follows,
3.1) selecting a plurality of semantic feature models,
when the semantic feature extraction network model of step 2 is trained, this step saves the models at several points of the iterative process, i.e., the models at iterations N_1, N_2, ..., N_d are saved; N_1, N_2, ..., N_d are all empirical values, where d = 5, N_1 = 100, N_2 = 200, N_3 = 400, N_4 = 800, N_5 = 1000;
3.2) obtaining the semantic features,
the video features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are fed into each of the d models obtained in step 3.1), yielding d semantic features, denoted Z_i(k_d) = [z_{i,1}(k_d), z_{i,2}(k_d), ..., z_{i,M_c}(k_d)], i = 1, 2, ..., N, k_d = 1, 2, ..., d; the mean average precision mAP of the semantic features Z_i(k_d) generated by the different models is computed to measure the accuracy of semantic feature generation, and finally the semantic feature with the highest mAP is selected as the semantic feature attribute corresponding to the video sample, denoted Z_i*;
the mAP calculation process is as follows:
mAP = (1/K) Σ_{k=1}^{K} AP_k
where K is the number of words in the constructed semantic dictionary, i.e., the number of semantic labels, and AP_k is the index measuring the label prediction accuracy of the k-th word, computed as:
AP_k = (1/N_G) Σ_{j=1}^{N_G} P_{k,j}
where N_G is the number of decision-boundary settings under which the corresponding label is predicted correctly, and P_{k,j} is the label prediction precision for the k-th word under the j-th decision boundary, computed as:
P_{k,j} = TP_{k,j} / (TP_{k,j} + FP_{k,j})
where TP_{k,j} is the number of samples whose label for the k-th word is predicted correctly under the j-th decision boundary, and FP_{k,j} is the number of samples whose label for the k-th word is predicted incorrectly under the j-th decision boundary;
step 4, training the S-LSTM network model, the specific process being as follows,
4.1) feature-encoding the words,
all preset labels of the training samples are taken as a corpus, and the pre-labeled captions in the corpus are segmented into words, yielding K_s words; the words are sorted by frequency of use from high to low, and the top K_w words are selected as the lexicon, K_w being an empirical value; the remaining K_s - K_w words are represented by <unk>, and the terminator of a sentence is represented by <eos>; each word in the dictionary is given an integer index starting from 0;
the feature encoding of the words is implemented with the word-embedding pre-training model GloVe released by Stanford University, i.e., the M_c-dimensional coding features of the K_w words are obtained, denoted s(k_c), k_c = 1, 2, ..., K_w;
4.2) training the S-LSTM network model,
in the S-LSTM network model, the training sample set is {X_i, Z_i*, S_i}, i = 1, 2, ..., N, where X_i is the visual feature of the video extracted in step 1, Z_i* is the video semantic feature obtained in step 3, and S_i = {s_{i,0}, s_{i,1}, ..., s_{i,t}} is a descriptive sentence of the training video, i.e., the word sequence composed of the word features s(k_c), k_c = 1, 2, ..., K_w, obtained in step 4.1);
the training samples are input into the S-LSTM network model, and the S-LSTM network model is trained;
at this point, the trained encoding network is used to obtain the semantic features and visual features of the video, which are input into the decoding network S-LSTM to complete the text description of the video.
CN202010706821.2A 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy Active CN111985612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010706821.2A CN111985612B (en) 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010706821.2A CN111985612B (en) 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy

Publications (2)

Publication Number Publication Date
CN111985612A CN111985612A (en) 2020-11-24
CN111985612B true CN111985612B (en) 2024-02-06

Family

ID=73439530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010706821.2A Active CN111985612B (en) 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy

Country Status (1)

Country Link
CN (1) CN111985612B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580362B (en) * 2020-12-18 2024-02-20 西安电子科技大学 Visual behavior recognition method, system and computer readable medium based on text semantic supervision
CN112364850B (en) * 2021-01-13 2021-04-06 北京远鉴信息技术有限公司 Video quality inspection method and device, electronic equipment and storage medium
CN112733866B (en) * 2021-01-27 2023-09-26 湖南千里云医疗科技有限公司 Network construction method for improving text description correctness of controllable image
CN113269093B (en) * 2021-05-26 2023-08-22 大连民族大学 Visual feature segmentation semantic detection method and system in video description

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6410300A (en) * 1987-07-03 1989-01-13 Hitachi Ltd User's interface system for searching
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
WO2018188240A1 (en) * 2017-04-10 2018-10-18 北京大学深圳研究生院 Cross-media retrieval method based on deep semantic space
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN110991290A (en) * 2019-11-26 2020-04-10 西安电子科技大学 Video description method based on semantic guidance and memory mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Shuoyan; Xu De; Feng Songhe; Liu Di; Qiu Zhengding. "A Visual Word Generation Algorithm for Image Blocks Based on Contextual Semantic Information." Acta Electronica Sinica, 2010, No. 5, full text. *
Zhang Lihong; Cao Liubin. "Research on a Video Description Method Based on Deep Transfer Learning." Journal of Test and Measurement Technology, 2018, No. 5, full text. *

Also Published As

Publication number Publication date
CN111985612A (en) 2020-11-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant