CN114943921A - Video text description method fusing multi-granularity video semantic information - Google Patents

Video text description method fusing multi-granularity video semantic information Download PDF

Info

Publication number
CN114943921A
CN114943921A (Application CN202210610447.5A)
Authority
CN
China
Prior art keywords
video
network
features
training
semantic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210610447.5A
Other languages
Chinese (zh)
Inventor
王笛
王泉
万波
雒孝通
田玉敏
罗雪梅
王义峰
吴自力
赵辉
潘蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210610447.5A priority Critical patent/CN114943921A/en
Publication of CN114943921A publication Critical patent/CN114943921A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video text description method that fuses multi-granularity video semantic information, which mainly solves the prior-art problems of incomplete video semantic information, insufficient fusion of semantic information, and information redundancy when multiple kinds of semantic information coexist. The implementation scheme is as follows: 1) establish a data set and extract three levels of spatial-temporal features from each video sample with three pre-trained neural network models; 2) construct a video text description network fusing multi-granularity video semantic information and define its loss function; 3) train the video text description network; 4) input the video to be described into the trained network to generate its text description. The method enlarges the variety of extracted video features, fully fuses these features, and selects the appropriate fused feature for text generation, so that the video information is represented effectively while the degree of information redundancy is reduced; it can be used to generate accurate and fluent text describing video content.

Description

Video text description method fusing multi-granularity video semantic information
Technical Field
The invention belongs to the technical field of computers, and further relates to a video text description method which can be used to generate accurate and fluent text that describes video content.
Background
With the rapid development of information technology and the rise of short video, videos have become increasingly common in daily life. A video is an artificially created product intended to convey information and entertainment, and it is diverse and spans many fields. Accurately understanding video content therefore has great significance and wide application value. In the Internet field, video understanding technology is widely applied to video search, video summarization, question-answering systems and so on; in the security field, it can be used to identify abnormal events and to analyze people and vehicles; for assisting the disabled, it can be used to navigate for the blind and to convert a film or short video into text that is then read aloud by text-to-speech. Describing video content with text therefore plays an important role, and video text description is a popular research direction in the field of computer vision.
The purpose of video text description is that, given a video, a computer automatically generates a complete and natural sentence describing its content; the higher the accuracy and fluency of the text, the better. With the rapid development of deep learning in recent years, deep neural network frameworks such as the "encoder-decoder" structure have become the mainstream way to handle this task, but existing video text description methods still have the following problems: video content in natural scenes is complex, with many entities and intricate relations among them, yet most existing methods extract only a few types of video semantic features and cannot represent the video information comprehensively; and insufficient fusion of multiple video semantic features results in low accuracy of the generated description. How to extract the semantic features of a video more comprehensively and fuse them fully to generate logically coherent video text descriptions is therefore a key problem to be solved.
A patent document with application number CN202010706821.2 discloses a video text description encoding and decoding method comprising the following steps: video features are extracted with a two-dimensional residual convolutional neural network ResNet and a three-dimensional convolutional neural network ECO to obtain video semantic feature information, and an S-LSTM network is then used for decoding to obtain the text description of the video, with more accurate semantic features obtained by increasing the difference between words in the encoding stage. Although this method increases the difference between words to obtain more accurate semantic features, the extracted video features are of few types and cannot represent the video comprehensively, and the various kinds of video semantic information and the text semantic information are not fully fused or utilized, so the accuracy of the generated video text description is low.
Wenjie Pei, Jiyuan Zhang, Xiangrong Wang et al., in the CVPR paper "Memory-Attended Recurrent Network for Video Captioning", proposed the memory-attended recurrent network MARN model, which mainly consists of three parts, namely an encoder, an attention-based recurrent decoder and an attended memory, and which uses the designed memory to establish a mapping between the words in the vocabulary and the related video contents. In this model, ResNet-101 is used as the 2-dimensional convolutional neural network to extract two-dimensional visual information of the video and ResNeXt-101 is used as the 3-dimensional convolutional neural network to extract three-dimensional visual information, and the experiments in the paper show that the results of the model improve as the variety of video features increases. However, the video features used by the model are not fully extracted, and the three-dimensional motion semantic information contained in the extracted visual three-dimensional features is insufficient, so the accuracy of the generated video description text is not high.
The CVPR paper "Syntax-Aware Action Targeting for Video Captioning" proposes the syntax-aware action targeting SAAT model, which extracts video appearance features with an Inception-ResNet-v2 neural network, extracts video action features with a C3D three-dimensional neural network, and extracts regional target features of the video intermediate frame with a Faster R-CNN network, setting at most 10 regional targets; a Transformer encoding part is then used to fuse the regional target features with the appearance features and the action features respectively, and a long short-term memory network LSTM decoder generates the video text description. Although multiple kinds of video features are used in this method, only the intermediate frame of the video is used when extracting the local regional target features, so the target regions are few and the regional target features of the video cannot be represented comprehensively; moreover, directly fusing the various features introduces information redundancy, so the accuracy of the generated video text description is low.
In summary, existing video text description methods extract too few types of video features, which leads to incomplete video semantic information; and when multiple video features are present, existing fusion methods do not fuse the video semantic features sufficiently or effectively, which causes information redundancy, introduces interference into the generated video text description, and lowers the accuracy of the generated text.
Disclosure of Invention
The invention aims to provide a video text description method fusing multi-granularity video semantic information, so as to solve the problems that the video features extracted by the prior art are of few types, the video semantic information is incomplete, and the features are not fully fused or information redundancy occurs when multiple kinds of video semantic features exist, and thereby improve the accuracy of the generated text.
The technical idea for realizing the invention is as follows: video key frame images are extracted from the original video by dense "equal-interval frame" sampling, and three pre-trained neural network models are used respectively to extract global features, local features carrying object categories and spatial positions, and action features reflecting the temporal information of the video, so that the video feature information is represented more comprehensively; fine-grained modeling is thus performed on the spatial-temporal information of the video modality data, which is complete, sequential and redundant, to obtain a comprehensive unified video representation. The three kinds of video features are fully fused through a self-attention mechanism, the three resulting fused features are each linked to one of three part-of-speech types of the text, which improves the accuracy of the generated text, and a selection module chooses the fused feature most relevant to the previous word to predict the next word, so that the fused features are used selectively and the information redundancy caused by the coexistence of multiple kinds of video semantic information is alleviated.
According to the above idea, the technical scheme of the invention comprises the following steps:
(1) establishing a training set:
(1a) selecting at least 1200 videos to be described, manually annotating each video with at least 20 natural-language sentences, each sentence containing no more than 30 words, and generating at least 42000 video-natural-language text pairs;
(1b) for the text annotation in each sample pair, marking the POS part of speech of each text annotation with the part-of-speech tagging tool provided by the SpaCy natural language toolkit, so that each sample pair takes the form "video-text-part-of-speech tag";
(1c) counting all the word types appearing in the video text descriptions and numbering them from 0 to form a dictionary of the form {number: word}, and replacing the texts in the sample pairs with their dictionary numbers according to the dictionary to obtain the training set;
(2) extracting the three-level spatial-temporal features of the video samples with three pre-trained neural network models respectively:
(2a) extracting N frames from each video in the training set by "equal-interval frame" sampling as the key frame images of the video;
(2b) inputting the key frame images extracted in (2a) into the existing trained 2-dimensional convolutional neural network pre-training model Inception-ResNet-v2 and extracting a 1536-dimensional feature from each image as the global feature V_a ∈ R^{N×1536}, the global feature dimension extracted from each video being N × 1536;
(2c) inputting the key frame images extracted in (2a) into the existing trained target detection neural network pre-training model Faster R-CNN and extracting M local regions from each image, each region having a 2048-dimensional feature, as the local feature V_o ∈ R^{N×M×2048}, the local region feature dimension extracted from each video being N × M × 2048;
(2d) inputting the videos of the training set into the existing trained 3-dimensional convolutional neural network pre-training model I3D, taking the temporal positions of the key frames extracted in (2a) as N time points, and extracting a 1024-dimensional feature at each time point as the action feature V_m ∈ R^{N×1024} of each video along the time sequence, the action feature dimension extracted from each video being N × 1024.
(3) Constructing a video text description network fusing multi-granularity video semantic information:
(3a) establishing a video semantic feature embedding module formed by connecting a global semantic information embedding submodule, an action semantic information embedding submodule and a local semantic information embedding submodule in parallel;
(3b) constructing a video semantic information fusion module formed by connecting a noun fusion module, a verb fusion module and a logic connecting word fusion module in parallel;
(3c) constructing a video semantic fusion feature selector module;
(3d) constructing a decoder module, consisting of long short-term memory network cells LSTMCell and linear layers, which outputs the short-term memory h_t and the long-term memory c_t;
(3e) cascading the video semantic feature embedding module, the semantic information fusion module, the semantic fusion feature selector module and the decoder module to form the video text description network of multi-granularity video semantic information.
(4) Defining a video text description network loss function fusing multi-granularity video semantic information:
L = L_CE + λ·L_KLD
where L_CE is the cross-entropy loss function, representing the loss of the text sentence information, L_KLD is the KLD loss function, representing the loss of the part-of-speech information, and λ is a set weight parameter;
(5) training a video text description network:
(5a) extracting three-level characteristics of the video from the training set according to the step (2);
(5b) randomly selecting the extracted features and text descriptions and part-of-speech tags in the training set in the step (5a), inputting the selected features and part-of-speech tags into a video text description network in batches, and iteratively updating parameters in the current network by using an Adam optimization algorithm and a gradient descent method until a loss function is converged or the training times are finished to obtain a trained video text description network;
(6) the user generates a video description sentence with the trained video text description network:
(6a) using the same method as in step (2), extracting the global features, local features and action features of the video submitted by the user with the three existing pre-trained neural network models;
(6b) inputting the three features extracted in (6a) into the trained video text description network and outputting the description sentence corresponding to the video.
Compared with the prior art, the invention has the following advantages:
First, the invention uses three different types of pre-trained neural network models to extract the global features, local features and action features of each video sample respectively, and extracts video image frames at equal intervals as the key frames of the video, performing fine-grained modeling of the spatial-temporal information of the video modality data to obtain a comprehensive unified video representation. This solves the problem that existing video text description methods cannot comprehensively represent the complex spatio-temporal semantic features of a video, fully mines the temporal information and target features hidden in the video from a global-local perspective as the main input features of the network model, and thereby improves the accuracy of the generated video text description.
Second, the video text description network fusing multi-granularity video semantic information, combined with the video semantic fusion feature selector module, mines the correlation among different kinds of video semantic information with an attention-based method, which overcomes the limitation of a single kind of video semantic information and allows the global, action and local information of the video to correspond to the part-of-speech information, such as nouns and verbs, used to generate the text.
Third, the fusion network module combined with the selector module uses the relevant video semantic features selectively, so that the generated words are more closely related to the video semantic information. This solves the information redundancy problem that arises in the prior art when multiple kinds of video semantic information are used, makes the relation between the currently generated word and the previous word tighter, and effectively improves the accuracy and fluency of the generated video text description.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, establishing a training set.
(1.1) selecting at least 1200 videos to be described, together with their corresponding description texts, from the existing MSVD data set to form the sample set, each video having at least 20 corresponding manually annotated natural-language texts with no more than 30 words each, generating at least 42000 sample pairs;
(1.2) extracting the POS part-of-speech tags of the texts in the sample pairs with the part-of-speech tagger provided in the SpaCy natural language processing package, where the "NN" tag (singular noun) and the "JJ" tag (adjective) are represented by the number "0", the "VB" tag (verb) is represented by the number "1", and the tags of all other parts of speech are represented by the number "2";
(1.3) counting all the word types appearing in the text descriptions and numbering them from 0 to form a dictionary, placing a "<start>" token at the beginning of each sentence in the training data set to indicate the start of a sentence, an "<end>" token at the end of each text description to indicate the end of a sentence, an "<unk>" token for words not appearing in the dictionary, and a "<pad>" token for blank text; the number of word types in the text is 7351, so the total dictionary length is 7351 + 4 = 7355, and the dictionary takes the form {number: word}.
Suppose the description text of a certain video is X_i; after numbering it with the created dictionary, a text description is represented as
X_i = {c_{i,1}, c_{i,2}, ..., c_{i,t}, ..., c_{i,L}}
where c_{i,t} denotes the dictionary number of the t-th word of the i-th text, i = 1, 2, ..., and L is the length of the text.
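As a concrete illustration of this step, the following is a minimal Python sketch that builds such a dictionary with the four special tokens and converts a caption into its number sequence. It is only a sketch under stated assumptions: whitespace tokenization stands in for the actual preprocessing, and the function and variable names are chosen for illustration (the patent itself uses SpaCy only for the part-of-speech tags, which are not shown here).

```python
from collections import OrderedDict

SPECIALS = ["<start>", "<end>", "<unk>", "<pad>"]

def build_dictionary(captions):
    """Number every word type from 0 upward, then append the four special tokens."""
    vocab = OrderedDict()
    for caption in captions:
        for word in caption.lower().split():      # whitespace tokenization (assumption)
            if word not in vocab:
                vocab[word] = len(vocab)
    for tok in SPECIALS:                          # e.g. 7351 word types + 4 tokens = 7355
        vocab[tok] = len(vocab)
    return vocab

def encode_caption(caption, vocab, max_len=30):
    """Wrap a caption with <start>/<end>, map words to numbers, pad to a fixed length."""
    words = ["<start>"] + caption.lower().split() + ["<end>"]
    ids = [vocab.get(w, vocab["<unk>"]) for w in words]
    return ids[:max_len] + [vocab["<pad>"]] * max(0, max_len - len(ids))

if __name__ == "__main__":
    caps = ["a man is playing a guitar", "a woman is cooking"]
    vocab = build_dictionary(caps)
    print(encode_caption("a man is cooking", vocab))
```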
Step 2, extracting the three levels of spatial-temporal features of the video samples with the three pre-trained neural network models.
The three pre-trained neural network models are as follows:
the first is the 2-dimensional convolutional neural network pre-training model Inception-ResNet-v2;
the second is the 3-dimensional convolutional neural network pre-training model I3D;
the third is the target detection neural network pre-training model Faster R-CNN.
The specific implementation of this step is as follows:
(2.1) sequentially selecting one video from the training set for "equal-interval frame" sampling and calculating the specific temporal position of each key frame:
P_i = (T × V / N) × i
where T represents the duration of the video, V represents the frame rate of the video, N represents the number of intervals, which is set to 26 in the experiment, and P_i represents the i-th specific temporal position in the video, i = 0, 1, 2, ..., N;
(2.2) extracting the N frames of the video at the N calculated specific temporal positions as key frames, using the existing OpenCV tool;
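A minimal sketch of this equal-interval sampling with OpenCV is given below. It follows the interval formula of (2.1) in spirit, but seeking by frame index rather than by timestamp and taking one frame per interval (i = 0, ..., N-1) are implementation assumptions.

```python
import cv2

def sample_key_frames(video_path, n_intervals=26):
    """Grab one frame at each of N equally spaced positions of the video."""
    cap = cv2.VideoCapture(video_path)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # roughly T * V in the notation above
    frames = []
    for i in range(n_intervals):
        pos = int(frame_count / n_intervals * i)            # P_i, clamped to a valid index
        cap.set(cv2.CAP_PROP_POS_FRAMES, min(pos, max(frame_count - 1, 0)))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames   # list of N RGB key-frame images
```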
and (2.3) respectively inputting the N key frame images extracted in the step (2.2) into the three neural network pre-training models, and extracting global features, local features and action features of the video:
(2.3.1) inputting the N key frame images into the 2-dimensional convolutional pre-trained neural network model Inception-ResNet-v2 and extracting a 1536-dimensional feature from each image, obtaining the global feature of the video V_a ∈ R^{N×1536};
(2.3.2) inputting the N key frame images into the target detection neural network pre-training model Faster R-CNN and extracting M local regions from each image, each region having a 2048-dimensional feature, obtaining the local feature of the video V_o ∈ R^{N×M×2048}, where M is set to 36 in this example;
(2.3.3) inputting each video into the 3-dimensional convolutional neural network pre-training model I3D and extracting the dynamic behavior features of each video along the time sequence, obtaining the action feature of the video V_m ∈ R^{N×1024};
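The sketch below illustrates how the global features might be extracted with a publicly available pre-trained Inception-ResNet-v2, here loaded through the timm library, which is an assumption rather than something specified by the patent; the local and action feature extractors are only indicated in comments because they depend on external Faster R-CNN and I3D weights.

```python
import torch
import timm
from torchvision import transforms
from PIL import Image

# Global (2-D) features: timm's pretrained Inception-ResNet-v2 is used here as one possible
# stand-in for the pre-trained model named above; its pooled output is 1536-dimensional.
backbone = timm.create_model("inception_resnet_v2", pretrained=True, num_classes=0).eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

@torch.no_grad()
def global_features(frames):
    """frames: list of N RGB key-frame arrays -> tensor V_a of shape (N, 1536)."""
    batch = torch.stack([preprocess(Image.fromarray(f)) for f in frames])
    return backbone(batch)

# Local region features (V_o, N x M x 2048) would typically come from a Faster R-CNN style
# detector with RoI-pooled features, and action features (V_m, N x 1024) from an I3D model;
# both depend on separate pretrained weights and are left as independent extraction steps here.
```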
(2.4) connecting the three video features extracted from one video to obtain the overall feature representation of the video:
V_j = { V_a^j, V_o^j, V_m^j }
where V_a^j represents the global feature of the j-th video, V_o^j represents the local feature of the j-th video, V_m^j represents the action feature of the j-th video, V_j denotes the overall feature of the video, and j denotes the j-th video in the data set, j = 1, 2, ...
Step 3, constructing the video text description network fusing multi-granularity video semantic information and processing the three video features.
(3.1) constructing a video semantic feature embedding module formed by connecting a global semantic information embedding submodule, an action semantic information embedding submodule and a local semantic information embedding submodule in parallel, where the structure, dimensions and input/output of each submodule are set as follows:
(3.1.1) the global semantic information embedding submodule is formed by cascading a 1536 × 1000-dimensional linear layer and a bidirectional long short-term memory network Bi-LSTM with dimension 1024; its input is the global feature V_a ∈ R^{N×1536} of (2.3.1) and its output is the global semantic feature V'_a ∈ R^{N×1024};
(3.1.2) the action semantic information embedding submodule is formed by cascading a 1024 × 1000-dimensional linear layer and a bidirectional long short-term memory network Bi-LSTM with dimension 1024; its input is the action feature V_m ∈ R^{N×1024} of (2.3.3) and its output is the action semantic feature V'_m ∈ R^{N×1024};
(3.1.3) the local semantic information embedding submodule is formed by cascading a 2048 × 1000-dimensional linear layer and a 1600 × 1000-dimensional linear layer; its input is the local feature V_o ∈ R^{N×M×2048} of (2.3.2) and its output is the local semantic feature V'_o ∈ R^{N×M×1000};
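A PyTorch sketch of one such embedding submodule (the global branch) is given below, assuming the stated 1536 → 1000 linear projection followed by a Bi-LSTM whose two directions together produce a 1024-dimensional output; the per-direction hidden size of 512 is an assumption.

```python
import torch
import torch.nn as nn

class GlobalSemanticEmbedding(nn.Module):
    """Linear(1536 -> 1000) followed by a Bi-LSTM whose concatenated output is 1024-D."""
    def __init__(self, in_dim=1536, proj_dim=1000, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, proj_dim)
        # out_dim // 2 per direction so that forward + backward = out_dim (assumption)
        self.bilstm = nn.LSTM(proj_dim, out_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, v_a):                  # v_a: (B, N, 1536)
        h, _ = self.bilstm(self.proj(v_a))   # h:   (B, N, 1024)
        return h

if __name__ == "__main__":
    v_a = torch.randn(2, 26, 1536)           # B=2 videos, N=26 key frames
    print(GlobalSemanticEmbedding()(v_a).shape)   # torch.Size([2, 26, 1024])
```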
(3.2) constructing a video semantic information fusion module formed by connecting a noun fusion submodule, a verb fusion submodule and a logical connective fusion submodule in parallel, the three semantic features being processed by the three submodules as follows:
(3.2.1) the noun fusion submodule is formed by cascading a 512 × 512-dimensional linear layer and a dot-product attention network, and computes the noun semantic feature V_n as
V_n = softmax( h_{t-1} (V'_o)^T / √(d_{V'_o}) ) V'_o
where h_{t-1} is the short-term memory output at the previous time step, V'_o is the local semantic feature of (3.1.3), and d_{V'_o} is the dimension of V'_o;
(3.2.2) the verb fusion submodule is formed by cascading a 512 × 512-dimensional linear layer and a dot-product attention network, and computes the verb semantic feature V_l as
V_l = softmax( h_{t-1} (V'_m)^T / √(d_{V'_m}) ) V'_m
where V'_m is the action semantic feature and d_{V'_m} is the dimension of V'_m;
(3.2.3) the logical connective fusion submodule is formed by cascading a 512 × 512-dimensional linear layer and a dot-product attention network, and computes the logical connective semantic feature V_f by the same dot-product attention, using the long-term memory c_{t-1} of the previous time step, with d_{c_{t-1}}, the dimension of c_{t-1}, as the scaling factor.
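The following sketch shows one way the noun fusion submodule could be realized as scaled dot-product attention over the local semantic features with the previous short-term memory as the query; the class name, the 512-dimensional working size of the features, and the placement of the 512 × 512 linear layer on the query are assumptions made for illustration. The verb and logical connective submodules would follow the same pattern with V'_m and the long-term memory respectively.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NounFusion(nn.Module):
    """Dot-product attention: query = previous decoder hidden state, keys/values = local features."""
    def __init__(self, hid_dim=512, feat_dim=512):
        super().__init__()
        self.q_proj = nn.Linear(hid_dim, feat_dim)   # the 512 x 512 linear layer

    def forward(self, h_prev, v_local):
        # h_prev:  (B, 512)      previous short-term memory h_{t-1}
        # v_local: (B, R, 512)   local semantic features V'_o, projected to 512 (assumption)
        q = self.q_proj(h_prev).unsqueeze(1)                              # (B, 1, 512)
        scores = q @ v_local.transpose(1, 2) / math.sqrt(v_local.size(-1))
        attn = F.softmax(scores, dim=-1)                                  # (B, 1, R)
        return (attn @ v_local).squeeze(1)                                # fused feature V_n: (B, 512)
```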
(3.3) constructing the video semantic fusion feature selector module and processing the three fused semantic features of (3.2):
(3.3.1) cascading a 1024 × 512-dimensional linear layer, a 512 × 512-dimensional linear layer, the activation function tanh and a softmax function to form the video semantic fusion feature selector module;
(3.3.2) inputting the noun feature V_n, the verb feature V_l and the logical connective feature V_f of (3.2) into the first linear layer of the video semantic fusion feature selector module to obtain the noun embedding, verb embedding and logical connective embedding respectively;
(3.3.3) inputting the short-term memory h_{t-1} output at the previous time step into the second linear layer of the video semantic fusion feature selector module to obtain the short-term memory embedding;
(3.3.4) adding each of the three part-of-speech embeddings of (3.3.2) to the short-term memory embedding of (3.3.3), generating a noun score, a verb score and a logical connective score through the tanh activation function, and using the softmax function to select the part-of-speech feature with the largest of the three scores;
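A sketch of this selector is given below: each fused feature is embedded by one linear layer, the previous short-term memory by another, the sums are scored through tanh, and a softmax over the three scores picks the feature to pass on. Reducing each tanh output to a scalar score with an extra learned projection is an assumption, since the patent only names the layer types.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedFeatureSelector(nn.Module):
    """Scores the noun / verb / connective fused features against the previous hidden state."""
    def __init__(self, feat_dim=1024, hid_dim=512):
        super().__init__()
        self.feat_embed = nn.Linear(feat_dim, hid_dim)   # first linear layer (1024 x 512)
        self.hid_embed = nn.Linear(hid_dim, hid_dim)     # second linear layer (512 x 512)
        self.score = nn.Linear(hid_dim, 1)               # scalar score head (assumption)

    def forward(self, fused, h_prev):
        # fused: (B, 3, feat_dim) stacked [V_n, V_l, V_f];  h_prev: (B, hid_dim)
        s = self.score(torch.tanh(self.feat_embed(fused) + self.hid_embed(h_prev).unsqueeze(1)))
        weights = F.softmax(s.squeeze(-1), dim=-1)        # (B, 3) selection probabilities
        idx = weights.argmax(dim=-1)                      # hard selection of the best feature
        return fused[torch.arange(fused.size(0)), idx], weights
```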
(3.4) constructing a decoder module formed by cascading a long short-term memory network cell LSTMCell with dimension 1536, a 1536 × 7351-dimensional linear layer, a 512 × 7351-dimensional linear layer and a softmax function, inputting the part-of-speech feature with the largest score from (3.3) into the decoder module, and outputting the number of the most probable word in the dictionary;
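A simplified decoder step under the stated dimensions might look as follows; summing the logits of the two vocabulary projections is an assumption, as the patent does not state how the two linear layers are combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionDecoder(nn.Module):
    """One LSTMCell step followed by projection onto the word vocabulary."""
    def __init__(self, in_dim=1536, hid_dim=512, vocab_size=7351):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hid_dim)
        self.out_from_input = nn.Linear(in_dim, vocab_size)    # the 1536 x 7351 linear layer
        self.out_from_hidden = nn.Linear(hid_dim, vocab_size)  # the 512 x 7351 linear layer

    def step(self, x, state):
        # x: (B, 1536) selected fused feature (plus any word embedding concatenated upstream)
        h, c = self.cell(x, state)
        logits = self.out_from_input(x) + self.out_from_hidden(h)   # summed logits (assumption)
        return F.log_softmax(logits, dim=-1), (h, c)
```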
and (3.5) cascading a video semantic feature embedding module, a semantic information fusion module, a semantic fusion feature selector module and a decoder module to form a video text description network of multi-granularity video semantic information.
Step 4, calculating the overall loss of the video text description network with the cross-entropy loss function and the KLD loss function.
(4.1) connecting the most probable word numbers generated by inputting each training sample into the network to form the predicted video text description;
(4.2) calculating the cross-entropy loss of the text according to the following formula:
L_CE = - Σ_{t=1}^{Len} log P_t(w*_t)
where Len is the text length, P_t is the word probability distribution predicted by the network for the text description of (4.1), and w*_t is the t-th word of the correct text description in the training set;
(4.3) calculating the KLD loss of the part of speech according to the following formula:
L_KLD = Σ_{t=1}^{Len} KL( q_t ‖ p_t )
where p_t is the part-of-speech tag distribution predicted by the network and q_t is the correct part-of-speech tag in the training set;
(4.4) adding the two losses to obtain the overall loss of the network:
L = L_CE + λ·L_KLD
where λ is the set weight parameter.
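A sketch of the combined loss under these definitions is given below. The weight λ and the cross-entropy plus KL structure follow the description; the tensor shapes, the use of soft part-of-speech distributions for the KL term, and summation over time steps are assumptions.

```python
import torch
import torch.nn.functional as F

def caption_loss(word_logprobs, target_words, pos_logprobs, target_pos_dist, lam=1.0):
    """
    word_logprobs:   (B, Len, vocab)  log-probabilities of each predicted word
    target_words:    (B, Len)         dictionary numbers of the correct words
    pos_logprobs:    (B, Len, 3)      log-probabilities of the predicted POS classes
    target_pos_dist: (B, Len, 3)      reference POS distribution (e.g. one-hot from the tags)
    """
    ce = F.nll_loss(word_logprobs.transpose(1, 2), target_words, reduction="sum")
    kld = F.kl_div(pos_logprobs, target_pos_dist, reduction="sum")
    return ce + lam * kld
```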
Step 5, training the video text description network fusing multi-granularity video semantic information.
(5.1) extracting the three kinds of features of the videos in the training set by the same method as in step 2;
(5.2) setting the initial learning rate to 0.0001 and the number of training epochs to 20, and reducing the learning rate to 0.00001 when the number of completed training epochs reaches 10;
(5.3) inputting the features extracted in (5.1), together with the correct text descriptions and part-of-speech tags in the training set, into the video text description network constructed in step 3 in batches, each batch containing 32 sample pairs, and iteratively updating the parameters of the current network with the Adam optimization algorithm and gradient descent;
(5.4) calculating the overall loss value of the network according to step 4 at the end of each training epoch and saving the network parameters with the smallest value;
(5.5) repeating (5.3) and (5.4) until the loss function converges or the number of training epochs is reached, obtaining the trained video text description network.
Step 6, generating the text description of a video.
For the video to be described, extracting its global features, local features and action features with the three existing pre-trained neural network models by the same method as in step 2, inputting the features into the trained video text description network, and outputting the description sentence corresponding to the video.
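At inference time the decoder is unrolled from the "<start>" token until "<end>" is produced or the 30-word limit is reached; the sketch below illustrates this, where extract_features and model.decode_step are hypothetical stand-ins for the components above and greedy decoding (rather than beam search) is an assumption.

```python
import torch

@torch.no_grad()
def describe_video(video_path, model, vocab, extract_features, max_len=30):
    inv_vocab = {idx: word for word, idx in vocab.items()}
    feats = extract_features(video_path)                 # global + local + action features
    state = None
    token = vocab["<start>"]
    words = []
    for _ in range(max_len):
        logprobs, state = model.decode_step(feats, token, state)   # one decoder step
        token = int(logprobs.argmax(dim=-1))
        if inv_vocab[token] == "<end>":
            break
        words.append(inv_vocab[token])
    return " ".join(words)
```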
The effect of the invention is further explained below in combination with simulation experiments.
1. Simulation experiment conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: the processor is an Intel Core Xeon 4210CPU, the main frequency is 2.2GHz, the internal memory is 16GB, and the display card is Nvidia GeForce RTX 3090.
The software platform of the simulation experiment of the invention is as follows: the linux4.15 operating system and python 3.7.
The simulation experiment of the invention adopts the following data: MSVD dataset comprising 1970 "video-multiple description text" sample pairs, the text description in each sample pair being at least 30. 1200 sample pairs are selected from the MSVD data set as a training set, and the rest sample pairs are used as a test set to verify the accuracy of the network.
The invention selects four existing evaluation indexes: bilingual evaluation accuracy BLUE @4, recall evaluation accuracy ROUGE _ L, explicitly ordered translation evaluation index METEOR and consistency-based image description evaluation index CIDER
2. Simulation content and result analysis thereof:
the simulation experiment of the invention is that the method of the invention and the video text description networks respectively constructed by four prior arts (POS-CG, SAAT, MARN, and texture) are adopted, the five networks are respectively trained by using the same training set data to obtain five trained video text description networks, then the same test set is respectively input into the five trained networks to obtain corresponding video text descriptions, and the output results of the invention are evaluated by using the existing four index methods, as shown in Table 1:
table 1: comparison table of simulation result of the invention and the prior art
Figure BDA0003671857020000101
The four prior-art methods in Table 1 are as follows:
POS-CG is the video text description network model proposed by Wang et al. in the paper "Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network" (IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 2641-2650, doi: 10.1109/ICCV.2019.00273), POS-CG for short.
SAAT is the video text description network model proposed by Zheng et al. in the paper "Syntax-Aware Action Targeting for Video Captioning" (2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 13093-13102, doi: 10.1109/CVPR42600.2020.01311), SAAT for short.
MARN is the video text description network model proposed by Pei et al. in the paper "Memory-Attended Recurrent Network for Video Captioning" (2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)), MARN for short.
Mixture is the video text description network model proposed by Hou et al. in the paper "Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning" (IEEE/CVF International Conference on Computer Vision (ICCV)), Mixture for short.
As can be seen from Table 1, all four evaluation indexes of the video text descriptions generated by the method of the invention are higher than those of the four prior-art methods, with the four indexes improved by 8%, 3.7%, 0.4% and 3% respectively.
The above simulation experiment shows that the invention extracts three kinds of video features with three pre-trained neural network models and outputs the description text through the network model fusing multi-granularity video semantic information, thereby solving the problems that the video features extracted by existing video text description methods are of too few types and cannot represent the video information effectively, and further solving the problems that existing fusion methods for multiple video features do not fuse the video semantic features effectively and suffer from information redundancy when multiple features are present, which interferes with generating accurate video text.

Claims (6)

1. A video text description method fusing multi-granularity video semantic information is characterized by comprising the following steps:
(1) establishing a training set:
(1a) selecting at least 1200 videos to be described, manually annotating each video with at least 20 natural-language sentences, each sentence containing no more than 30 words, and generating at least 42000 video-natural-language text pairs;
(1b) for the text annotation in each sample pair, marking the POS part of speech of each text annotation with the part-of-speech tagging tool provided in the SpaCy natural language toolkit, so that each sample pair takes the form "video-text-part-of-speech tag";
(1c) counting all the word types appearing in the video text descriptions and numbering them from 0 to form a dictionary of the form {number: word}, and replacing the texts in the sample pairs with their dictionary numbers according to the dictionary to obtain the training set;
(2) extracting the three-level spatial-temporal features of the video samples with three pre-trained neural network models respectively:
(2a) extracting N frames from each video in the training set by "equal-interval frame" sampling as the key frame images of the video;
(2b) inputting the key frame images extracted in (2a) into the existing trained 2-dimensional convolutional neural network pre-training model Inception-ResNet-v2 and extracting a 1536-dimensional feature from each image as the global feature V_a ∈ R^{N×1536}, the global feature dimension extracted from each video being N × 1536;
(2c) inputting the key frame images extracted in (2a) into the existing trained target detection neural network pre-training model Faster R-CNN and extracting M local regions from each image, each region having a 2048-dimensional feature, as the local feature V_o ∈ R^{N×M×2048}, the local region feature dimension extracted from each video being N × M × 2048;
(2d) inputting the videos of the training set into the existing trained 3-dimensional convolutional neural network pre-training model I3D, taking the temporal positions of the key frames extracted in (2a) as N time points, and extracting a 1024-dimensional feature at each time point as the action feature V_m ∈ R^{N×1024} of each video along the time sequence, the action feature dimension extracted from each video being N × 1024.
(3) Constructing a video text description network fusing multi-granularity video semantic information:
(3a) establishing a video semantic feature embedding module formed by connecting a global semantic information embedding submodule, an action semantic information embedding submodule and a local semantic information embedding submodule in parallel;
(3b) constructing a video semantic information fusion module formed by connecting a noun fusion module, a verb fusion module and a logic connecting word fusion module in parallel;
(3c) constructing a video semantic fusion feature selector module;
(3d) constructing a decoder module, consisting of long short-term memory network cells LSTMCell and linear layers, which outputs the short-term memory h_t and the long-term memory c_t;
(3e) cascading the video semantic feature embedding module, the semantic information fusion module, the semantic fusion feature selector module and the decoder module to form the video text description network of multi-granularity video semantic information.
(4) Defining a video text description network loss function fusing multi-granularity video semantic information:
L = L_CE + λ·L_KLD
where L_CE is the cross-entropy loss function, representing the loss of the text sentence information, L_KLD is the KLD loss function, representing the loss of the part-of-speech information, and λ is a set weight parameter;
(5) training a video text description network:
(5a) extracting three-level characteristics of the video from the training set according to the step (2);
(5b) randomly selecting the extracted features and text descriptions and part-of-speech tags in the training set in the step (5a), inputting the selected features and part-of-speech tags into a video text description network in batches, and iteratively updating parameters in the current network by using an Adam optimization algorithm and a gradient descent method until a loss function is converged or the training times are finished to obtain a trained video text description network;
(6) and the user generates a video description sentence by using the trained video text description network:
(6a) extracting global features, local features and action features of videos submitted by users by using three existing neural network pre-training models respectively by adopting the same method as the method (2);
(6b) inputting the extracted three features of (6a) into a trained video text description network, and outputting description sentences corresponding to the video.
2. The method of claim 1, wherein the extraction of N frames from each video in the training set by "equal-interval frame" sampling in step (2a) is realized as follows:
(2a1) sequentially selecting a video from the training set and calculating the specific temporal position of each key frame in the video:
P_i = (T × V / N) × i
where T represents the duration of the video, V represents the frame rate of the video, N represents the set number of intervals, and P_i represents the i-th sampling position, i = 0, 1, 2, ..., N;
(2a2) using the existing OpenCV tool, extracting the N frames of the video at the N specific temporal positions calculated in (2a1).
3. The method of claim 1, wherein the structures and parameter settings of the submodules of the video semantic feature embedding module in step (3a) are as follows:
the global semantic information embedding submodule is formed by connecting a 1536 × 1000-dimensional linear layer and a bidirectional long short-term memory network Bi-LSTM with dimension 1024, and is used for outputting the global semantic feature V'_a;
the action semantic information embedding submodule is formed by connecting a 1024 × 1000-dimensional linear layer and a bidirectional long short-term memory network Bi-LSTM with dimension 1024, and is used for outputting the action semantic feature V'_m;
the local semantic information embedding submodule is formed by connecting a 2048 × 1000-dimensional linear layer and a 1600 × 1000-dimensional linear layer, and is used for outputting the local semantic feature V'_o.
4. The method of claim 1, wherein the structures and parameter settings of the submodules of the video semantic information fusion module in step (3b) are as follows:
the noun fusion submodule is formed by connecting a 512 × 512-dimensional linear layer and a dot-product attention network, and is used for outputting the noun semantic feature V_n:
V_n = softmax( h_{t-1} (V'_o)^T / √(d_{V'_o}) ) V'_o
where h_{t-1} is the short-term memory output at the previous time step, V'_o is the local semantic feature, and d_{V'_o} is the dimension of V'_o;
the verb fusion submodule is formed by connecting a 512 × 512-dimensional linear layer and a dot-product attention network, and is used for outputting the verb semantic feature V_l:
V_l = softmax( h_{t-1} (V'_m)^T / √(d_{V'_m}) ) V'_m
where V'_m is the action semantic feature and d_{V'_m} is the dimension of V'_m;
the logical connective fusion submodule is formed by connecting a 512 × 512-dimensional linear layer and a dot-product attention network, and is used for outputting the logical connective semantic feature V_f by the same dot-product attention, where c_{t-1} is the long-term memory of the previous time step and d_{c_{t-1}}, the dimension of c_{t-1}, is used as the scaling factor.
5. The method of claim 1, wherein the cross-entropy loss function L_CE and the KLD loss function L_KLD in step (4) are expressed respectively as follows:
L_CE = - Σ_{t=1}^{Len} log P_t(w*_t)
L_KLD = Σ_{t=1}^{Len} KL( q_t ‖ p_t )
where Len is the text length, P_t is the word probability distribution predicted by the network, w*_t is the t-th word of the correct text description in the training set, p_t is the part-of-speech tag distribution predicted by the network, and q_t is the correct part-of-speech tag in the training set.
6. The method of claim 1, wherein: and (5b) iteratively updating parameters in the current network by using an Adam optimization algorithm and a gradient descent method, wherein the method is realized as follows:
(5b1) setting the initial value of the learning rate to be 0.0001 and the training times to be 20, and adjusting the learning rate to be 0.00001 when the training times of the network training reach 10;
(5b2) in each training, selecting sample pairs from a training set in batches, setting the size of each batch of sample pairs to be 32, inputting the sample pairs into a network, calculating the overall loss of the network, and iteratively updating network parameters through back propagation;
(5b3) storing the network parameter with the minimum loss function after each training;
(5b4) and repeating (5b2) and (5b3) until the loss function converges or the training times are finished, so as to obtain the trained video text description network.
CN202210610447.5A 2022-05-31 2022-05-31 Video text description method fusing multi-granularity video semantic information Pending CN114943921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210610447.5A CN114943921A (en) 2022-05-31 2022-05-31 Video text description method fusing multi-granularity video semantic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210610447.5A CN114943921A (en) 2022-05-31 2022-05-31 Video text description method fusing multi-granularity video semantic information

Publications (1)

Publication Number Publication Date
CN114943921A true CN114943921A (en) 2022-08-26

Family

ID=82909030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210610447.5A Pending CN114943921A (en) 2022-05-31 2022-05-31 Video text description method fusing multi-granularity video semantic information

Country Status (1)

Country Link
CN (1) CN114943921A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116320622A (en) * 2023-05-17 2023-06-23 成都索贝数码科技股份有限公司 Broadcast television news video-to-picture manuscript manufacturing system and manufacturing method
CN116320622B (en) * 2023-05-17 2023-08-18 成都索贝数码科技股份有限公司 Broadcast television news video-to-picture manuscript manufacturing system and manufacturing method
CN117197727A (en) * 2023-11-07 2023-12-08 浙江大学 Global space-time feature learning-based behavior detection method and system
CN117197727B (en) * 2023-11-07 2024-02-02 浙江大学 Global space-time feature learning-based behavior detection method and system

Similar Documents

Publication Publication Date Title
Wu et al. Image captioning and visual question answering based on attributes and external knowledge
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
US20200042597A1 (en) Generating question-answer pairs for automated chatting
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
Li et al. Residual attention-based LSTM for video captioning
CN114943921A (en) Video text description method fusing multi-granularity video semantic information
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
Tang et al. Deep sequential fusion LSTM network for image description
CN111598183A (en) Multi-feature fusion image description method
CN114676234A (en) Model training method and related equipment
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
EP4310695A1 (en) Data processing method and apparatus, computer device, and storage medium
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
Huang et al. C-Rnn: a fine-grained language model for image captioning
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
Perez-Martin et al. A comprehensive review of the video-to-text problem
Verma et al. Automatic image caption generation using deep learning
Hu et al. STFE-Net: a spatial-temporal feature extraction network for continuous sign language translation
Tiwari et al. Automatic caption generation via attention based deep neural network model
CN114661874A (en) Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
Mulyawan et al. Automatic Indonesian Image Captioning using CNN and Transformer-Based Model Approach
Bai E-commerce knowledge extraction via multi-modal machine reading comprehension
Oura et al. Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
Liu et al. Attention-based convolutional LSTM for describing video
Dehaqi et al. Adversarial image caption generator network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination