CN114913448A - Video understanding method, device, equipment, storage medium and computer program product

Info

Publication number
CN114913448A
CN114913448A
Authority
CN
China
Prior art keywords
video
information
text
characteristic information
understood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210242033.1A
Other languages
Chinese (zh)
Inventor
全绍军
林格
陈小燕
梁少玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Longse Technology Co ltd
Original Assignee
Longse Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Longse Technology Co ltd filed Critical Longse Technology Co ltd
Priority to CN202210242033.1A priority Critical patent/CN114913448A/en
Publication of CN114913448A publication Critical patent/CN114913448A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00 Pattern recognition
            • G06F18/20 Analysing
              • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
                • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/045 Combinations of networks
                • G06N3/048 Activation functions
                • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
              • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The application relates to the field of video technology and provides a video understanding method, a video understanding apparatus, computer equipment, a storage medium and a computer program product, which can improve the efficiency and accuracy of video understanding. The method comprises the following steps: acquiring a video to be understood; respectively acquiring text feature information, dynamic feature information and static feature information of the video to be understood by using a text feature acquisition network, a dynamic feature acquisition network and a static feature acquisition network; and acquiring an understanding result of the video to be understood based on the text feature information, the dynamic feature information and the static feature information.

Description

Video understanding method, device, equipment, storage medium and computer program product
Technical Field
The present application relates to the field of video technologies, and in particular, to a video understanding method, apparatus, computer device, storage medium, and computer program product.
Background
With the rapid development of natural language processing and computer vision, video understanding has become a new research hotspot following image understanding. Efficient understanding of videos obtained in real time can contribute greatly to improving security.
In conventional practice, video understanding is generally performed manually, which is inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video understanding method, apparatus, computer device, computer readable storage medium and computer program product for solving the above technical problems.
In a first aspect, the present application provides a video understanding method. The method comprises the following steps:
acquiring a video to be understood;
respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of a video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In one embodiment, acquiring an understanding result of a video to be understood based on text feature information, dynamic feature information and static feature information includes:
performing guiding attention fusion on the text characteristic information, the dynamic characteristic information and the static characteristic information to obtain a plurality of modal characterization information;
acquiring aggregated feature information of a plurality of modal characterization information by using a graph attention network;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information.
In one embodiment, acquiring an understanding result of a video to be understood according to the aggregated feature information includes:
utilizing a time sequence memory updating network to carry out time sequence memory updating on the aggregation characteristic information;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information updated by the time sequence memory.
In one embodiment, the method further comprises:
acquiring a video sample carrying text information and an understanding result label corresponding to the video sample;
and training a text feature acquisition network, a dynamic feature acquisition network, a static feature acquisition network, a graph attention network and a time sequence memory updating network by using the video sample and the understanding result label.
In one embodiment, acquiring text feature information of a video to be understood by using a text feature acquisition network includes:
detecting whether a video to be understood carries corresponding text information;
if not, generating corresponding text information according to the video to be understood by using a text information generation model;
and acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network according to the corresponding text information.
In one embodiment, respectively acquiring the text feature information, the dynamic feature information and the static feature information of the video to be understood by using the text feature acquisition network, the dynamic feature acquisition network and the static feature acquisition network includes:
respectively acquiring to-be-processed text characteristic information, to-be-processed dynamic characteristic information and to-be-processed static characteristic information of a video to be understood by using a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and respectively carrying out context coding on the text characteristic information to be processed, the dynamic characteristic information to be processed and the static characteristic information to be processed by utilizing the context information acquisition network to obtain the text characteristic information, the dynamic characteristic information and the static characteristic information.
In a second aspect, the present application also provides a video understanding apparatus. The device comprises:
the to-be-understood video acquisition module is used for acquiring a to-be-understood video;
the characteristic information acquisition module is used for respectively acquiring the text characteristic information, the dynamic characteristic information and the static characteristic information of the video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and the understanding result acquisition module is used for acquiring the understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring a video to be understood; respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of a video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network; and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a video to be understood; respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of a video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network; and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring a video to be understood; respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of a video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network; and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
The video understanding method, apparatus, computer device, storage medium and computer program product described above acquire a video to be understood, respectively acquire the text feature information, dynamic feature information and static feature information of the video using a text feature acquisition network, a dynamic feature acquisition network and a static feature acquisition network, and acquire an understanding result of the video based on the text feature information, dynamic feature information and static feature information. In this scheme, the video to be understood is acquired and input into a video understanding model, the text feature acquisition network, dynamic feature acquisition network and static feature acquisition network in the model respectively obtain the text feature information, dynamic feature information and static feature information of the video, and the understanding result is obtained based on this feature information. This improves the efficiency of video understanding; further, because the three feature acquisition networks jointly provide the feature information of the video to be understood, the accuracy of video understanding can also be improved.
Drawings
FIG. 1 is a flow diagram illustrating a video understanding method in one embodiment;
FIG. 2 is a schematic diagram of a video understanding model in one embodiment;
FIG. 3 is a flow diagram illustrating construction of a video understanding dataset according to one embodiment;
FIG. 4 is a flow chart illustrating a video understanding method according to another embodiment;
FIG. 5 is a block diagram of a video understanding apparatus in one embodiment;
FIG. 6 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a video understanding method is provided. This embodiment is described by applying the method to a terminal, and the method includes the following steps:
and step S101, acquiring a video to be understood.
In this step, the video to be understood may be a real-time event-specific video.
Specifically, the terminal acquires a video to be understood.
Step S102, respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of the video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network.
In this step, as shown in fig. 2, the text feature acquisition network (which may also be referred to as a caption text feature extraction network) may be a network model for obtaining the text feature information of the video to be understood, such as a BERT network model. The text feature information may be obtained by word-embedding the caption text information of the video to be understood with the BERT network model. As shown in fig. 2 and 3, the caption text information may be caption description information obtained by manually summarizing the video to be understood (for example, each video segment is summarized by specific-event staff into a piece of caption description information, for example "XXXX specifically"), and this caption description information may be stored in a document corresponding to the video to be understood. The caption text information may also be corresponding text information generated from the video to be understood by a text information generation model, for example an MDVC (Multi-modal Dense Video Captioning) model. The dynamic feature acquisition network may be a network model for extracting the dynamic feature information of the video to be understood, such as a C3D network; the static feature acquisition network may be a network model for extracting the static feature information of the video to be understood, such as a VGG16 network model.
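For illustration only, the following minimal PyTorch-style sketch shows how three such backbones could be wired together to produce per-frame features of the shapes described in this document; the placeholder linear layers merely stand in for pretrained BERT, C3D and VGG16, and all class and variable names are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FeatureBranches(nn.Module):
    """Stand-ins for the text (BERT), dynamic (C3D) and static (VGG16) feature
    acquisition networks; each maps its raw input to per-frame feature vectors."""

    def __init__(self, text_in=300, text_out=768, motion_in=4096, static_in=4096, vis_out=4096):
        super().__init__()
        self.text_net = nn.Linear(text_in, text_out)       # placeholder for BERT word embedding
        self.motion_net = nn.Linear(motion_in, vis_out)    # placeholder for C3D's last fc layer
        self.static_net = nn.Linear(static_in, vis_out)    # placeholder for VGG16's fc layer

    def forward(self, captions, clips, frames):
        # captions: (N, L, text_in)  one caption of L tokens per frame
        # clips:    (N, motion_in)   one 16-frame clip descriptor per frame (sliding window)
        # frames:   (N, static_in)   one static frame descriptor per frame (1 FPS)
        s = self.text_net(captions)    # to-be-processed text features,    (N, L, 768)
        m = self.motion_net(clips)     # to-be-processed dynamic features, (N, 4096)
        a = self.static_net(frames)    # to-be-processed static features,  (N, 4096)
        return s, m, a

# Example shapes for a 6-frame (N = 6) segment with 20-token captions:
branches = FeatureBranches()
s, m, a = branches(torch.randn(6, 20, 300), torch.randn(6, 4096), torch.randn(6, 4096))
```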
Specifically, as shown in fig. 2, the terminal respectively acquires text feature information, dynamic feature information, and static feature information of the video to be understood by using a text feature acquisition network, a dynamic feature acquisition network, and a static feature acquisition network.
Step S103, acquiring an understanding result of the video to be understood based on the text feature information, the dynamic feature information and the static feature information.
In this step, the understanding result may indicate which preset type the video to be understood belongs to.
Specifically, the terminal obtains an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In the video understanding method, a video to be understood is acquired, the text feature information, dynamic feature information and static feature information of the video are respectively acquired using a text feature acquisition network, a dynamic feature acquisition network and a static feature acquisition network, and an understanding result of the video is acquired based on the text feature information, dynamic feature information and static feature information. In this scheme, the video to be understood is acquired and input into a video understanding model, the text feature acquisition network, dynamic feature acquisition network and static feature acquisition network in the model respectively obtain the text feature information, dynamic feature information and static feature information of the video, and the understanding result is obtained based on this feature information. This improves the efficiency of video understanding; further, because the three feature acquisition networks jointly provide the feature information of the video to be understood, the accuracy of video understanding can also be improved.
In an embodiment, obtaining the text feature information, the dynamic feature information and the static feature information of the video to be understood by using the text feature acquisition network, the dynamic feature acquisition network and the static feature acquisition network in step S102 specifically includes: respectively acquiring to-be-processed text feature information, to-be-processed dynamic feature information and to-be-processed static feature information of the video to be understood by using the text feature acquisition network, the dynamic feature acquisition network and the static feature acquisition network; and respectively performing context coding on the to-be-processed text feature information, the to-be-processed dynamic feature information and the to-be-processed static feature information by using a context information acquisition network to obtain the text feature information, the dynamic feature information and the static feature information.
In this embodiment, the context information acquisition network may be a BiLSTM.
Specifically, the terminal respectively acquires the to-be-processed text characteristic information, the to-be-processed dynamic characteristic information and the to-be-processed static characteristic information of the to-be-understood video by using the text characteristic acquisition network, the dynamic characteristic acquisition network and the static characteristic acquisition network, and respectively performs context coding on the to-be-processed text characteristic information, the to-be-processed dynamic characteristic information and the to-be-processed static characteristic information by using the context information acquisition network to obtain the text characteristic information, the dynamic characteristic information and the static characteristic information.
Illustratively, as shown in fig. 2, the terminal uses BERT (the text feature acquisition network) to perform word embedding on the caption text of the video, uses a BiLSTM (the context information acquisition network) to perform context coding and generate feature vectors, and applies a 1-dimensional convolution and max pooling to form the final text feature vectors (text feature information), i.e. the s vectors. It then uses VGG16 (the static feature acquisition network) and C3D (the dynamic feature acquisition network) to extract the static and dynamic feature information of the video (corresponding to the to-be-processed static feature information and the to-be-processed dynamic feature information), and encodes them with BiLSTMs to form the a vectors and m vectors (the static feature information and the dynamic feature information). Specifically, as shown in fig. 2, the mutual-guidance feature extraction module of the video understanding model includes a video dynamic feature extraction module, a video static feature extraction module and a caption text feature extraction module.

Video dynamic feature extraction module: the terminal extracts the dynamic features of the video through a C3D network. The actual video input yields one dynamic feature per 16 frames, i.e. 16 × N; meanwhile, to align the context information of the different modalities of the video, a sliding-window sampling operation can be performed along the time dimension so that every frame contains dynamic feature information. The dynamic feature m_i of each frame is taken from the last fully connected layer, where m_i (i = 1, 2, ..., N) is the i-th video dynamic feature, giving the video dynamic features U_m = [m_1, m_2, ..., m_N] ∈ R^(4096×N), where N is the number of frames of the video. To preserve the context information of the dynamic features, they are encoded with a BiLSTM of the same dimension (the context information acquisition network), h_i^m = BiLSTM(m_i), giving the encoded video dynamic features (dynamic feature information) U_m = [h_1^m, h_2^m, ..., h_N^m] ∈ R^(4096×N), where h_i^m is the feature vector after encoding the i-th video dynamic feature, N is the number of frames of the video, and the superscript m denotes the video dynamic features.

Video static feature extraction module: the features of the static video frames are extracted with a VGG16 network model (the static feature acquisition network). To synchronize the context relations of the different modalities in the video and accurately infer how the static features evolve in the video time domain, static frames are extracted at 1 FPS, and the static feature a_i of each frame is taken from the second-to-last fully connected layer, where a_i (i = 1, 2, ..., N) is the i-th video static feature, giving the video static features U_a = [a_1, a_2, ..., a_N] ∈ R^(4096×N), where N is the number of frames of the video, R denotes the feature space and 4096×N its dimension. To obtain the context information of the static features, they are encoded with a BiLSTM of the same dimension, h_i^a = BiLSTM(a_i), giving the encoded video static features (static feature information) U_a = [h_1^a, h_2^a, ..., h_N^a] ∈ R^(4096×N), where h_i^a is the feature vector after encoding the i-th video static feature, N is the number of frames and the superscript a denotes the video static features.

Caption text feature extraction module: a BERT network model (the text feature acquisition network) is used to extract the caption text features; a 12-layer BERT model is selected, and the caption text feature s_i (corresponding to the to-be-processed text feature information) is taken from its second-to-last layer, where s_i (i = 1, 2, ..., N) is the i-th caption text feature, giving the video caption text features U_s = [s_1, s_2, ..., s_N] ∈ R^(768×N×L), where N is the number of frames of the video and L is the number of words in the sentence (BERT performs word embedding). To obtain the context information of the text, the caption text features are encoded with a BiLSTM of the same dimension, h_i^s = BiLSTM(s_i), giving the encoded caption text features U_s = [h_1^s, h_2^s, ..., h_N^s] ∈ R^(768×N×L), where h_i^s is the feature vector after encoding the i-th caption text feature, N is the number of frames, L is the sentence length and the superscript s denotes the caption text features. Finally, a (conv1d-ReLU-maxpool) block is applied to obtain the final text features (text feature information) U_s = [h_1^s, h_2^s, ..., h_N^s] ∈ R^(1024×N).
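To make the text branch concrete, the following is a minimal sketch of the BiLSTM context coding followed by the (conv1d-ReLU-maxpool) step that yields the final 1024-dimensional per-frame text features. It assumes per-token BERT embeddings are already available; the kernel size and all module and variable names are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class CaptionTextEncoder(nn.Module):
    """BiLSTM context coding of caption token embeddings, then conv1d-ReLU-maxpool
    to collapse the token dimension, giving one 1024-d text feature per frame."""

    def __init__(self, bert_dim=768, conv_out=1024):
        super().__init__()
        # The two LSTM directions together keep the 768-d embedding size.
        self.bilstm = nn.LSTM(bert_dim, bert_dim // 2, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(bert_dim, conv_out, kernel_size=3, padding=1)

    def forward(self, token_embeddings):
        # token_embeddings: (N, L, 768) - N frames, L caption tokens per frame.
        h, _ = self.bilstm(token_embeddings)      # (N, L, 768) context-coded tokens
        h = self.conv(h.transpose(1, 2))          # (N, 1024, L)
        h = torch.relu(h)
        u_s = torch.max(h, dim=2).values          # max-pool over tokens -> (N, 1024)
        return u_s                                # rows of U_s in R^(1024 x N)

encoder = CaptionTextEncoder()
u_s = encoder(torch.randn(6, 20, 768))            # e.g. 6 frames, 20 tokens each
```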
In the technical solution of this embodiment, context coding is performed on the to-be-processed text feature information, the to-be-processed dynamic feature information and the to-be-processed static feature information using the context information acquisition network to obtain their context information, so more accurate text, dynamic and static feature information can be obtained, which improves the accuracy of video understanding.
In an embodiment, the acquiring, based on the text feature information, the dynamic feature information, and the static feature information in step S103, an understanding result of the video to be understood specifically includes: performing guiding attention fusion on the text characteristic information, the dynamic characteristic information and the static characteristic information to obtain a plurality of modal characterization information; acquiring aggregated feature information of a plurality of modal characterization information by using a graph attention network; and acquiring an understanding result of the video to be understood according to the aggregation characteristic information.
In this embodiment, guided attention fusion may refer to performing attention weight analysis on the text feature information, dynamic feature information and static feature information; the modal characterization information may be the feature vectors most relevant to the question, selected after the attention weight analysis; the graph attention network may be a graph attention network (GAT); and the aggregated feature information may be the aggregated features of the different modalities obtained by training the graph attention network.
Specifically, the terminal performs guided attention fusion on the text feature information, dynamic feature information and static feature information to obtain a plurality of modal characterization information, obtains the aggregated feature information of the modal characterization information using the graph attention network, and obtains the understanding result of the video to be understood according to the aggregated feature information.
Illustratively, as shown in fig. 2, three guided attention modules are designed to obtain the relevant modal features. Because the soft attention mechanism is selective and differentiable, it is used to build the guided attention modules; a correlation matrix is used for fusion, and the fused features are concatenated.

The first guided attention module, a2m-Attention, iterates over N: the terminal takes the static feature vector h_i^a and the i-th row vector h_i^m of the video dynamic feature matrix as its input, where i denotes the i-th understanding pair and the i-th row dynamic feature vector of the video dynamic feature matrix, and the guided attention model is written Soft_Attention(h_i^a, h_i^m). Question-guided attention weights are then learned for the video dynamic features, features are generated with these attention weights, the generated features are concatenated, and linear and tanh transformations are applied for dimensionality reduction. In the guided attention formula (given as a figure in the original), T denotes the vector transpose, the features are jointly embedded into a d_ma × d_a dimensional space, h_j^e and m_i^e are the embedded representations of a_i and h_i^m respectively, and [,] denotes the concatenation operation. The video dynamic features are attended with the probabilities obtained after question-guided attention to give the fused feature vector; to improve the attention weights of the different modalities, the static feature representation is also attended and the feature vectors are concatenated. The differentiability of the guided attention mechanism makes these feature vectors learnable, and linear and nonlinear transformations finally yield the output feature vector. Attending in turn with N as the iteration condition gives the attended video dynamic feature matrix (which may also be N × 250), and max pooling is applied to each row for dimensionality reduction, giving the final feature matrix U_m = [u_1^m, u_2^m, ..., u_N^m] ∈ R^(512×N), where N is the number of frames of the video.

The second guided attention module, s2a-Attention (written a2s-Attention in fig. 2), applies the same soft attention mechanism to the caption text features. With N as the iteration condition, it takes the question feature vector h_i^a and the i-th row vector h_i^s of the video caption text feature matrix as input, where i denotes the i-th understanding pair and the i-th row caption text feature vector; the guided attention model is written Soft_Attention(h_i^a, h_i^s). Question-guided attention weights are then learned for the caption text features, features are generated with the attention weights and concatenated, and linear and tanh transformations are applied for dimensionality reduction to obtain the final feature vector. Attending in turn with N as the iteration condition gives the attended caption text feature matrix, and max pooling on each row gives its final feature matrix U_s = [u_1^s, u_2^s, ..., u_N^s] ∈ R^(512×N), where N is the number of frames of the video.

The third guided attention module, m2a-Attention, applies the same soft attention mechanism to the video static features. With N as the iteration condition, it takes the question feature vector h_i^m and the i-th row vector h_i^a of the video static feature matrix as input, where i denotes the i-th understanding pair and the i-th row static feature vector; the guided attention model is written Soft_Attention(h_i^m, h_i^a). Question-guided attention weights are then learned for the video static features, features are generated with the attention weights and concatenated, and linear and tanh transformations are applied for dimensionality reduction to obtain the final feature vector. Attending in turn with N as the iteration condition gives the attended video static feature matrix, and max pooling on each row gives the final feature matrix U_a = [u_1^a, u_2^a, ..., u_N^a] ∈ R^(512×N), where N is the number of frames of the video.

After the three attention modules, the row vectors of the different modal feature matrices are encoded with a 150-dimensional BiLSTM, with N as the iteration condition, giving the modal feature matrices U_a = [u_1^a, u_2^a, ..., u_N^a] ∈ R^(150×N), U_m = [u_1^m, u_2^m, ..., u_N^m] ∈ R^(150×N) and U_s = [u_1^s, u_2^s, ..., u_N^s] ∈ R^(150×N) (corresponding to the plurality of modal characterization information).
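As an illustration of one such guided attention module (a2m-Attention in the description above), here is a minimal soft-attention sketch in which a guiding vector (e.g. a static feature h_i^a) attends over the rows of a target modality matrix (e.g. the dynamic features), the attended and guiding embeddings are concatenated, and a linear plus tanh projection reduces the dimension. The 512-d output follows the text (the patent additionally max-pools each row); the exact attention formula in the patent is given only as a figure, so the formulation below is a generic soft attention and all names and sizes not mentioned in the text are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSoftAttention(nn.Module):
    """Generic soft attention: a guiding feature attends over a target modality
    matrix, the attended and guiding embeddings are concatenated, and a
    linear + tanh projection produces the fused feature."""

    def __init__(self, guide_dim=4096, target_dim=4096, embed_dim=512, out_dim=512):
        super().__init__()
        self.embed_guide = nn.Linear(guide_dim, embed_dim)
        self.embed_target = nn.Linear(target_dim, embed_dim)
        self.project = nn.Linear(2 * embed_dim, out_dim)

    def forward(self, guide, target):
        # guide:  (N, guide_dim)  one guiding vector per iteration (per frame)
        # target: (N, target_dim) the target modality matrix, one row per frame
        g = self.embed_guide(guide)                      # (N, d)
        t = self.embed_target(target)                    # (N, d)
        scores = g @ t.transpose(0, 1)                   # (N, N) guide-to-row relevance
        alpha = F.softmax(scores, dim=-1)                # attention weights per guide row
        attended = alpha @ t                             # (N, d) attended target features
        fused = torch.tanh(self.project(torch.cat([attended, g], dim=-1)))
        return fused                                     # (N, out_dim), e.g. rows of U_m

attn = GuidedSoftAttention()
u_m = attn(torch.randn(6, 4096), torch.randn(6, 4096))  # 6-frame example
```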
Next, as shown in fig. 2, the terminal feeds the obtained modal characterization information into the feature enhancement memory module. Multiple modalities improve the prior for video understanding, and the semantic relations between different modalities can markedly improve the reasoning ability of a video understanding model; however, simple operations such as vector addition and multiplication weaken these relations. To model the semantic relations between the different modalities accurately, a graph attention network (GAT) is adopted, and the aggregated features of the different modalities are obtained through training of the graph attention network. At the same time, to model the contextual temporal relations, the temporal relations in the video are modeled with a BiLSTM (the time-series memory update network).

To establish links between the modality information, an undirected fully connected graph G_i = {V_i, E_attention} is first defined, where V_i is the set of nodes of the graph attention network at the i-th iteration. The node category (type) indicates whether a node corresponds to the video dynamic features, the video static features or the caption text features, and the node index identifies the node. Each iteration has three vertices v_i^a, v_i^m, v_i^s, whose contents are the modal feature vectors u_i^a, u_i^m, u_i^s, i.e. the row vectors of the different modal feature matrices at the i-th iteration. E_attention is the set of edges between nodes in the graph attention network, expressed as the attention cross-correlation coefficients between different nodes. A modal object graph can thus be constructed, and graph attention layers are used to obtain mutually enhanced features between the modalities; two graph attention layers can be stacked, each producing the attended features of that layer from the adjacency matrix. Because a fully connected undirected graph is constructed, the adjacency matrix is initialized to the corresponding fully connected undirected graph before each round of graph training and is updated with the attention cross-correlation coefficients during training to form the attention coefficient matrix. The update uses a learnable weight matrix W and the cross-correlation coefficient matrix α, computed as follows: the input feature vectors first pass through a self-attention mechanism sa to obtain a sharable weight matrix, a LeakyReLU activation then provides the nonlinearity, and softmax regularization (applied within each row) yields the cross-correlation coefficient matrix. Each node then undergoes self-attention and multi-head attention (aggregated by element-wise summation), where K, the number of attention heads, can be set to 3, σ denotes the sigmoid activation function, and r denotes a neighbor node of node h; the second graph attention layer is taken as the output.

After the self-attention and multi-head attention computations of the different modal features over the graph attention network, a graph-embedded representation is obtained. The graph embedding is then reduced in dimension with a 1-dimensional convolution, and max pooling yields the most relevant feature o_i ∈ R^(1×150). To perform context correlation analysis and time-series memory on the different modal features of the video simultaneously, so that the model has multi-step reasoning ability, a bidirectional time-series memory network BiLSTM (the time-series memory update network, also called a bidirectional recurrent neural network) is adopted as the memory update unit, and the multi-modal features at different moments are memory-updated. A BiLSTM of the same dimension encodes o_i; N iterations can be performed over the temporal sequence of each video, and only the final output of the BiLSTM is kept, i.e. at t = N the output is h_j = BiLSTM(o_i)_{t=N}. Finally, the final modal vector is obtained after a fully connected layer, where j is the index of each understanding pair; the candidate answers are then gathered to give the final modal features used for prediction.

The terminal can then define the video understanding task as a generative understanding task: the outputs after fusion and inference are concatenated, a fully connected layer (FC) applies linear and nonlinear transformations, and training uses a cross-entropy loss. Specifically, softmax first converts the features into prediction scores, which are optimized with the cross-entropy loss, where y_GT denotes the ground-truth label, k denotes the k-th prediction and N denotes the number of samples. Finally, the terminal's prediction module selects the maximum prediction score as the final result; a softmax normalization layer, Pro = softmax(P_q), produces the prediction scores, and the highest score is taken as the final prediction, y = max(Pro). Based on the understanding result obtained with the video understanding model, the terminal can also provide intelligent early warning and real-time detection for the current specific-event system.
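The graph attention computation sketched above (cross-correlation coefficients from a shared weight matrix, LeakyReLU, per-row softmax, and K-head aggregation with a sigmoid) can be illustrated with a minimal single-layer graph attention sketch over the three modality nodes. The specific formulas in the patent are given only as figures, so this is a generic graph attention layer under that reading, with K = 3 heads and head averaging standing in for the summation; all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Minimal multi-head graph attention layer over a fully connected graph
    of modality nodes (dynamic, static, caption text)."""

    def __init__(self, in_dim=150, out_dim=150, heads=3):
        super().__init__()
        self.heads = heads
        self.out_dim = out_dim
        self.W = nn.Linear(in_dim, out_dim * heads, bias=False)    # shared weight matrix
        self.attn = nn.Parameter(torch.randn(heads, 2 * out_dim))  # self-attention vector

    def forward(self, x):
        # x: (V, in_dim) node features; the graph is fully connected, so every
        # node attends to every node (including itself).
        V = x.size(0)
        h = self.W(x).view(V, self.heads, self.out_dim)            # (V, K, d)
        hi = h.unsqueeze(1).expand(V, V, self.heads, self.out_dim)
        hj = h.unsqueeze(0).expand(V, V, self.heads, self.out_dim)
        e = F.leaky_relu((torch.cat([hi, hj], dim=-1) * self.attn).sum(-1))  # (V, V, K)
        alpha = F.softmax(e, dim=1)                                # per-row softmax
        out = torch.einsum('ijk,jkd->ikd', alpha, h)               # aggregate neighbours
        return torch.sigmoid(out.mean(dim=1))                      # (V, d) enhanced features

gat = SimpleGATLayer()
nodes = torch.randn(3, 150)   # v_i^a, v_i^m, v_i^s for one iteration
enhanced = gat(nodes)         # mutually enhanced modality features
```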
In the technical solution of this embodiment, the graph attention network is used to aggregate the modal characterization information obtained after guided attention fusion, and the understanding result of the video to be understood is obtained from the aggregated feature information, which improves the accuracy of video understanding.
In an embodiment, acquiring the understanding result of the video to be understood according to the aggregation characteristic information may further include the following steps: performing time sequence memory updating on the aggregation characteristic information by using a time sequence memory updating network; and acquiring the understanding result of the video to be understood according to the aggregation characteristic information updated by the time sequence memory.
In this embodiment, as shown in fig. 2, the timing memory update network may be a bidirectional timing memory network BiLSTM (also called bidirectional recurrent neural network).
Specifically, the terminal performs time sequence memory updating on the aggregation characteristic information by using a time sequence memory updating network, and acquires an understanding result of the video to be understood according to the updated aggregation characteristic information of the time sequence memory.
According to the technical scheme of the embodiment, the understanding result of the video to be understood is obtained according to the aggregation characteristic information updated by the time sequence memory, so that the accuracy of video understanding is improved.
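A minimal sketch of this time-series memory update and prediction step, assuming the per-iteration aggregated features o_i ∈ R^150 are already available: a BiLSTM runs over the N iterations, only its final output is kept, and a fully connected layer plus softmax produces the prediction scores trained with cross-entropy. Sizes and names beyond those mentioned in the text are assumptions.

```python
import torch
import torch.nn as nn

class MemoryUpdateHead(nn.Module):
    """BiLSTM time-series memory update over the aggregated features, followed
    by a fully connected prediction layer over the preset video types."""

    def __init__(self, feat_dim=150, num_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, feat_dim // 2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, o):
        # o: (1, N, 150) the aggregated features o_i over the N iterations of one video.
        h, _ = self.bilstm(o)       # (1, N, 150) memory-updated features
        h_final = h[:, -1, :]       # keep only the final output (t = N)
        return self.fc(h_final)     # (1, num_classes) prediction scores P_q

head = MemoryUpdateHead()
scores = head(torch.randn(1, 6, 150))
probs = torch.softmax(scores, dim=-1)                     # Pro = softmax(P_q)
loss = nn.CrossEntropyLoss()(scores, torch.tensor([2]))   # cross-entropy against the label
pred = probs.argmax(dim=-1)                               # highest score as the final prediction
```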
In an embodiment, the method may further train a text feature acquisition network, a dynamic feature acquisition network, a static feature acquisition network, a graph attention network, and a time-series memory update network by the following steps, specifically including: acquiring a video sample carrying text information and an understanding result label corresponding to the video sample; and training a text feature acquisition network, a dynamic feature acquisition network, a static feature acquisition network, a graph attention network and a time sequence memory updating network by using the video sample and the understanding result label.
In this embodiment, as shown in fig. 3, the video sample carrying text information may be a video sample carrying caption description information obtained by manually summarizing the video; the understanding result label corresponding to the video sample may be the true understanding result of that sample, for example a label indicating which preset specific-event type the video sample belongs to.
Specifically, the terminal acquires a video sample carrying text information and an understanding result label corresponding to the video sample, and trains a text feature acquisition network, a dynamic feature acquisition network, a static feature acquisition network, a graph attention network and a time sequence memory updating network by using the video sample and the understanding result label.
Illustratively, as shown in fig. 3 and 4, a large number of videos from a specific-event system can be used to construct a video understanding data set, and a specific-event video understanding model based on a mutual-guidance graph attention network is designed, which also serves the goals of intelligent early warning and real-time detection for the current specific-event system. Specifically, S1: a large number of specific-event videos from the specific-event system are input, and a video understanding data set is output. As shown in fig. 3, the terminal first uses a large number of existing specific-event videos to construct a video understanding data set, JQYJ-video data. 400 videos of 1 minute each are selected as the original video data, and each video is split into 10-second segments (i.e. 6 segments per video); each segment is summarized by specific-event staff into a piece of caption description information (i.e. text information), giving 2400 (400 × 6) specific-event video segments and caption description files. The corresponding professionals classify the specific-event videos into five preset specific-event types (so each video sample has a corresponding understanding result label), until each preset specific-event type has 2400 video samples, giving 12000 (2400 × 5) video samples, each already carrying its corresponding text information. The 12000 video samples can be divided into a training set (which may contain 9000 samples), a validation set (which may contain 1000 samples) and a test set (which may contain 2000 samples). The terminal then acquires the 12000 video samples carrying text information together with the understanding result labels corresponding to them. S2: the video understanding data set formed in S1 is input, and a pre-trained model (i.e. the video understanding model) is output.
As shown in fig. 2, the pre-trained model may include three modules: a modal feature extraction module (which can be referred to simply as the mutual-guidance feature extraction module and can be divided into a modal feature extraction sub-module and a mutual-guidance attention sub-module), a feature enhancement memory module (referred to simply as the feature enhancement module) and a prediction module. The modal feature extraction module mainly extracts the multi-modal features: a trained BERT network model (the text feature acquisition network) performs word embedding on the caption text information and a BiLSTM (the context information acquisition network) performs context coding to generate features; a C3D network (the dynamic feature acquisition network) extracts the video dynamic features, again context-coded with a BiLSTM; and a VGG16 network model (the static feature acquisition network) extracts the video static features, also context-coded with a BiLSTM. The mutual-guidance attention module mainly obtains the most relevant feature information. The feature enhancement memory module mainly models the inference mechanism of the video understanding task with a graph attention network (GAT), i.e. it models the semantic relations of the different modal features with the graph attention network and performs time-series memory updates with a BiLSTM (the time-series memory update network). The prediction module mainly predicts the answer. The terminal inputs the video samples obtained in S1 and the understanding result labels into the pre-trained model and trains it, i.e. trains the text feature acquisition network, dynamic feature acquisition network, static feature acquisition network, graph attention network and time-series memory update network in the pre-trained model. S3: a specific-event video and the pre-trained model from S2 are input, and the model's video understanding detection data are output. The terminal acquires a video to be understood (such as a specific-event video), inputs it into the pre-trained model, and outputs the understanding result of the video (e.g. the video understanding detection data of the pre-trained model).
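For illustration, a minimal training-loop sketch under the data-set sizes described above (a 9000/1000/2000 split over five preset types); the tensors and the single-layer model are hypothetical stand-ins for the real data set and the video understanding model, not APIs from the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in for the 12000 captioned video samples and their labels.
features = torch.randn(12000, 6, 150)    # pre-extracted per-segment features (assumed)
labels = torch.randint(0, 5, (12000,))   # five preset specific-event types
dataset = TensorDataset(features, labels)
train_set, val_set, test_set = random_split(dataset, [9000, 1000, 2000])

model = nn.Sequential(nn.Flatten(), nn.Linear(6 * 150, 5))  # placeholder understanding model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):                                       # a few epochs, just to show the loop
    for x, y in DataLoader(train_set, batch_size=32, shuffle=True):
        optimizer.zero_grad()
        loss = criterion(model(x), y)                        # cross-entropy vs. result labels
        loss.backward()
        optimizer.step()
```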
In the technical solution of this embodiment, the text feature acquisition network, dynamic feature acquisition network, static feature acquisition network, graph attention network and time-series memory update network are trained with the video samples and understanding result labels, which helps improve the accuracy of each network model and thus the accuracy of video understanding.
In an embodiment, the acquiring, by the text feature acquisition network, text feature information of the video to be understood in step S102 specifically includes: detecting whether a video to be understood carries corresponding text information; if not, generating corresponding text information according to the video to be understood by using a text information generation model; and acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network according to the corresponding text information.
In this embodiment, the text information generation model may be an MDVC (Multi-modal Dense Video Captioning) model.
Specifically, the terminal detects whether the video to be understood carries corresponding text information; if not, it uses the text information generation model to generate corresponding text information from the video, and then uses the text feature acquisition network to acquire the text feature information of the video according to that text information.
Illustratively, the terminal detects whether a video to be understood carries corresponding text information, if not, the terminal generates dense caption information of the video to be understood by using a text information generation model (MDVC model), and extracts text characteristic information of the dense caption information by using a text characteristic acquisition network.
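This fallback logic can be sketched as a small control-flow function; `generate_dense_captions` and `extract_text_features` are hypothetical placeholders standing in for the MDVC-style captioner and the text feature acquisition network, respectively.

```python
from typing import Optional

def get_text_features(video_path: str, caption: Optional[str],
                      generate_dense_captions, extract_text_features):
    """Use the caption carried by the video if present; otherwise generate one
    with the text information generation model (e.g. an MDVC-style captioner),
    then extract text features from whichever caption is available."""
    if caption is None or not caption.strip():
        caption = generate_dense_captions(video_path)   # hypothetical captioner call
    return extract_text_features(caption)               # hypothetical text-branch call
```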
In this technical solution, the text information generation model generates corresponding text information from the video to be understood, and the text feature acquisition network acquires the text feature information of the video according to that text information, so text feature information can be obtained even when the video does not carry corresponding text information, which improves the accuracy of video understanding.
It should be understood that, although the steps in the flowcharts related to the embodiments as described above are sequentially displayed as indicated by arrows, the steps are not necessarily performed sequentially as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the execution order of the steps or stages is not necessarily sequential, but may be rotated or alternated with other steps or at least a part of the steps or stages in other steps.
Based on the same inventive concept, the embodiment of the present application further provides a video understanding apparatus for implementing the above-mentioned video understanding method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the video understanding apparatus provided below may refer to the limitations on the video understanding method in the foregoing, and details are not described herein again.
In one embodiment, as shown in fig. 5, a video understanding apparatus is provided, and the apparatus 500 may include:
a to-be-understood video obtaining module 501, configured to obtain a to-be-understood video;
a feature information obtaining module 502, configured to obtain text feature information, dynamic feature information, and static feature information of the video to be understood respectively by using a text feature obtaining network, a dynamic feature obtaining network, and a static feature obtaining network;
an understanding result obtaining module 503, configured to obtain an understanding result of the video to be understood based on the text feature information, the dynamic feature information, and the static feature information.
In an embodiment, the understanding result obtaining module 503 is further configured to perform attention-directing fusion on the text feature information, the dynamic feature information, and the static feature information to obtain a plurality of modal characterization information; acquiring aggregated feature information of the plurality of modal characterization information by using a graph attention network; and acquiring an understanding result of the video to be understood according to the aggregation characteristic information.
In one embodiment, the understanding result obtaining module 503 is further configured to perform a time series memory update on the aggregated feature information by using a time series memory update network; and acquiring an understanding result of the video to be understood according to the updated aggregation characteristic information of the time sequence memory.
In one embodiment, the understanding result obtaining module 503 is further configured to obtain a video sample carrying text information and an understanding result tag corresponding to the video sample; and training the text feature acquisition network, the dynamic feature acquisition network, the static feature acquisition network, the graph attention network and the time sequence memory updating network by using the video sample and the understanding result label.
In one embodiment, the to-be-understood video obtaining module 501 is further configured to detect whether the to-be-understood video carries corresponding text information; if not, generating corresponding text information according to the video to be understood by using a text information generation model; and acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network according to the corresponding text information.
In an embodiment, the to-be-understood video obtaining module 501 is further configured to obtain to-be-processed text feature information, to-be-processed dynamic feature information, and to-be-processed static feature information of the to-be-understood video respectively by using the text feature obtaining network, the dynamic feature obtaining network, and the static feature obtaining network; and respectively carrying out context coding on the text characteristic information to be processed, the dynamic characteristic information to be processed and the static characteristic information to be processed by utilizing a context information acquisition network to obtain the text characteristic information, the dynamic characteristic information and the static characteristic information.
Each of the modules in the video understanding apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a video understanding method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution of the present application is applied; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a video to be understood;
respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of a video to be understood by using a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing guiding attention fusion on the text characteristic information, the dynamic characteristic information and the static characteristic information to obtain a plurality of modal characterization information;
acquiring aggregated feature information of a plurality of modal characterization information by using a graph attention network;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
utilizing a time sequence memory updating network to carry out time sequence memory updating on the aggregation characteristic information;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information after the time sequence memory updating.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring a video sample carrying text information and an understanding result label corresponding to the video sample;
and training a text feature acquisition network, a dynamic feature acquisition network, a static feature acquisition network, a graph attention network and a time sequence memory updating network by using the video sample and the understanding result label.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
detecting whether a video to be understood carries corresponding text information;
if not, generating corresponding text information according to the video to be understood by using a text information generation model;
and acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network according to the corresponding text information.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
respectively acquiring to-be-processed text characteristic information, to-be-processed dynamic characteristic information and to-be-processed static characteristic information of a video to be understood by using a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and respectively carrying out context coding on the text characteristic information to be processed, the dynamic characteristic information to be processed and the static characteristic information to be processed by utilizing the context information acquisition network to obtain the text characteristic information, the dynamic characteristic information and the static characteristic information.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, performs the steps of:
acquiring a video to be understood;
respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of a video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing guiding attention fusion on the text characteristic information, the dynamic characteristic information and the static characteristic information to obtain a plurality of modal characterization information;
acquiring aggregated feature information of a plurality of modal characterization information by using a graph attention network;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
utilizing a time sequence memory updating network to carry out time sequence memory updating on the aggregation characteristic information;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information after the time sequence memory updating.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a video sample carrying text information and an understanding result label corresponding to the video sample;
and training a text feature acquisition network, a dynamic feature acquisition network, a static feature acquisition network, a graph attention network and a time sequence memory updating network by using the video sample and the understanding result label.
In one embodiment, the computer program when executed by the processor further performs the steps of:
detecting whether a video to be understood carries corresponding text information;
if not, generating corresponding text information according to the video to be understood by using a text information generation model;
and acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network according to the corresponding text information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
respectively acquiring to-be-processed text characteristic information, to-be-processed dynamic characteristic information and to-be-processed static characteristic information of a video to be understood by using a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and respectively carrying out context coding on the text characteristic information to be processed, the dynamic characteristic information to be processed and the static characteristic information to be processed by utilizing the context information acquisition network to obtain the text characteristic information, the dynamic characteristic information and the static characteristic information.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
acquiring a video to be understood;
respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of a video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing guiding attention fusion on the text characteristic information, the dynamic characteristic information and the static characteristic information to obtain a plurality of modal characterization information;
acquiring aggregated feature information of a plurality of modal characterization information by using a graph attention network;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
utilizing a time sequence memory updating network to carry out time sequence memory updating on the aggregation characteristic information;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information after the time sequence memory updating.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a video sample carrying text information and an understanding result label corresponding to the video sample;
and training a text feature acquisition network, a dynamic feature acquisition network, a static feature acquisition network, a graph attention network and a time sequence memory updating network by using the video sample and the understanding result label.
In one embodiment, the computer program when executed by the processor further performs the steps of:
detecting whether a video to be understood carries corresponding text information;
if not, generating corresponding text information according to the video to be understood by using a text information generation model;
and acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network according to the corresponding text information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
respectively acquiring to-be-processed text characteristic information, to-be-processed dynamic characteristic information and to-be-processed static characteristic information of a video to be understood by using a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and respectively carrying out context coding on the text characteristic information to be processed, the dynamic characteristic information to be processed and the static characteristic information to be processed by utilizing the context information acquisition network to obtain the text characteristic information, the dynamic characteristic information and the static characteristic information.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors involved in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method for video understanding, the method comprising:
acquiring a video to be understood;
respectively acquiring text characteristic information, dynamic characteristic information and static characteristic information of the video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
2. The method according to claim 1, wherein the acquiring an understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information comprises:
performing guiding attention fusion on the text characteristic information, the dynamic characteristic information and the static characteristic information to obtain a plurality of modal characterization information;
acquiring aggregated feature information of the plurality of modal characterization information by using a graph attention network;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information.
3. The method according to claim 2, wherein the acquiring an understanding result of the video to be understood according to the aggregation characteristic information comprises:
utilizing a time sequence memory updating network to carry out time sequence memory updating on the aggregation characteristic information;
and acquiring an understanding result of the video to be understood according to the aggregation characteristic information after the time sequence memory updating.
4. The method according to claim 3, further comprising:
acquiring a video sample carrying text information and an understanding result label corresponding to the video sample;
and training the text characteristic acquisition network, the dynamic characteristic acquisition network, the static characteristic acquisition network, the graph attention network and the time sequence memory updating network by using the video sample and the understanding result label.
5. The method according to claim 1, wherein the acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network comprises:
detecting whether the video to be understood carries corresponding text information;
if not, generating corresponding text information according to the video to be understood by using a text information generation model;
and acquiring the text characteristic information of the video to be understood by utilizing the text characteristic acquisition network according to the corresponding text information.
6. The method according to claim 1, wherein the acquiring the text characteristic information, the dynamic characteristic information and the static characteristic information of the video to be understood by utilizing the text characteristic acquisition network, the dynamic characteristic acquisition network and the static characteristic acquisition network respectively comprises:
respectively acquiring to-be-processed text characteristic information, to-be-processed dynamic characteristic information and to-be-processed static characteristic information of the to-be-understood video by using the text characteristic acquisition network, the dynamic characteristic acquisition network and the static characteristic acquisition network;
and respectively carrying out context coding on the text characteristic information to be processed, the dynamic characteristic information to be processed and the static characteristic information to be processed by utilizing a context information acquisition network to obtain the text characteristic information, the dynamic characteristic information and the static characteristic information.
7. A video understanding apparatus, characterized in that the apparatus comprises:
the to-be-understood video acquisition module is used for acquiring a to-be-understood video;
the characteristic information acquisition module is used for respectively acquiring the text characteristic information, the dynamic characteristic information and the static characteristic information of the video to be understood by utilizing a text characteristic acquisition network, a dynamic characteristic acquisition network and a static characteristic acquisition network;
and the understanding result acquisition module is used for acquiring the understanding result of the video to be understood based on the text characteristic information, the dynamic characteristic information and the static characteristic information.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.
CN202210242033.1A 2022-03-11 2022-03-11 Video understanding method, device, equipment, storage medium and computer program product Pending CN114913448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210242033.1A CN114913448A (en) 2022-03-11 2022-03-11 Video understanding method, device, equipment, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210242033.1A CN114913448A (en) 2022-03-11 2022-03-11 Video understanding method, device, equipment, storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN114913448A true CN114913448A (en) 2022-08-16

Family

ID=82763196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210242033.1A Pending CN114913448A (en) 2022-03-11 2022-03-11 Video understanding method, device, equipment, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN114913448A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876941A (en) * 2024-03-08 2024-04-12 杭州阿里云飞天信息技术有限公司 Target multi-mode model system, construction method, video processing model training method and video processing method


Similar Documents

Publication Publication Date Title
CN113535984B (en) Knowledge graph relation prediction method and device based on attention mechanism
US10599686B1 (en) Method and system for extracting information from graphs
US20200104729A1 (en) Method and system for extracting information from graphs
Wang et al. Research on healthy anomaly detection model based on deep learning from multiple time-series physiological signals
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
Zhao et al. Long-form video question answering via dynamic hierarchical reinforced networks
CN114841161A (en) Event element extraction method, device, equipment, storage medium and program product
CN114419351A (en) Image-text pre-training model training method and device and image-text prediction model training method and device
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN115409111A (en) Training method of named entity recognition model and named entity recognition method
CN114742210A (en) Hybrid neural network training method, traffic flow prediction method, apparatus, and medium
CN114428860A (en) Pre-hospital emergency case text recognition method and device, terminal and storage medium
CN114913448A (en) Video understanding method, device, equipment, storage medium and computer program product
Yuan et al. EFFC-Net: lightweight fully convolutional neural networks in remote sensing disaster images
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN115859989A (en) Entity identification method and system based on remote supervision
CN113609355B (en) Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning
CN115309862A (en) Causal relationship identification method and device based on graph convolution network and contrast learning
CN111476035B (en) Chinese open relation prediction method, device, computer equipment and storage medium
CN113822018A (en) Entity relation joint extraction method
Han et al. NSNP-DFER: A Nonlinear Spiking Neural P Network for Dynamic Facial Expression Recognition
Akalya devi et al. Multimodal emotion recognition framework using a decision-level fusion and feature-level fusion approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination