CN113392686A - Video analysis method, device and storage medium - Google Patents

Video analysis method, device and storage medium

Info

Publication number: CN113392686A
Application number: CN202011073795.0A
Authority: CN (China)
Prior art keywords: video, characteristic information, question, feature information, target
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 单瀛, 蔡佳音, 袁春
Current Assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Original Assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Application filed by Tsinghua University and Tencent Technology Shenzhen Co Ltd
Priority to CN202011073795.0A
Publication of CN113392686A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a video analysis method, apparatus and storage medium. The video analysis method includes: acquiring a video to be analyzed and a question to be answered related to the video to be analyzed; determining at least one type of video feature information corresponding to the video to be analyzed; determining question feature information corresponding to the question to be answered; inputting the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information; and determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information. In this way, when performing semantic-understanding analysis on a video, the question can guide targeted memorization of the video, improving the memory effect on long-term videos.

Description

Video analysis method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video analysis method, apparatus, and storage medium.
Background
Video question answering (VideoQA) aims at making high-level inferences about the spatio-temporal content of a video and inferring the correct answer to a given video-related question described in natural language.
At present, the typical technical scheme for the video question-answering task is to extract video representation vectors with a trained deep learning model, fuse and memorize the features of the two modalities (video and question) through an attention mechanism or a memory model, and finally generate an answer through a classifier.
However, a conventional memory module memorizes a large amount of video information irrelevant to the question, which results in a poor memory effect on long-term video information.
Disclosure of Invention
The application aims to provide a video analysis method, apparatus and storage medium that improve the memory effect on long-term videos.
An embodiment of the present application provides a video analysis method, which includes the following steps:
acquiring a video to be analyzed and a question to be answered related to the video to be analyzed;
determining at least one type of video feature information corresponding to the video to be analyzed;
determining question feature information corresponding to the question to be answered;
inputting the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information;
and determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
A minimal end-to-end sketch of these steps is given below.
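For orientation, the following is a minimal, non-normative sketch of how the five steps can be wired together. Every helper name below is a hypothetical stub standing in for a component described in the detailed embodiments; none of these names come from the disclosure itself.

```python
# Hypothetical skeleton of the claimed method; each stub stands in for a
# component described in the detailed embodiments further below.

def sample_frames(video_path: str) -> list:
    raise NotImplementedError  # S1021: frame sampling

def extract_video_features(frames: list):
    raise NotImplementedError  # S1022: dynamic (C3D-like) + static (ResNet-like) features

def encode_question(question: str):
    raise NotImplementedError  # S103: word embedding + LSTM encoding

def video_memory_model(video_feats, question_feats):
    raise NotImplementedError  # S104: question-guided video memory

def answer_classifier(target_feats, question_feats) -> str:
    raise NotImplementedError  # S105: fusion + classification

def answer_video_question(video_path: str, question: str) -> str:
    frames = sample_frames(video_path)                 # step 1: acquire video and frames
    video_feats = extract_video_features(frames)       # step 2: video feature information
    question_feats = encode_question(question)         # step 3: question feature information
    target_feats = video_memory_model(video_feats, question_feats)  # step 4: first target features
    return answer_classifier(target_feats, question_feats)          # step 5: answer information
```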
An embodiment of the present application further provides a video analysis apparatus, including:
an acquisition module, configured to acquire a video to be analyzed and a question to be answered related to the video to be analyzed;
a first determining module, configured to determine at least one type of video feature information corresponding to the video to be analyzed;
a second determining module, configured to determine question feature information corresponding to the question to be answered;
a third determining module, configured to input the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information;
and a fourth determining module, configured to determine answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
In some embodiments, the trained video memory model includes a first submodel and a second submodel, and the first determining module specifically includes:
an extraction unit, configured to extract a plurality of video frames from the video to be analyzed;
and a first determining unit, configured to determine at least one type of video feature information corresponding to each video frame.
The third determining module specifically includes:
a second determining unit, configured to sequentially input the at least one type of video feature information corresponding to the plurality of video frames into the first submodel in time order for processing, so as to obtain first memory content corresponding to each video frame;
and a third determining unit, configured to determine, according to the at least one type of video feature information and the question feature information corresponding to the plurality of video frames, the first memory content and the second submodel, the first target feature information related to the question to be answered from the video feature information corresponding to each video frame.
The third determining unit is specifically configured to:
determine a current video frame from the plurality of video frames according to the time order, and acquire the first memory content and the first target feature information corresponding to the previous video frame as the first historical memory content and the first historical feature information, respectively;
input the at least one type of video feature information, the question feature information, the first historical memory content and the first historical feature information corresponding to the current video frame into the second submodel for processing, so that the second submodel determines the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame;
and update the first historical memory content and the first historical feature information with the first memory content and the first target feature information corresponding to the current video frame respectively, update the current video frame with the remaining video frames, and then return to the step of inputting the at least one type of video feature information, the question feature information, the first historical memory content and the first historical feature information corresponding to the current video frame into the second submodel for processing.
In some embodiments, the at least one type of video feature information includes dynamic feature information and static feature information, and the first target feature information includes target dynamic feature information, target static feature information and target global feature information. Determining the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame then specifically includes:
determining, according to the dynamic feature information, the first historical memory content, the first historical feature information and the question feature information corresponding to the current video frame, the target dynamic feature information related to the question to be answered from the dynamic feature information corresponding to the current video frame;
determining, according to the static feature information, the first historical memory content, the first historical feature information and the question feature information corresponding to the current video frame, the target static feature information related to the question to be answered from the static feature information corresponding to the current video frame;
and determining, according to the first historical memory content, the first historical feature information and the question feature information, the target global feature information corresponding to the current video frame and related to the question to be answered.
The fourth determining module specifically includes:
a fourth determining unit, configured to input the at least one type of video feature information and the question feature information into a trained question memory model for processing, so as to determine second target feature information related to the video to be analyzed from the question feature information;
and a fifth determining unit, configured to determine answer information corresponding to the question to be answered according to the first target feature information and the second target feature information.
In some embodiments, the question feature information includes a plurality of word feature information, the trained question memory model includes a third submodel and a fourth submodel, and the fourth determining unit specifically includes:
a first determining subunit, configured to sequentially input the plurality of word feature information into the third submodel according to the word order of the question to be answered for processing, so as to obtain second memory content corresponding to each word feature information;
and a second determining subunit, configured to determine, according to the plurality of word feature information, the at least one type of video feature information, the second memory content and the fourth submodel, the second target feature information related to the video to be analyzed from each word feature information.
The second determining subunit is specifically configured to:
determine current word feature information from the plurality of word feature information according to the word order, and acquire the second memory content and the second target feature information corresponding to the previous word feature information as the second historical memory content and the second historical feature information, respectively;
input the current word feature information, the at least one type of video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing, so that the fourth submodel determines the second target feature information related to the video to be analyzed from the current word feature information;
and update the second historical memory content and the second historical feature information with the second memory content and the second target feature information corresponding to the current word feature information respectively, update the current word feature information with the remaining word feature information, and then return to the step of inputting the current word feature information, the at least one type of video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing.
The fifth determining unit specifically includes:
a third determining subunit, configured to obtain a first target feature matrix from the first target feature information corresponding to the plurality of video frames, and a second target feature matrix from the second target feature information corresponding to the plurality of word feature information;
a fourth determining subunit, configured to input the first target feature matrix into a trained first self-attention model for processing to obtain first semantic long-range dependency information of the first target feature information, and to input the second target feature matrix into a trained second self-attention model for processing to obtain second semantic long-range dependency information of the second target feature information;
and a fifth determining subunit, configured to determine answer information corresponding to the question to be answered according to the first semantic long-range dependency information and the second semantic long-range dependency information.
An illustrative sketch of the fifth determining unit is given below.
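A possible sketch of the fifth determining unit, assuming PyTorch's nn.MultiheadAttention for the two self-attention models; the pooling and fusion details (mean-pooling plus concatenation here) are not specified in this summary and are assumptions.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Self-attention over the two target feature matrices, then a linear
    answer classifier; mean-pool + concat fusion is an assumed detail."""
    def __init__(self, d: int = 512, num_answers: int = 1000, heads: int = 8):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(d, heads, batch_first=True)     # first self-attention model
        self.question_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # second self-attention model
        self.classifier = nn.Linear(2 * d, num_answers)

    def forward(self, video_targets: torch.Tensor, question_targets: torch.Tensor) -> torch.Tensor:
        # video_targets: (1, N, d) first target feature matrix
        # question_targets: (1, T, d) second target feature matrix
        v, _ = self.video_attn(video_targets, video_targets, video_targets)          # long-range deps in video
        q, _ = self.question_attn(question_targets, question_targets, question_targets)
        fused = torch.cat([v.mean(dim=1), q.mean(dim=1)], dim=-1)                    # (1, 2d)
        return self.classifier(fused)                                                # answer logits
```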
An embodiment of the present application further provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to execute any one of the above video analysis methods.
An embodiment of the present application further provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above video analysis methods when executing the computer program.
According to the video analysis method, apparatus and storage medium provided above, a video to be analyzed and a question to be answered related to it are acquired; at least one type of video feature information corresponding to the video and question feature information corresponding to the question are determined; the video feature information and the question feature information are input into a trained video memory model for processing, so that first target feature information related to the question is determined from the video feature information; and answer information corresponding to the question is determined from the first target feature information and the question feature information. In this way, when performing semantic-understanding analysis on a video, the question guides targeted memorization of the video, improving the memory effect on long-term videos.
Drawings
The technical solutions and other advantages of the present application will become apparent from the following detailed description of its embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic view of a scene of a video analysis system provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a video analysis method provided in an embodiment of the present application;
FIG. 3 is a screenshot of a video to be analyzed according to an embodiment of the present application;
fig. 4 is another schematic flow chart of a video analysis method provided in an embodiment of the present application;
fig. 5 is a schematic execution flow diagram of a video analysis method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a video memory model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a problem memory model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating the effect of performing targeted memory on a video to be analyzed and a question to be answered according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a video analysis apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial intelligence is a theory, method, technique and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of technologies at both the hardware and the software level. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The video analysis method provided by the embodiments of the present application performs semantic-understanding analysis on video content through computer vision technology. Computer vision is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, track and measure targets, and further processing the resulting images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of capturing information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, in particular to a video analysis method, a video analysis device and a storage medium.
Referring to fig. 1, fig. 1 is a schematic view of a scene of a video analysis system according to an embodiment of the present disclosure, where the video analysis system may include any one of the video analysis devices according to the embodiment of the present disclosure, and the video analysis device may be specifically integrated in a server, such as a video server, where the server may be a single server or a server cluster composed of multiple servers.
The server can acquire a video to be analyzed and a question to be answered related to the video to be analyzed; determine at least one type of video feature information corresponding to the video to be analyzed; determine question feature information corresponding to the question to be answered; input the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information; and determine answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
In addition, the video analysis system may further include a terminal connected to the server via a network, where the terminal may be a device with a video playing function, such as a smartphone, a tablet computer, a smart Bluetooth device, a notebook computer, or a personal computer (PC).
Specifically, as shown in fig. 1, the terminal may play a video V1 and, during playback, receive a question input by the user for the played video V1 (for example, "What is the man doing after lifting his opponent and before punching his face?") to trigger the server to analyze the played video V1; the terminal may then receive the answer sent by the server (for example, "throw") and provide it to the user.
As shown in fig. 2, which is a schematic flowchart of the video analysis method provided in the embodiment of the present application, the specific flow of the video analysis method may be as follows:
S101, acquiring a video to be analyzed and a question to be answered related to the video to be analyzed.
Here, the question to be answered may be a question input on the terminal for the video to be analyzed while the user watches that video; specifically, it may be a question described in natural language. For example, as shown in fig. 3, for the video to be analyzed V2, the related question may be "What is the woman holding?". The video to be analyzed may be a complete existing video stored locally on the terminal or on the server. It can be understood that the video to be analyzed contains content related to the answer, that is, semantic understanding of the video through computer vision technology can yield the answer corresponding to the question. For example, as shown in fig. 3 (an image of one frame of the video to be analyzed V2), the answer corresponding to the question "What is the woman holding?" is "cat".
S102, determining at least one type of video feature information corresponding to the video to be analyzed.
As shown in fig. 4, S102 may specifically include:
S1021, extracting a plurality of video frames from the video to be analyzed.
Specifically, since a video contains a very large number of frames, processing every frame easily introduces unnecessary redundancy and excessive computation. Therefore, the video analysis apparatus may extract a plurality of video frames from the video to be analyzed at a preset time interval (e.g., 1 second) or a preset frame interval (e.g., 60 frames), where each video frame is a still image. For example, for a video of 3 minutes total duration, the video analysis apparatus may extract one frame every second, starting from the first frame, to obtain 181 video frames. It can be understood that the sampling time interval or frame interval should be chosen appropriately, improving the efficiency of video analysis without compromising the semantic-understanding accuracy of the video. A minimal sketch of this sampling step follows.
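A minimal sketch of the sampling step, assuming OpenCV (cv2) is available; the 1-second interval matches the example above.

```python
import cv2

def sample_frames(video_path: str, interval_sec: float = 1.0) -> list:
    """Extract one still frame every `interval_sec` seconds, starting from
    the first frame (a 3-minute video at 1 s intervals yields 181 frames)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0       # fall back if FPS metadata is missing
    step = max(int(round(fps * interval_sec)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                     # keep the first frame, then every step-th
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```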
S1022, determining at least one type of video feature information corresponding to each video frame.
Specifically, the at least one type of video feature information may include dynamic feature information and static feature information, where the static feature information characterizes a video frame itself, and the dynamic feature information characterizes the dynamics of the process from another video frame to this video frame in the video to be analyzed.
In a specific embodiment, as shown in fig. 4, S1022 may include:
S1-1, extracting original dynamic feature information from each video frame by using a dynamic feature extraction network, and encoding the original dynamic feature information to obtain the corresponding dynamic feature information.
Specifically, as shown in fig. 5, the video analysis apparatus may input the plurality of video frames F extracted from the video to be analyzed into a dynamic feature extraction network (e.g., a trained C3D network, i.e., a 3-dimensional convolutional network) for processing, so as to obtain the original dynamic feature information f^m_t corresponding to each video frame, and sort the original dynamic feature information of the plurality of video frames in time order to obtain the original dynamic feature sequence F_m = (f^m_1, f^m_2, ..., f^m_N) of the video to be analyzed. Then, the video analysis apparatus may input the sequence F_m into a first encoder B1 composed of a long short-term memory network (LSTM) for encoding, so as to obtain the dynamic feature sequence I_m = (i^m_1, i^m_2, ..., i^m_N) corresponding to the video to be analyzed, where the t-th video frame corresponds to the t-th dynamic feature information i^m_t in I_m. Here N equals the number of sampled frames, the superscript m denotes the extracted dynamic (motion) features of the video, and t is a natural number in the interval [1, N].
S1-2, extracting original static feature information from each video frame by using a static feature extraction network, and encoding the original static feature information to obtain the corresponding static feature information.
Specifically, as shown in fig. 5, the video analysis apparatus may input the plurality of video frames F extracted from the video to be analyzed into a static feature extraction network (e.g., a trained ResNet network, i.e., a residual network, or a VGG network, i.e., a deep convolutional neural network) for processing, so as to obtain the original static feature information f^a_t corresponding to each video frame, and sort the original static feature information of the plurality of video frames in time order to obtain the original static feature sequence F_a = (f^a_1, f^a_2, ..., f^a_N) of the video to be analyzed. Then, the video analysis apparatus may input the sequence F_a into a second encoder B2 composed of an LSTM for encoding, so as to obtain the static feature sequence I_a = (i^a_1, i^a_2, ..., i^a_N) corresponding to the video to be analyzed, where the t-th video frame corresponds to the t-th static feature information i^a_t in I_a. Here the superscript a denotes the extracted static (appearance) features. A sketch of these two visual branches follows.
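A sketch of the two visual branches, assuming PyTorch and torchvision. torchvision ships no C3D network, so the R3D-18 video model stands in for the dynamic branch here, with ResNet-18 for the static branch; the encoders B1 and B2 are plain LSTMs as described.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.models.video import r3d_18  # stand-in for the C3D network

class VisualEncoder(nn.Module):
    """Per-frame static features and per-clip dynamic features, each encoded
    by its own LSTM (the first and second encoders B1 and B2)."""
    def __init__(self, d: int = 512):
        super().__init__()
        static = resnet18(weights="DEFAULT")
        self.static_net = nn.Sequential(*list(static.children())[:-1])   # 512-d per frame
        motion = r3d_18(weights="DEFAULT")
        self.dynamic_net = nn.Sequential(*list(motion.children())[:-1])  # 512-d per clip
        self.enc_dynamic = nn.LSTM(512, d, batch_first=True)  # B1: I_m = (i^m_1..i^m_N)
        self.enc_static = nn.LSTM(512, d, batch_first=True)   # B2: I_a = (i^a_1..i^a_N)

    def forward(self, frames: torch.Tensor, clips: torch.Tensor):
        # frames: (N, 3, H, W) sampled stills; clips: (N, 3, T, H, W) short clips around them
        f_a = self.static_net(frames).flatten(1).unsqueeze(0)                  # F_a: (1, N, 512)
        f_m = torch.stack([self.dynamic_net(c.unsqueeze(0)).flatten()
                           for c in clips]).unsqueeze(0)                       # F_m: (1, N, 512)
        i_m, _ = self.enc_dynamic(f_m)   # dynamic feature sequence I_m
        i_a, _ = self.enc_static(f_a)    # static feature sequence I_a
        return i_m.squeeze(0), i_a.squeeze(0)                                  # (N, d) each
```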
S103, determining question feature information corresponding to the question to be answered.
As shown in fig. 4, S103 may specifically include:
S1031, converting each word in the question to be answered into a corresponding word vector, so as to obtain a question embedding representation corresponding to the question to be answered.
Specifically, as shown in fig. 5, the video analysis apparatus may convert the question to be answered into a word sequence C = (c_1, c_2, ..., c_T) according to its word order, where the t-th word of the question corresponds to the t-th element c_t of C. Next, an embedding layer may map each word c_t to its semantic representation, initialized with a trained GloVe model, so as to obtain a 300-dimensional word vector q_t and thus the question embedding representation Q = (q_1, q_2, ..., q_T) corresponding to the question to be answered. Here T is the number of words in the question, and t is a natural number in the interval [1, T].
S1032, encoding the question embedding representation to obtain the question feature information corresponding to the question to be answered.
Specifically, the question embedding representation Q may be input into a third encoder B3 composed of a long short-term memory network for encoding, so as to obtain the question feature information I_q = (i^q_1, i^q_2, ..., i^q_T) corresponding to the question to be answered. It can be understood that I_q contains a plurality of word feature information i^q_t, where i^q_t corresponds to the t-th word of the question; T is the number of words in the question, and t is a natural number in the interval [1, T]. A sketch of this question encoder follows.
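A sketch of steps S1031 and S1032, assuming PyTorch; loading actual GloVe vectors is outside this snippet, so the GloVe initialization is modeled as an optional weight copy.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Maps each word to a 300-d vector (GloVe-initialized in the text) and
    encodes the sequence with an LSTM (the third encoder B3)."""
    def __init__(self, vocab_size: int, d: int = 512, glove_weights: torch.Tensor | None = None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        if glove_weights is not None:                 # optional GloVe initialization
            self.embed.weight.data.copy_(glove_weights)
        self.enc = nn.LSTM(300, d, batch_first=True)  # third encoder B3

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (T,) token indices for the question's T words
        q = self.embed(word_ids).unsqueeze(0)         # question embedding Q: (1, T, 300)
        i_q, _ = self.enc(q)                          # word feature sequence I_q
        return i_q.squeeze(0)                         # (T, d)
```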
S104, inputting the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information.
The at least one type of video feature information may include the dynamic feature information i^m_t and the static feature information i^a_t, each obtained by extracting original video feature information from the corresponding video frame with the corresponding feature extraction network and encoding it. The question feature information may include a plurality of word feature information i^q_t, each obtained by converting the corresponding word of the question to be answered into a word vector and encoding it. The trained video memory model may include a first submodel and a second submodel connected to the first submodel, and S104 may specifically include:
s1041, sequentially inputting at least one video characteristic information corresponding to a plurality of video frames into the first sub-model according to a time sequence for processing, so as to obtain a first memory content corresponding to each video frame.
Specifically, S1041 may specifically include:
s2-1, determining a current video frame from a plurality of video frames according to the time sequence, and acquiring first target characteristic information and first memory content corresponding to the previous video frame as first historical characteristic information and first historical memory content respectively.
Specifically, as shown in fig. 6, the video memory model may further include a video memory layer M_v and at least one hidden layer h^m / h^a / h^v, where the memory layer M_v = (m_1, m_2, ..., m_S), i.e., the video memory layer M_v has S memory states. In this embodiment, when the current video frame is the first video frame in time order among the plurality of video frames extracted from the video to be analyzed (that is, no earlier video frame exists), the video analysis apparatus may take the initial state parameter values h^m_0, h^a_0 and h^v_0 of the hidden layers h^m, h^a and h^v as the first target feature information corresponding to the "previous" video frame, and take the initial state parameter value M_0 of the memory layer M_v as the first memory content corresponding to the "previous" video frame. The initial state parameter values of the memory layer M_v and the hidden layers h^m, h^a and h^v can be obtained by pre-training the video memory model.
S2-2, inputting the at least one type of video feature information, the first historical feature information and the first historical memory content corresponding to the current video frame into the first submodel for processing, so that the first submodel determines the first memory content corresponding to the current video frame.
Specifically, after the first memory content corresponding to the current video frame is obtained, the state parameter values of the memory layer M_v in the video memory model may be updated with this first memory content, so as to store it in M_v.
S2-3, updating the first historical feature information and the first historical memory content with the first target feature information and the first memory content corresponding to the current video frame respectively, updating the current video frame with the remaining video frames, and then returning to step S2-2.
In this way, S2-2 and S2-3 form a loop; each iteration obtains the first memory content corresponding to the updated current video frame, until the first memory contents corresponding to all the video frames are obtained. A sketch of this recurrence is given below.
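A sketch of the S2-1 to S2-3 recurrence; `first_submodel` is assumed to be a callable implementing the memory-content computation described below (formulas (5) to (10)).

```python
def run_first_submodel(first_submodel, video_feats, M0, h0):
    """Walk the sampled frames in time order, feeding each frame's features
    plus the previous frame's states, and collect one memory state per frame."""
    memories, M_prev, h_prev = [], M0, h0
    for feats in video_feats:                    # one entry per frame, in time order
        M_t = first_submodel(feats, h_prev, M_prev)
        memories.append(M_t)
        M_prev = M_t                             # becomes the first historical memory content
        # h_prev is refreshed by the second submodel's output in the full model (S3-2)
    return memories
```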
S1042, determining, according to the at least one type of video feature information and the question feature information corresponding to the plurality of video frames, the first memory content and the second submodel, the first target feature information related to the question to be answered from the video feature information corresponding to each video frame.
Specifically, S1042 may include:
S3-1, determining a current video frame from the plurality of video frames according to the time order, and acquiring the first memory content and the first target feature information corresponding to the previous video frame as the first historical memory content and the first historical feature information, respectively.
The specific embodiment of step S3-1 can be found in that of step S2-1 and is not repeated here.
S3-2, inputting the at least one type of video feature information, the question feature information, the first historical memory content and the first historical feature information corresponding to the current video frame into the second submodel for processing, so that the second submodel determines the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame.
Specifically, after the first target feature information corresponding to the current video frame is obtained, the state parameter values of the hidden layers h^m / h^a / h^v in the video memory model may be updated with this first target feature information, so as to store it in the at least one hidden layer.
S3-3, updating the first historical memory content and the first historical feature information with the first memory content and the first target feature information corresponding to the current video frame respectively, updating the current video frame with the remaining video frames, and then returning to step S3-2.
In this way, S3-2 and S3-3 form a loop; each iteration determines the first target feature information related to the question to be answered from the video feature information corresponding to the updated current video frame, until the first target feature information corresponding to all the video frames is obtained.
Updating the current video frame with the remaining video frames can be understood as replacing the current video frame with the video frame immediately following it in time order among the remaining video frames.
In a specific embodiment, the at least one type of video feature information may include dynamic feature information and static feature information; accordingly, the first target feature information includes target dynamic feature information, target static feature information and target global feature information. The second submodel determining the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame may specifically include:
determining, according to the dynamic feature information, the first historical memory content, the question feature information and the first historical feature information corresponding to the current video frame, the target dynamic feature information related to the question to be answered from the dynamic feature information corresponding to the current video frame;
determining, according to the static feature information, the first historical memory content, the question feature information and the first historical feature information corresponding to the current video frame, the target static feature information related to the question to be answered from the static feature information corresponding to the current video frame;
and determining, according to the first historical memory content, the question feature information and the first historical feature information, the target global feature information corresponding to the current video frame and related to the question to be answered.
Specifically, after the target dynamic feature information, the target static feature information and the target global feature information corresponding to the current video frame are obtained, the state parameter values of the hidden layers h^m, h^a and h^v in the video memory model may be updated with them respectively, so that the target dynamic feature information is stored in the hidden layer h^m, the target static feature information in the hidden layer h^a, and the target global feature information in the hidden layer h^v. The target global feature information may be used to represent the fused semantic information of the dynamic and static feature information of the video to be analyzed.
Correspondingly, the first submodel determining the first memory content corresponding to the current video frame may specifically include:
determining the dynamic feature memory content corresponding to the current video frame according to the target dynamic feature information contained in the first historical feature information and the dynamic feature information corresponding to the current video frame;
determining the static feature memory content corresponding to the current video frame according to the target static feature information contained in the first historical feature information and the static feature information corresponding to the current video frame;
and determining the first memory content corresponding to the current video frame according to the dynamic feature memory content and the static feature memory content corresponding to the current video frame, the target global feature information contained in the first historical feature information, and the first historical memory content.
As a specific example, in the second submodel, the target dynamic feature information, the target static feature information and the target global feature information corresponding to the current video frame may be obtained through calculation formulas (1) to (4), of which formula (1) is:

r_t = β_t · M_{t-1}    (1)

where · denotes the inner product and the weights W are learnable. β_t is the read weight determined by formula (2) from the target dynamic feature information h^m_{t-1}, the target static feature information h^a_{t-1} and the target global feature information h^v_{t-1} corresponding to the previous video frame; FC denotes a fully-connected layer using tanh (the hyperbolic tangent function) as the nonlinear activation function, and in one embodiment the dimension d may be 512. r_t represents the content read from the memory layer M_v of the video memory model, specifically a weighted sum of the multiple memory states of M_v. Then, based on the currently read content r_t, the currently input dynamic feature information i^m_t, static feature information i^a_t and question feature information i^q, formulas (3) and (4) compute the target dynamic feature information h^m_t (i.e., the state parameter value of the hidden layer h^m at the current time t), the target static feature information h^a_t (the hidden layer h^a at time t) and the target global feature information h^v_t (the hidden layer h^v at time t); σ denotes the sigmoid function. In this embodiment, by including the question guidance in the state-parameter updates of the hidden layers at each time step, the question (text)-enhanced video memory module stores the video content most relevant to the question, thereby improving the storage efficiency of the video information. A sketch of the read step, under stated assumptions, follows.
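Only formula (1) survives extraction intact; the read-weight computation of formula (2) is therefore reconstructed here in a plausible but assumed form: FC plus tanh over the concatenated previous hidden states, dotted against each memory slot, then a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryRead(nn.Module):
    """r_t = beta_t . M_{t-1}  (formula (1)). The exact form of the read
    weight beta_t is an assumption; only its inputs are given in the text."""
    def __init__(self, d: int = 512):
        super().__init__()
        self.fc = nn.Linear(3 * d, d)  # learnable weights W; tanh activation as described

    def forward(self, h_m, h_a, h_v, M_prev):
        # h_m, h_a, h_v: (d,) previous hidden states; M_prev: (S, d) memory states
        key = torch.tanh(self.fc(torch.cat([h_m, h_a, h_v])))  # FC + tanh read key
        beta = F.softmax(M_prev @ key, dim=0)                  # (S,) read weights beta_t
        return beta @ M_prev                                   # r_t: weighted sum of slots
```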
Correspondingly, in the first submodel, the first memory content M_t corresponding to the current video frame (i.e., the state parameter value of the memory layer M_v at the current time t) can be obtained through calculation formulas (5) to (10). Here c_t is a content vector determined by formulas (5) and (6) from the target dynamic feature information h^m_{t-1} and the target static feature information h^a_{t-1} corresponding to the previous video frame, where W is a learnable parameter and b is a bias. The content vector c_t is used to compute the write weights corresponding to the current video frame. When updating the state parameter values of the memory layer M_v, it is necessary to consider the respective proportions of the dynamic feature information and the static feature information of the video to be analyzed, as given by formula (7): a weight between 0 and 1 obtained from c_t via a softmax function. It is also necessary to consider how much of the first memory content corresponding to the previous video frame (i.e., the state parameter value of M_v at the previous time t-1) should be preserved at the current time t; this ratio is μ, a weight between 0 and 1 derived through a softmax function from g, where g is determined by the target global feature information h^v_{t-1} corresponding to the previous video frame and the content vector c_t of formulas (5) and (6), as given by formulas (8) and (9). Finally, the first memory content M_t corresponding to the current video frame can be calculated by formula (10). A correspondingly hedged sketch of this write step follows.
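Since the bodies of formulas (5) to (10) did not survive extraction, the following sketch only mirrors the described structure (content vector c_t, write weights, retention ratio μ); the exact gating forms, including the use of sigmoid in place of the stated softmax for the scalar μ and the blending of old and new memory, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryWrite(nn.Module):
    """Assumed reconstruction of the write step (formulas (5) to (10)):
    blend the previous memory M_{t-1} with new content written per slot."""
    def __init__(self, d: int = 512, S: int = 8):
        super().__init__()
        self.to_content = nn.Linear(2 * d, d)  # c_t from [h^m_{t-1}, h^a_{t-1}]; W and bias b
        self.to_write = nn.Linear(d, S)        # per-slot write weights from c_t (formula (7))
        self.to_retain = nn.Linear(2 * d, 1)   # retention ratio mu from g = f(h^v_{t-1}, c_t)

    def forward(self, h_m, h_a, h_v, M_prev):
        c_t = torch.tanh(self.to_content(torch.cat([h_m, h_a])))   # content vector c_t
        w = F.softmax(self.to_write(c_t), dim=0)                   # (S,) write weights in (0, 1)
        mu = torch.sigmoid(self.to_retain(torch.cat([h_v, c_t])))  # retention in (0, 1); assumed sigmoid
        return mu * M_prev + (1.0 - mu) * torch.outer(w, c_t)      # M_t: (S, d)
```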
S105, determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
As shown in fig. 4, S105 may specifically include:
S1051, inputting the at least one type of video feature information and the question feature information into a trained question memory model for processing, so as to determine second target feature information related to the video to be analyzed from the question feature information.
The question feature information may include a plurality of word feature information, the trained question memory model may include a third submodel and a fourth submodel connected to the third submodel, and S1051 may specifically include:
S4-1, sequentially inputting the plurality of word feature information into the third submodel according to the word order of the question to be answered for processing, so as to obtain the second memory content corresponding to each word feature information.
Specifically, S4-1 may include:
S4-1-1, determining current word feature information from the plurality of word feature information according to the word order, and acquiring the second target feature information and the second memory content corresponding to the previous word feature information as the second historical feature information and the second historical memory content, respectively.
Specifically, as shown in fig. 7, the question memory model may further include a question memory layer M_q and a hidden layer h^q, where the memory layer M_q = (m_1, m_2, ..., m_S), i.e., the question memory layer M_q has S memory states. In this embodiment, when the current word feature information is the first in word order among the plurality of word feature information (that is, no earlier word feature information exists), the video analysis apparatus may take the initial state parameter value h^q_0 of the hidden layer h^q as the second target feature information corresponding to the "previous" word feature information, and take the initial state parameter value M^q_0 of the memory layer M_q as the second memory content corresponding to the "previous" word feature information. These initial state parameter values can be obtained by pre-training the question memory model.
S4-1-2, inputting the current word feature information, the second historical feature information and the second historical memory content into the third submodel for processing, so that the third submodel determines the second memory content corresponding to the current word feature information.
Specifically, after the second memory content corresponding to the current word feature information is obtained, the state parameter values of the memory layer M_q in the question memory model may be updated with it, so as to store the second memory content in M_q.
S4-1-3, updating the second historical feature information and the second historical memory content with the second target feature information and the second memory content corresponding to the current word feature information respectively, updating the current word feature information with the remaining word feature information, and then returning to step S4-1-2.
In this way, S4-1-2 and S4-1-3 form a loop; each iteration obtains the second memory content corresponding to the updated current word feature information, until the second memory contents corresponding to all the word feature information are obtained.
S4-2, determining, according to the plurality of word feature information, the at least one type of video feature information, the second memory content and the fourth submodel, the second target feature information related to the video to be analyzed from each word feature information.
Specifically, S4-2 may include:
S4-2-1, determining current word feature information from the plurality of word feature information according to the word order, and acquiring the second memory content and the second target feature information corresponding to the previous word feature information as the second historical memory content and the second historical feature information, respectively.
The specific embodiment of S4-2-1 can be found in that of S4-1-1 and is not repeated here.
S4-2-2, inputting the current word feature information, the at least one type of video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing, so that the fourth submodel determines the second target feature information related to the video to be analyzed from the current word feature information.
Specifically, after the second target feature information corresponding to the current word feature information is obtained, the state parameter value of the hidden layer h^q in the question memory model may be updated with this second target feature information, so as to store it in h^q.
S4-2-3, updating the second historical memory content and the second historical feature information with the second memory content and the second target feature information corresponding to the current word feature information respectively, updating the current word feature information with the remaining word feature information, and then returning to step S4-2-2.
In this way, S4-2-2 and S4-2-3 form a loop; each iteration determines the second target feature information related to the video to be analyzed from the updated current word feature information, until the second target feature information corresponding to all the word feature information is obtained.
Updating the current word feature information with the remaining word feature information can be understood as replacing it with the word feature information immediately following it in word order, where the word order is simply the order in which the words appear in the question to be answered. For example, for the question "What is this person doing", the word order is "What", "is", "this", "person", "doing". A sketch of this question-side recurrence follows.
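A combined sketch of the S4-1 and S4-2 recurrences, with `third_submodel` and `fourth_submodel` assumed callables implementing formulas (14) to (16) and (11) to (13) below, respectively.

```python
def run_question_memory(third_submodel, fourth_submodel, word_feats, video_feats, Mq0, hq0):
    """Walk the question's words in order, writing to the question memory
    (third submodel) and selecting the video-relevant part of each word
    feature (fourth submodel); both consume the previous word's states."""
    M_prev, h_prev, targets = Mq0, hq0, []
    for i_q in word_feats:                                        # word features in word order
        h_t = fourth_submodel(i_q, video_feats, M_prev, h_prev)   # S4-2-2: second target features
        M_t = third_submodel(i_q, h_prev, M_prev)                 # S4-1-2: second memory content
        targets.append(h_t)
        M_prev, h_prev = M_t, h_t                                 # become the second historical states
    return targets
```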
As a specific example, in the fourth submodel, the second target feature information h^q_t corresponding to the current word feature information (i.e., the state parameter value of the hidden layer h^q at the current time t) may be obtained through calculation formulas (11) to (13). Here · denotes the inner product and the weights W are learnable. The read weight of formula (12) is determined by the second target feature information h^q_{t-1} corresponding to the previous word feature information and the current word feature information i^q_t input at the current time t; r_t represents the content read from the memory layer M_q of the question memory model, specifically a weighted sum of the multiple memory states of M_q. Then, based on the second target feature information h^q_{t-1} corresponding to the previous word feature information, the currently read content r_t, the current word feature information i^q_t input at the current time t, the dynamic feature information i^m and the static feature information i^a, formula (13) computes the second target feature information h^q_t corresponding to the current word feature information.
Accordingly, in the third submodel, the second memory content corresponding to the current word feature information w_t (i.e., the state parameter value of the memory layer M_q at the current time t) may, for example, be obtained through calculation formulas (14) to (16):

c_t = tanh( W_c [w_t ; h_{t-1}^q] )    (14)

wherein c_t is the content vector determined by the current word feature information w_t input at the current time t and the second target feature information h_{t-1}^q corresponding to the previous word feature information, and W_c is a learnable weight. The content vector c_t is used to calculate the write weight α_{t,i} corresponding to the current word feature information, as shown in calculation formula (15):

α_{t,i} = softmax_i( (W_α [c_t ; h_{t-1}^q]) · M_q[i] )    (15)

That is, the write weights α_{t,i} of all the memory states in the memory layer M_q of the above question memory model depend on the content vector c_t at the current time t and on the second target feature information corresponding to the previous word feature information. Finally, the second memory content corresponding to the current word feature information w_t (i.e., the state parameter value of the memory layer M_q at the current time t) may be calculated through calculation formula (16):

M_q^t[i] = (1 − α_{t,i}) M_q^{t−1}[i] + α_{t,i} c_t ,  i = 1, …, S    (16)

wherein S is the number of memory states contained in the memory layer M_q.
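To make the above recurrence concrete, the following is a minimal NumPy sketch of one step of the question memory model under the example forms of formulas (11) to (16) given above; the function name question_memory_step, the weight names W_r, W_h, W_c, W_α (here W_a) and the concatenation-based parameterization are illustrative assumptions rather than the patented implementation.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_memory_step(w_t, h_prev, M, i_m, i_a, params):
    # One step of the question memory model (formulas (11)-(16)), sketched
    # under the example forms above. Shapes: w_t, h_prev, i_m, i_a: (d,);
    # M (memory layer M_q): (S, d). All weight names are assumptions.
    W_r, W_h, W_c, W_a = params["W_r"], params["W_h"], params["W_c"], params["W_a"]
    # (11) read weights beta_t from the previous hidden state and current word
    beta = softmax(M @ (W_r @ np.concatenate([w_t, h_prev])))          # (S,)
    # (12) content r_t read from M_q: weighted sum of its memory states
    r_t = beta @ M                                                     # (d,)
    # (13) new hidden state h_t^q (second target feature information),
    # conditioned on the cross-modal cues i_m (dynamic) and i_a (static)
    h_t = np.tanh(W_h @ np.concatenate([w_t, r_t, h_prev, i_m, i_a]))  # (d,)
    # (14) content vector c_t used for writing
    c_t = np.tanh(W_c @ np.concatenate([w_t, h_prev]))                 # (d,)
    # (15) write weights alpha_{t,i} over all S memory states
    alpha = softmax(M @ (W_a @ np.concatenate([c_t, h_prev])))         # (S,)
    # (16) erase/add-style update of the memory layer M_q
    M_new = (1.0 - alpha[:, None]) * M + alpha[:, None] * c_t          # (S, d)
    return h_t, M_new

Iterating this step over the word feature information in word order realizes the loop of S4-2-2 and S4-2-3; the video memory model iterates a symmetric step over the video frames in time order.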
It can be understood that the video memory model and the question memory model in the above embodiments are guided by cross-modal information and can respectively perform targeted memorization of a long video and of a long question sentence, which plays an active role in understanding both. For example, as shown in fig. 8, the video to be analyzed is a wrestling match video in which a man first lifts his opponent, then throws him to the ground and punches his face. The memory weight of the video shown in fig. 8 indicates that, under the guidance of the question to be solved "What does the man do after lifting his opponent and before punching his face?", the video memory model stores in a targeted manner the information in the video to be analyzed related to the question (such as "lifting his opponent" and "punches his face"); the deeper the color of a video frame on the video memory weight bar, the more memory content that video frame holds. Accordingly, the memory weight of the question shown in fig. 8 indicates that, under the guidance of the plurality of video frames of the video to be analyzed, the question memory model memorizes in a targeted manner the information in the question to be solved related to the video; the deeper the color of a word on the question memory weight bar, the more memory content that word holds.
Based on the above analysis, the video memory model and the question memory model provided by this embodiment are not limited to the field of video question answering. In some embodiments, they may also be applied in many fields related to video understanding and cross-modal information analysis (e.g., video retrieval, video understanding, and video-text matching). In other embodiments, they may also be applied in fields such as search and recommendation, so as to retrieve video content from a text query, or retrieve the corresponding text from a video.
And S1052, determining answer information corresponding to the question to be answered according to the first target characteristic information and the second target characteristic information.
Wherein, the S1052 may specifically include:
S5-1, obtaining a first target characteristic matrix according to first target characteristic information corresponding to a plurality of video frames, and obtaining a second target characteristic matrix according to second target characteristic information corresponding to a plurality of word characteristic information.
Specifically, when the first target feature information includes target dynamic feature information, target static feature information and target global feature information, the state parameter values of the hidden layer h_v in the video memory model at each time (i.e., the target global feature information corresponding to all video frames) may be concatenated to obtain the first target feature matrix V_id. Accordingly, the state parameter values of the hidden layer h_q in the question memory model at each time (i.e., the second target feature information corresponding to all word feature information) may be concatenated to obtain the second target feature matrix T_ex.
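Continuing the NumPy sketch above, this concatenation amounts to stacking the per-time hidden states row-wise (the list names h_v_states and h_q_states are assumptions):

# Per-time state parameter values collected from the two memory models
V_id = np.stack(h_v_states)  # (num_frames, d): hidden layer h_v over all video frames
T_ex = np.stack(h_q_states)  # (num_words, d): hidden layer h_q over all word features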
S5-2, inputting the first target characteristic matrix into the trained first self-attention model for processing to obtain first semantic remote dependence information of the first target characteristic information, and inputting the second target characteristic matrix into the trained second self-attention model for processing to obtain second semantic remote dependence information of the second target characteristic information.
Specifically, the first and second self-attention models described above may be self-attention models based on the scaled dot-product attention mechanism shown in formula (17):

Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V    (17)

wherein d_k is the dimension of the key vectors.
Where Q, K and V represent query (query), key (key), and value (value), respectively.
In one embodiment, the first self-attention model described above may be as shown in equation (18).
V_0 = Attention(V_id W_q, V_id W_k, V_id W_v)    (18)
The second self-attention model described above may be as shown in formula (19):

T_0 = Attention(T_ex W_q, T_ex W_k, T_ex W_v)    (19)
Where W_q, W_k and W_v are learnable parameters. Specifically, the purpose of the first self-attention model is to extract from the output of the video memory model the semantic remote dependency relationships among the video feature information, so as to obtain the corresponding first semantic remote dependency information; the purpose of the second self-attention model is to extract from the output of the question memory model the semantic remote dependency relationships among the question feature information, so as to obtain the corresponding second semantic remote dependency information. In this way, by using self-attention models based on the scaled dot-product attention mechanism as non-local network models, the global dependency relationships of the feature information output by the memory models can be better extracted. For example, as shown in fig. 8, the first self-attention model and the second self-attention model respectively attend to the global dependency relationships of the feature information output by the video memory model and the question memory model, so as to obtain the self-attention weights of the video and of the question and thereby determine the correct answer "throw"; the deeper the color of a video frame on the video self-attention weight bar, the more important that video frame is, and the deeper the color of a word of the question to be solved on the question self-attention weight bar, the more important that word is.
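As a brief sketch of this self-attention step (the single-head form and the projection-weight names are assumptions), reusing the softmax helper from the sketch above:

def scaled_dot_product_attention(Q, K, V):
    # Formula (17): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def self_attend(X, W_q, W_k, W_v):
    # Formulas (18)/(19): self-attention over a target feature matrix,
    # X = V_id (video) or X = T_ex (question), shape (n, d)
    return scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

# Usage (weight names assumed):
# V_0 = self_attend(V_id, W_q_v, W_k_v, W_v_v)  # first semantic remote dependency info
# T_0 = self_attend(T_ex, W_q_t, W_k_t, W_v_t)  # second semantic remote dependency info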
And S5-3, according to the first semantic remote dependency information and the second semantic remote dependency information, determining answer information corresponding to the question to be answered.
Wherein, the S5-3 may specifically include:
And S5-3-1, fusing the first semantic remote dependency information and the second semantic remote dependency information by using the trained mutual attention model to obtain final characteristic information.
Specifically, the attention A of the video to be analyzed to the question to be solved can be calculated by the following calculation formula (20):

A = softmax(S) T_0    (20)

Also, the attention B of the question to be solved to the video to be analyzed can be calculated by the following calculation formula (21):

B = softmax(S) softmax(Sᵀ) V_0    (21)

wherein S is a weight matrix measuring the similarity between the video feature information and the question feature information, for example S = V_0 W_s T_0ᵀ, where W_s is a learnable parameter. Then, the final feature information O for answer prediction may be generated using the trained mutual attention model as shown in calculation formula (22):

O = Concat(V_0, A, V_0 ⊙ A, V_0 ⊙ B)    (22)

wherein ⊙ denotes the element-wise product and Concat is the splicing (concatenation) function.
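Under the same assumptions (in particular, the bilinear form of S and the softmax axes chosen here are reconstructions, not confirmed by the source), the mutual-attention fusion of formulas (20) to (22) might be sketched as:

def mutual_attention_fusion(V0, T0, W_s):
    # V0: (n, d) video features; T0: (m, d) question features
    S = V0 @ W_s @ T0.T                                   # (n, m) weight matrix
    A = softmax(S, axis=1) @ T0                           # (20) video -> question
    B = softmax(S, axis=1) @ softmax(S.T, axis=1) @ V0    # (21) question -> video
    # (22) final feature information O for answer prediction
    return np.concatenate([V0, A, V0 * A, V0 * B], axis=-1)  # (n, 4d)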
And S5-3-2, inputting the final characteristic information into the trained classification model to obtain an answer corresponding to the question to be solved.
The classification model may be a softmax classifier. In addition, during implementation, the classification model may be optimized with a hinge loss function (when the question to be solved is a multiple-choice question) or a cross-entropy loss function (when the question to be solved is an open-ended question).
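For this classification step, a hedged sketch assuming mean-pooling over the rows of O and a single linear layer (both assumptions): cross-entropy for open-ended questions, while a multiple-choice variant would score each candidate answer and apply a hinge loss instead.

def predict_answer(O, W_cls, b_cls):
    # Pool the final feature information over its rows, then classify
    logits = O.mean(axis=0) @ W_cls + b_cls
    probs = softmax(logits)
    return int(np.argmax(probs)), probs

def cross_entropy_loss(probs, answer_idx):
    # Loss for open-ended questions; small epsilon avoids log(0)
    return -float(np.log(probs[answer_idx] + 1e-12))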
It should be noted that this embodiment illustrates an implementation in which the video is memorized in a targeted manner under the guidance of the question, but the guidance is not limited to the question; in some embodiments, other information related to the video question-answering task may also be used as guidance for memorizing the video. In addition, the video memory model and the question memory model provided in this embodiment can effectively learn global context-aware information from the video to be analyzed and the question to be solved, and both have a larger memory capacity, that is, a stronger storage capability, than existing memory models.
As can be seen from the above, the video analysis method provided by this embodiment obtains a video to be analyzed and a question to be solved related to that video, determines at least one piece of video feature information corresponding to the video and the question feature information corresponding to the question, inputs the at least one piece of video feature information and the question feature information into a trained video memory model for processing so as to determine, from the video feature information, first target feature information related to the question, and then determines the answer information corresponding to the question according to the first target feature information and the question feature information. In this way, when performing semantic understanding and analysis of a video, the video can be memorized in a targeted manner under the guidance of the question, thereby improving the memory effect for long videos.
Based on the method in the foregoing embodiment, the present embodiment will be further described from the perspective of a video analysis apparatus, please refer to fig. 9, where fig. 9 specifically describes the video analysis apparatus provided in the present embodiment, which may include: an obtaining module 610, a first determining module 620, a second determining module 630, a third determining module 640, and a fourth determining module 650, wherein:
(1) acquisition Module 610
The obtaining module 610 is configured to obtain a video to be analyzed and a question to be solved related to the video to be analyzed.
(2) First determination module 620
The first determining module 620 is configured to determine at least one type of video feature information corresponding to a video to be analyzed.
The first determining module 620 specifically includes:
an extraction unit for extracting a plurality of video frames from a video to be analyzed;
the first determining unit is used for determining at least one type of video characteristic information corresponding to each video frame.
(3) Second determination module 630
The second determining module 630 is configured to determine question feature information corresponding to the question to be solved.
(4) Third determining module 640
And a third determining module 640, configured to input at least one of the video feature information and the question feature information into a trained video memory model for processing, so as to determine, from the video feature information, first target feature information related to the question to be solved.
The trained video memory model may include a first sub-model and a second sub-model, and the third determining module 640 may specifically include:
the second determining unit is used for sequentially inputting the at least one type of video characteristic information corresponding to the plurality of video frames into the first submodel according to the time sequence for processing so as to obtain first memory content corresponding to each video frame;
and the third determining unit is used for determining first target characteristic information related to the problem to be solved from the video characteristic information corresponding to each video frame according to at least one video characteristic information, the problem characteristic information, the first memory content and the second submodel corresponding to the plurality of video frames.
Specifically, the third determining unit may be configured to perform:
determining a current video frame from a plurality of video frames according to a time sequence, and acquiring first memory content and first target characteristic information corresponding to a previous video frame as first historical memory content and first historical characteristic information respectively;
inputting at least one video characteristic information, question characteristic information, first historical memory content and first historical characteristic information corresponding to a current video frame into a second submodel for processing, so that the second submodel determines first target characteristic information related to a question to be answered from the at least one video characteristic information corresponding to the current video frame;
and then the third determining unit updates the first memory content and the first target characteristic information corresponding to the current video frame into the first historical memory content and the first historical characteristic information respectively, updates the current video frame with the remaining video frames, and returns to re-execute the inputting of the at least one video characteristic information, the question characteristic information, the first historical memory content and the first historical characteristic information corresponding to the current video frame into the second submodel for processing.
In a specific embodiment, the at least one video feature information may include dynamic feature information and static feature information, the first target feature information may include target dynamic feature information, target static feature information, and target global feature information, and the second sub-model may specifically perform, when determining the first target feature information related to the question to be solved from the at least one video feature information corresponding to the current video frame, the following steps:
according to the dynamic characteristic information, the first historical memory content, the first historical characteristic information and the question characteristic information corresponding to the current video frame, determining target dynamic characteristic information related to the question to be answered from the dynamic characteristic information corresponding to the current video frame;
according to the static characteristic information, the first historical memory content, the first historical characteristic information and the question characteristic information corresponding to the current video frame, determining target static characteristic information related to the question to be answered from the static characteristic information corresponding to the current video frame;
and determining target global feature information corresponding to the current video frame and related to the question to be solved according to the first historical memory content, the first historical feature information and the question feature information.
(5) Fourth determination Module 650
The fourth determining module 650 is configured to determine answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
The fourth determining module 650 may specifically include:
the fourth determining unit is used for inputting at least one type of video characteristic information and problem characteristic information into the trained problem memory model for processing so as to determine second target characteristic information related to the video to be analyzed from the problem characteristic information;
and the fifth determining unit is used for determining answer information corresponding to the question to be answered according to the first target characteristic information and the second target characteristic information.
In an embodiment, the question feature information may contain a plurality of word feature information, the trained question memory model may include a third submodel and a fourth submodel, and the fourth determining unit may specifically include:
the first determining subunit is used for sequentially inputting the plurality of word characteristic information into the third submodel according to the word sequence of the question to be solved and processing the word characteristic information to obtain second memory content corresponding to each word characteristic information;
and the second determining subunit is used for determining second target characteristic information related to the video to be analyzed from each word characteristic information according to the plurality of word characteristic information, the at least one video characteristic information, the second memory content and the fourth submodel.
Specifically, the second determining subunit may be configured to perform:
determining current word characteristic information from the plurality of word characteristic information according to the word sequence, and acquiring second memory content and second target characteristic information corresponding to the previous word characteristic information as second historical memory content and second historical characteristic information respectively;
inputting the current word feature information, at least one video feature information, second historical memory content and second historical feature information into a fourth submodel for processing, so that the fourth submodel determines second target feature information related to the video to be analyzed from the current word feature information;
and respectively updating the second memory content and the second target characteristic information corresponding to the current word characteristic information into the second historical memory content and the second historical characteristic information, updating the current word characteristic information by using the remaining word characteristic information, and then returning to the second determining subunit to re-execute the inputting of the current word characteristic information, the at least one video characteristic information, the second historical memory content and the second historical characteristic information into the fourth submodel for processing.
In another embodiment, the fifth determining unit may specifically include:
the third determining subunit is configured to obtain a first target feature matrix according to the first target feature information corresponding to the multiple video frames, and obtain a second target feature matrix according to the second target feature information corresponding to the multiple word feature information;
the fourth determining subunit is configured to input the first target feature matrix into the trained first self-attention model for processing to obtain first semantic remote dependency information of the first target feature information, and input the second target feature matrix into the trained second self-attention model for processing to obtain second semantic remote dependency information of the second target feature information;
and the fifth determining subunit is used for determining answer information corresponding to the question to be solved according to the first semantic remote dependency information and the second semantic remote dependency information.
In specific implementation, each of the foregoing units, subunits and modules may be implemented as an independent entity, or may be arbitrarily combined and implemented as one or several entities; for the specific implementation of each of them, reference may be made to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, the video analysis apparatus provided by this embodiment includes an obtaining module for obtaining a video to be analyzed and a question to be solved related to the video to be analyzed; a first determining module for determining at least one piece of video feature information corresponding to the video to be analyzed; a second determining module for determining question feature information corresponding to the question to be solved; a third determining module for inputting the at least one piece of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be solved from the video feature information; and a fourth determining module for determining answer information corresponding to the question to be solved according to the first target feature information and the question feature information. In this way, when performing semantic understanding of a video, the video can be memorized in a targeted manner under the guidance of the question, thereby improving the memory effect for long videos.
Correspondingly, an embodiment of the present application further provides a server, where the server may be a single server, or may be a server cluster composed of multiple servers, as shown in fig. 10, which shows a schematic structural diagram of a server according to an embodiment of the present application, and specifically:
the server may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, Radio Frequency (RF) circuitry 403, a power supply 404, an input unit 405, and a display unit 406. Those skilled in the art will appreciate that the server architecture shown in FIG. 10 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the server. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the server, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The RF circuit 403 may be used for receiving and transmitting signals during information transmission and reception, and in particular, for receiving downlink information of a base station and then processing the received downlink information by the one or more processors 401; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 403 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 403 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The server also includes a power supply 404 (e.g., a battery) for powering the various components, and preferably, the power supply 404 is logically connected to the processor 401 via a power management system, so that functions such as managing charging, discharging, and power consumption are performed via the power management system. The power supply 404 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The server may further include an input unit 405, and the input unit 405 may be used to receive input numeric or character information and generate a keyboard, mouse, joystick, optical or trackball signal input in relation to user settings and function control. Specifically, in one particular embodiment, input unit 405 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 401, and can receive and execute commands sent by the processor 401. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 405 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The server may also include a display unit 406, and the display unit 406 may be used to display information input by or provided to the user as well as various graphical user interfaces of the server, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 406 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 401 to determine the type of the touch event, and then the processor 401 provides a corresponding visual output on the display panel according to the type of the touch event. Although in FIG. 10 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
Although not shown, the server may further include a camera, a bluetooth module, etc., which will not be described herein. Specifically, in this embodiment, the processor 401 in the server loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring a video to be analyzed and a problem to be solved related to the video to be analyzed;
determining at least one video characteristic information corresponding to a video to be analyzed;
determining problem characteristic information corresponding to a problem to be solved;
inputting at least one type of video characteristic information and question characteristic information into a trained video memory model for processing so as to determine first target characteristic information related to a question to be solved from the video characteristic information;
and determining answer information corresponding to the question to be answered according to the first target characteristic information and the question characteristic information.
The server can achieve the beneficial effects achievable by any of the video analysis apparatuses provided in the embodiments of the present application; for details, refer to the foregoing embodiments, which are not described herein again.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The video analysis method, apparatus and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and core ideas of the present application. Meanwhile, for those skilled in the art, there may be changes in the specific implementation and application scope according to the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of video analysis, comprising:
acquiring a video to be analyzed and a problem to be solved related to the video to be analyzed;
determining at least one video characteristic information corresponding to the video to be analyzed;
determining question characteristic information corresponding to the question to be solved;
inputting the at least one type of video characteristic information and the question characteristic information into a trained video memory model for processing so as to determine first target characteristic information related to the question to be solved from the video characteristic information;
and determining answer information corresponding to the question to be answered according to the first target characteristic information and the question characteristic information.
2. The video analysis method according to claim 1, wherein the trained video memory model includes a first submodel and a second submodel, and the determining at least one video feature information corresponding to the video to be analyzed specifically includes:
extracting a plurality of video frames from the video to be analyzed;
determining at least one video characteristic information corresponding to each video frame;
the inputting the at least one piece of video feature information and the question feature information into a trained video memory model for processing to determine first target feature information related to the question to be answered from the video feature information specifically includes:
sequentially inputting the at least one video characteristic information corresponding to the plurality of video frames into the first submodel according to a time sequence for processing so as to obtain a first memory content corresponding to each video frame;
and according to the at least one type of video characteristic information, the question characteristic information, the first memory content and the second submodel corresponding to the plurality of video frames, determining first target characteristic information related to the question to be solved from the video characteristic information corresponding to each video frame.
3. The video analysis method according to claim 2, wherein the determining, according to the at least one of the video feature information, the question feature information, the first memory content, and the second submodel corresponding to the plurality of video frames, first target feature information related to the question to be solved from the video feature information corresponding to each of the video frames specifically comprises:
determining a current video frame from the plurality of video frames according to the time sequence, and acquiring first memory content and first target characteristic information corresponding to a previous video frame as first historical memory content and first historical characteristic information respectively;
inputting the at least one video characteristic information, the question characteristic information, the first historical memory content and the first historical characteristic information corresponding to the current video frame into the second submodel for processing, so that the second submodel determines first target characteristic information related to the question to be solved from the at least one video characteristic information corresponding to the current video frame;
updating the first memory content and the first target characteristic information corresponding to the current video frame into the first historical memory content and the first historical characteristic information respectively, updating the current video frame by using the remaining video frames, and then returning to execute the step of inputting the at least one video characteristic information, the question characteristic information, the first historical memory content and the first historical characteristic information corresponding to the current video frame into the second submodel for processing.
4. The video analysis method according to claim 3, wherein the at least one type of video feature information includes dynamic feature information and static feature information, the first target feature information includes target dynamic feature information, target static feature information and target global feature information, and the determining, from the at least one type of video feature information corresponding to the current video frame, first target feature information related to the question to be solved specifically includes:
according to the dynamic characteristic information, the first historical memory content, the first historical characteristic information and the question characteristic information corresponding to the current video frame, determining target dynamic characteristic information related to the question to be answered from the dynamic characteristic information corresponding to the current video frame;
according to the static feature information, the first historical memory content, the first historical feature information and the question feature information corresponding to the current video frame, determining target static feature information related to the question to be answered from the static feature information corresponding to the current video frame;
and determining target global feature information corresponding to the current video frame and related to the question to be solved according to the first historical memory content, the first historical feature information and the question feature information.
5. The video analysis method according to claim 2, wherein the determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information specifically comprises:
inputting the at least one video characteristic information and the problem characteristic information into a trained problem memory model for processing so as to determine second target characteristic information related to the video to be analyzed from the problem characteristic information;
and determining answer information corresponding to the question to be answered according to the first target characteristic information and the second target characteristic information.
6. The video analysis method according to claim 5, wherein the question feature information includes a plurality of word feature information, the trained question memory model includes a third sub-model and a fourth sub-model, and the inputting the at least one of the video feature information and the question feature information into the trained question memory model for processing to determine a second target feature information related to the video to be analyzed from the question feature information specifically includes:
sequentially inputting the plurality of word characteristic information into the third submodel according to the word sequence of the question to be solved for processing to obtain second memory content corresponding to each word characteristic information;
and determining second target characteristic information related to the video to be analyzed from each word characteristic information according to the plurality of word characteristic information, the at least one video characteristic information, the second memory content and the fourth submodel.
7. The video analysis method according to claim 6, wherein the determining, from each of the word feature information, second target feature information related to the video to be analyzed according to the word feature information, the at least one piece of video feature information, the second memory content, and the fourth submodel, specifically comprises:
determining current word feature information from the plurality of word feature information according to the word sequence, and acquiring second memory content and second target feature information corresponding to the previous word feature information as second historical memory content and second historical feature information respectively;
inputting the current word feature information, the at least one video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing, so that the fourth submodel determines second target feature information related to the video to be analyzed from the current word feature information;
and updating the second memory content and the second target characteristic information corresponding to the current word characteristic information into the second historical memory content and the second historical characteristic information respectively, updating the current word characteristic information by using the remaining word characteristic information, and returning to execute the step of inputting the current word characteristic information, the at least one video characteristic information, the second historical memory content and the second historical characteristic information into the fourth submodel for processing.
8. The video analysis method according to claim 6, wherein the determining answer information corresponding to the question to be answered according to the first target feature information and the second target feature information specifically comprises:
obtaining a first target characteristic matrix according to the first target characteristic information corresponding to the plurality of video frames, and obtaining a second target characteristic matrix according to the second target characteristic information corresponding to the plurality of word characteristic information;
inputting the first target characteristic matrix into a trained first self-attention model for processing to obtain first semantic remote dependency information of the first target characteristic information, and inputting the second target characteristic matrix into a trained second self-attention model for processing to obtain second semantic remote dependency information of the second target characteristic information;
and determining answer information corresponding to the question to be answered according to the first semantic remote dependency information and the second semantic remote dependency information.
9. A video analysis apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a video to be analyzed and a problem to be solved related to the video to be analyzed;
the first determination module is used for determining at least one video characteristic information corresponding to the video to be analyzed;
the second determination module is used for determining question characteristic information corresponding to the question to be solved;
a third determining module, configured to input the at least one piece of video feature information and the question feature information into a trained video memory model for processing, so as to determine, from the video feature information, first target feature information related to the question to be solved;
and the fourth determining module is used for determining answer information corresponding to the question to be answered according to the first target characteristic information and the question characteristic information.
10. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded by a processor for performing the steps of the video analysis method according to any one of claims 1 to 8.