CN113392686A - Video analysis method, device and storage medium - Google Patents

Video analysis method, device and storage medium

Info

Publication number: CN113392686A
Application number: CN202011073795.0A
Authority: CN (China)
Prior art keywords: video, characteristic information, question, feature information, target
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 单瀛, 蔡佳音, 袁春
Current Assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Original Assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Application filed by Tsinghua University and Tencent Technology Shenzhen Co Ltd
Priority to CN202011073795.0A
Publication of CN113392686A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a video analysis method, apparatus and storage medium. The video analysis method includes: acquiring a video to be analyzed and a question to be answered related to the video to be analyzed; determining at least one type of video feature information corresponding to the video to be analyzed; determining question feature information corresponding to the question to be answered; inputting the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information; and determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information. In this way, when performing semantic-understanding analysis on a video, the question can guide targeted memorization of the video, improving the memory effect on long-term videos.

Description

Video analysis method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video analysis method, apparatus, and storage medium.
Background
Video question answering (VideoQA) aims at making high-level inferences about the spatio-temporal content of a video and inferring the correct answer to a given video-related question described in natural language.
At present, the typical technical scheme for the video question-answering task is to extract video representation vectors with a trained deep learning model, fuse and memorize the features of the two modalities (video and question) through an attention mechanism or a memory model, and finally generate an answer through a classifier.
However, a conventional memory module memorizes a large amount of video information irrelevant to the question, which results in a poor memory effect on long-term video information.
Disclosure of Invention
The application aims to provide a video analysis method, apparatus and storage medium that improve the memory effect on long-term videos.
An embodiment of the present application provides a video analysis method, which includes the following steps:
acquiring a video to be analyzed and a question to be answered related to the video to be analyzed;
determining at least one type of video feature information corresponding to the video to be analyzed;
determining question feature information corresponding to the question to be answered;
inputting the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information;
and determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
A minimal end-to-end sketch of these steps is given below.
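For orientation, the following is a minimal, non-normative sketch of how the five steps can be wired together. Every helper name below is a hypothetical stub standing in for a component described in the detailed embodiments; none of these names come from the disclosure itself.

```python
# Hypothetical skeleton of the claimed method; each stub stands in for a
# component described in the detailed embodiments further below.

def sample_frames(video_path: str) -> list:
    raise NotImplementedError  # S1021: frame sampling

def extract_video_features(frames: list):
    raise NotImplementedError  # S1022: dynamic (C3D-like) + static (ResNet-like) features

def encode_question(question: str):
    raise NotImplementedError  # S103: word embedding + LSTM encoding

def video_memory_model(video_feats, question_feats):
    raise NotImplementedError  # S104: question-guided video memory

def answer_classifier(target_feats, question_feats) -> str:
    raise NotImplementedError  # S105: fusion + classification

def answer_video_question(video_path: str, question: str) -> str:
    frames = sample_frames(video_path)                 # step 1: acquire video and frames
    video_feats = extract_video_features(frames)       # step 2: video feature information
    question_feats = encode_question(question)         # step 3: question feature information
    target_feats = video_memory_model(video_feats, question_feats)  # step 4: first target features
    return answer_classifier(target_feats, question_feats)          # step 5: answer information
```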
An embodiment of the present application further provides a video analysis apparatus, including:
an acquisition module, configured to acquire a video to be analyzed and a question to be answered related to the video to be analyzed;
a first determining module, configured to determine at least one type of video feature information corresponding to the video to be analyzed;
a second determining module, configured to determine question feature information corresponding to the question to be answered;
a third determining module, configured to input the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information;
and a fourth determining module, configured to determine answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
In some embodiments, the trained video memory model includes a first submodel and a second submodel, and the first determining module specifically includes:
an extraction unit, configured to extract a plurality of video frames from the video to be analyzed;
and a first determining unit, configured to determine at least one type of video feature information corresponding to each video frame.
The third determining module specifically includes:
a second determining unit, configured to sequentially input the at least one type of video feature information corresponding to the plurality of video frames into the first submodel in time order for processing, so as to obtain first memory content corresponding to each video frame;
and a third determining unit, configured to determine, according to the at least one type of video feature information and the question feature information corresponding to the plurality of video frames, the first memory content and the second submodel, the first target feature information related to the question to be answered from the video feature information corresponding to each video frame.
The third determining unit is specifically configured to:
determine a current video frame from the plurality of video frames according to the time order, and acquire the first memory content and the first target feature information corresponding to the previous video frame as the first historical memory content and the first historical feature information, respectively;
input the at least one type of video feature information, the question feature information, the first historical memory content and the first historical feature information corresponding to the current video frame into the second submodel for processing, so that the second submodel determines the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame;
and update the first historical memory content and the first historical feature information with the first memory content and the first target feature information corresponding to the current video frame respectively, update the current video frame with the remaining video frames, and then return to the step of inputting the at least one type of video feature information, the question feature information, the first historical memory content and the first historical feature information corresponding to the current video frame into the second submodel for processing.
In some embodiments, the at least one type of video feature information includes dynamic feature information and static feature information, and the first target feature information includes target dynamic feature information, target static feature information and target global feature information. Determining the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame then specifically includes:
determining, according to the dynamic feature information, the first historical memory content, the first historical feature information and the question feature information corresponding to the current video frame, the target dynamic feature information related to the question to be answered from the dynamic feature information corresponding to the current video frame;
determining, according to the static feature information, the first historical memory content, the first historical feature information and the question feature information corresponding to the current video frame, the target static feature information related to the question to be answered from the static feature information corresponding to the current video frame;
and determining, according to the first historical memory content, the first historical feature information and the question feature information, the target global feature information corresponding to the current video frame and related to the question to be answered.
The fourth determining module specifically includes:
a fourth determining unit, configured to input the at least one type of video feature information and the question feature information into a trained question memory model for processing, so as to determine second target feature information related to the video to be analyzed from the question feature information;
and a fifth determining unit, configured to determine answer information corresponding to the question to be answered according to the first target feature information and the second target feature information.
In some embodiments, the question feature information includes a plurality of word feature information, the trained question memory model includes a third submodel and a fourth submodel, and the fourth determining unit specifically includes:
a first determining subunit, configured to sequentially input the plurality of word feature information into the third submodel according to the word order of the question to be answered for processing, so as to obtain second memory content corresponding to each word feature information;
and a second determining subunit, configured to determine, according to the plurality of word feature information, the at least one type of video feature information, the second memory content and the fourth submodel, the second target feature information related to the video to be analyzed from each word feature information.
The second determining subunit is specifically configured to:
determine current word feature information from the plurality of word feature information according to the word order, and acquire the second memory content and the second target feature information corresponding to the previous word feature information as the second historical memory content and the second historical feature information, respectively;
input the current word feature information, the at least one type of video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing, so that the fourth submodel determines the second target feature information related to the video to be analyzed from the current word feature information;
and update the second historical memory content and the second historical feature information with the second memory content and the second target feature information corresponding to the current word feature information respectively, update the current word feature information with the remaining word feature information, and then return to the step of inputting the current word feature information, the at least one type of video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing.
The fifth determining unit specifically includes:
a third determining subunit, configured to obtain a first target feature matrix from the first target feature information corresponding to the plurality of video frames, and a second target feature matrix from the second target feature information corresponding to the plurality of word feature information;
a fourth determining subunit, configured to input the first target feature matrix into a trained first self-attention model for processing to obtain first semantic long-range dependency information of the first target feature information, and to input the second target feature matrix into a trained second self-attention model for processing to obtain second semantic long-range dependency information of the second target feature information;
and a fifth determining subunit, configured to determine answer information corresponding to the question to be answered according to the first semantic long-range dependency information and the second semantic long-range dependency information.
An illustrative sketch of the fifth determining unit is given below.
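A possible sketch of the fifth determining unit, assuming PyTorch's nn.MultiheadAttention for the two self-attention models; the pooling and fusion details (mean-pooling plus concatenation here) are not specified in this summary and are assumptions.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Self-attention over the two target feature matrices, then a linear
    answer classifier; mean-pool + concat fusion is an assumed detail."""
    def __init__(self, d: int = 512, num_answers: int = 1000, heads: int = 8):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(d, heads, batch_first=True)     # first self-attention model
        self.question_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # second self-attention model
        self.classifier = nn.Linear(2 * d, num_answers)

    def forward(self, video_targets: torch.Tensor, question_targets: torch.Tensor) -> torch.Tensor:
        # video_targets: (1, N, d) first target feature matrix
        # question_targets: (1, T, d) second target feature matrix
        v, _ = self.video_attn(video_targets, video_targets, video_targets)          # long-range deps in video
        q, _ = self.question_attn(question_targets, question_targets, question_targets)
        fused = torch.cat([v.mean(dim=1), q.mean(dim=1)], dim=-1)                    # (1, 2d)
        return self.classifier(fused)                                                # answer logits
```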
An embodiment of the present application further provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to execute any one of the above video analysis methods.
An embodiment of the present application further provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above video analysis methods when executing the computer program.
According to the video analysis method, apparatus and storage medium provided above, a video to be analyzed and a question to be answered related to it are acquired; at least one type of video feature information corresponding to the video and question feature information corresponding to the question are determined; the video feature information and the question feature information are input into a trained video memory model for processing, so that first target feature information related to the question is determined from the video feature information; and answer information corresponding to the question is determined from the first target feature information and the question feature information. In this way, when performing semantic-understanding analysis on a video, the question guides targeted memorization of the video, improving the memory effect on long-term videos.
Drawings
The technical solutions and other advantages of the present application will become apparent from the following detailed description of its embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic view of a scene of a video analysis system provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a video analysis method provided in an embodiment of the present application;
FIG. 3 is a screenshot of a video to be analyzed according to an embodiment of the present application;
fig. 4 is another schematic flow chart of a video analysis method provided in an embodiment of the present application;
fig. 5 is a schematic execution flow diagram of a video analysis method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a video memory model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a problem memory model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating the effect of performing targeted memory on a video to be analyzed and a question to be answered according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a video analysis apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial intelligence is a theory, method, technique and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of technologies at both the hardware and the software level. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The video analysis method provided by the embodiments of the present application performs semantic-understanding analysis on video content through computer vision technology. Computer vision is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, track and measure targets, and further processing the resulting images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of capturing information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, in particular to a video analysis method, a video analysis device and a storage medium.
Referring to fig. 1, fig. 1 is a schematic view of a scene of a video analysis system according to an embodiment of the present disclosure, where the video analysis system may include any one of the video analysis devices according to the embodiment of the present disclosure, and the video analysis device may be specifically integrated in a server, such as a video server, where the server may be a single server or a server cluster composed of multiple servers.
The server can acquire a video to be analyzed and a question to be answered related to the video to be analyzed; determine at least one type of video feature information corresponding to the video to be analyzed; determine question feature information corresponding to the question to be answered; input the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information; and determine answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
In addition, the video analysis system may further include a terminal connected to the server via a network, where the terminal may be a device with a video playing function, such as a smartphone, a tablet computer, a smart Bluetooth device, a notebook computer, or a personal computer (PC).
Specifically, as shown in fig. 1, the terminal may play a video V1 and, during playback, receive a question input by the user for the played video V1 (for example, "What is the man doing after lifting his opponent and before punching his face?") to trigger the server to analyze the played video V1; the terminal may then receive the answer sent by the server (for example, "throw") and provide it to the user.
As shown in fig. 2, which is a schematic flowchart of the video analysis method provided in the embodiment of the present application, the specific flow of the video analysis method may be as follows:
S101, acquiring a video to be analyzed and a question to be answered related to the video to be analyzed.
Here, the question to be answered may be a question input on the terminal for the video to be analyzed while the user watches that video; specifically, it may be a question described in natural language. For example, as shown in fig. 3, for the video to be analyzed V2, the related question may be "What is the woman holding?". The video to be analyzed may be a complete existing video stored locally on the terminal or on the server. It can be understood that the video to be analyzed contains content related to the answer, that is, semantic understanding of the video through computer vision technology can yield the answer corresponding to the question. For example, as shown in fig. 3 (an image of one frame of the video to be analyzed V2), the answer corresponding to the question "What is the woman holding?" is "cat".
S102, determining at least one type of video feature information corresponding to the video to be analyzed.
As shown in fig. 4, S102 may specifically include:
S1021, extracting a plurality of video frames from the video to be analyzed.
Specifically, since a video contains a very large number of frames, processing every frame easily introduces unnecessary redundancy and excessive computation. Therefore, the video analysis apparatus may extract a plurality of video frames from the video to be analyzed at a preset time interval (e.g., 1 second) or a preset frame interval (e.g., 60 frames), where each video frame is a still image. For example, for a video of 3 minutes total duration, the video analysis apparatus may extract one frame every second, starting from the first frame, to obtain 181 video frames. It can be understood that the sampling time interval or frame interval should be chosen appropriately, improving the efficiency of video analysis without compromising the semantic-understanding accuracy of the video. A minimal sketch of this sampling step follows.
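A minimal sketch of the sampling step, assuming OpenCV (cv2) is available; the 1-second interval matches the example above.

```python
import cv2

def sample_frames(video_path: str, interval_sec: float = 1.0) -> list:
    """Extract one still frame every `interval_sec` seconds, starting from
    the first frame (a 3-minute video at 1 s intervals yields 181 frames)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0       # fall back if FPS metadata is missing
    step = max(int(round(fps * interval_sec)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                     # keep the first frame, then every step-th
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```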
S1022, determining at least one type of video feature information corresponding to each video frame.
Specifically, the at least one type of video feature information may include dynamic feature information and static feature information, where the static feature information characterizes a video frame itself, and the dynamic feature information characterizes the dynamics of the process from another video frame to this video frame in the video to be analyzed.
In a specific embodiment, as shown in fig. 4, S1022 may include:
S1-1, extracting original dynamic feature information from each video frame by using a dynamic feature extraction network, and encoding the original dynamic feature information to obtain the corresponding dynamic feature information.
Specifically, as shown in fig. 5, the video analysis apparatus may input the plurality of video frames F extracted from the video to be analyzed into a dynamic feature extraction network (e.g., a trained C3D network, i.e., a 3-dimensional convolutional network) for processing, so as to obtain the original dynamic feature information f^m_t corresponding to each video frame, and sort the original dynamic feature information of the plurality of video frames in time order to obtain the original dynamic feature sequence F_m = (f^m_1, f^m_2, ..., f^m_N) of the video to be analyzed. Then, the video analysis apparatus may input the sequence F_m into a first encoder B1 composed of a long short-term memory network (LSTM) for encoding, so as to obtain the dynamic feature sequence I_m = (i^m_1, i^m_2, ..., i^m_N) corresponding to the video to be analyzed, where the t-th video frame corresponds to the t-th dynamic feature information i^m_t in I_m. Here N equals the number of sampled frames, the superscript m denotes the extracted dynamic (motion) features of the video, and t is a natural number in the interval [1, N].
S1-2, extracting original static feature information from each video frame by using a static feature extraction network, and encoding the original static feature information to obtain the corresponding static feature information.
Specifically, as shown in fig. 5, the video analysis apparatus may input the plurality of video frames F extracted from the video to be analyzed into a static feature extraction network (e.g., a trained ResNet network, i.e., a residual network, or a VGG network, i.e., a deep convolutional neural network) for processing, so as to obtain the original static feature information f^a_t corresponding to each video frame, and sort the original static feature information of the plurality of video frames in time order to obtain the original static feature sequence F_a = (f^a_1, f^a_2, ..., f^a_N) of the video to be analyzed. Then, the video analysis apparatus may input the sequence F_a into a second encoder B2 composed of an LSTM for encoding, so as to obtain the static feature sequence I_a = (i^a_1, i^a_2, ..., i^a_N) corresponding to the video to be analyzed, where the t-th video frame corresponds to the t-th static feature information i^a_t in I_a. Here the superscript a denotes the extracted static (appearance) features. A sketch of these two visual branches follows.
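A sketch of the two visual branches, assuming PyTorch and torchvision. torchvision ships no C3D network, so the R3D-18 video model stands in for the dynamic branch here, with ResNet-18 for the static branch; the encoders B1 and B2 are plain LSTMs as described.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.models.video import r3d_18  # stand-in for the C3D network

class VisualEncoder(nn.Module):
    """Per-frame static features and per-clip dynamic features, each encoded
    by its own LSTM (the first and second encoders B1 and B2)."""
    def __init__(self, d: int = 512):
        super().__init__()
        static = resnet18(weights="DEFAULT")
        self.static_net = nn.Sequential(*list(static.children())[:-1])   # 512-d per frame
        motion = r3d_18(weights="DEFAULT")
        self.dynamic_net = nn.Sequential(*list(motion.children())[:-1])  # 512-d per clip
        self.enc_dynamic = nn.LSTM(512, d, batch_first=True)  # B1: I_m = (i^m_1..i^m_N)
        self.enc_static = nn.LSTM(512, d, batch_first=True)   # B2: I_a = (i^a_1..i^a_N)

    def forward(self, frames: torch.Tensor, clips: torch.Tensor):
        # frames: (N, 3, H, W) sampled stills; clips: (N, 3, T, H, W) short clips around them
        f_a = self.static_net(frames).flatten(1).unsqueeze(0)                  # F_a: (1, N, 512)
        f_m = torch.stack([self.dynamic_net(c.unsqueeze(0)).flatten()
                           for c in clips]).unsqueeze(0)                       # F_m: (1, N, 512)
        i_m, _ = self.enc_dynamic(f_m)   # dynamic feature sequence I_m
        i_a, _ = self.enc_static(f_a)    # static feature sequence I_a
        return i_m.squeeze(0), i_a.squeeze(0)                                  # (N, d) each
```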
S103, determining question feature information corresponding to the question to be answered.
As shown in fig. 4, S103 may specifically include:
S1031, converting each word in the question to be answered into a corresponding word vector, so as to obtain a question embedding representation corresponding to the question to be answered.
Specifically, as shown in fig. 5, the video analysis apparatus may convert the question to be answered into a word sequence C = (c_1, c_2, ..., c_T) according to its word order, where the t-th word of the question corresponds to the t-th element c_t of C. Next, an embedding layer may map each word c_t to its semantic representation, initialized with a trained GloVe model, so as to obtain a 300-dimensional word vector q_t and thus the question embedding representation Q = (q_1, q_2, ..., q_T) corresponding to the question to be answered. Here T is the number of words in the question, and t is a natural number in the interval [1, T].
S1032, encoding the question embedding representation to obtain the question feature information corresponding to the question to be answered.
Specifically, the question embedding representation Q may be input into a third encoder B3 composed of a long short-term memory network for encoding, so as to obtain the question feature information I_q = (i^q_1, i^q_2, ..., i^q_T) corresponding to the question to be answered. It can be understood that I_q contains a plurality of word feature information i^q_t, where i^q_t corresponds to the t-th word of the question; T is the number of words in the question, and t is a natural number in the interval [1, T]. A sketch of this question encoder follows.
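A sketch of steps S1031 and S1032, assuming PyTorch; loading actual GloVe vectors is outside this snippet, so the GloVe initialization is modeled as an optional weight copy.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Maps each word to a 300-d vector (GloVe-initialized in the text) and
    encodes the sequence with an LSTM (the third encoder B3)."""
    def __init__(self, vocab_size: int, d: int = 512, glove_weights: torch.Tensor | None = None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        if glove_weights is not None:                 # optional GloVe initialization
            self.embed.weight.data.copy_(glove_weights)
        self.enc = nn.LSTM(300, d, batch_first=True)  # third encoder B3

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (T,) token indices for the question's T words
        q = self.embed(word_ids).unsqueeze(0)         # question embedding Q: (1, T, 300)
        i_q, _ = self.enc(q)                          # word feature sequence I_q
        return i_q.squeeze(0)                         # (T, d)
```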
S104, inputting the at least one type of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be answered from the video feature information.
The at least one type of video feature information may include the dynamic feature information i^m_t and the static feature information i^a_t, each obtained by extracting original video feature information from the corresponding video frame with the corresponding feature extraction network and encoding it. The question feature information may include a plurality of word feature information i^q_t, each obtained by converting the corresponding word of the question to be answered into a word vector and encoding it. The trained video memory model may include a first submodel and a second submodel connected to the first submodel, and S104 may specifically include:
s1041, sequentially inputting at least one video characteristic information corresponding to a plurality of video frames into the first sub-model according to a time sequence for processing, so as to obtain a first memory content corresponding to each video frame.
Specifically, S1041 may specifically include:
s2-1, determining a current video frame from a plurality of video frames according to the time sequence, and acquiring first target characteristic information and first memory content corresponding to the previous video frame as first historical characteristic information and first historical memory content respectively.
Specifically, as shown in fig. 6, the video memory model may further include a video memory layer M_v and at least one hidden layer h^m / h^a / h^v, where the memory layer M_v = (m_1, m_2, ..., m_S), i.e., the video memory layer M_v has S memory states. In this embodiment, when the current video frame is the first video frame in time order among the plurality of video frames extracted from the video to be analyzed (that is, no earlier video frame exists), the video analysis apparatus may take the initial state parameter values h^m_0, h^a_0 and h^v_0 of the hidden layers h^m, h^a and h^v as the first target feature information corresponding to the "previous" video frame, and take the initial state parameter value M_0 of the memory layer M_v as the first memory content corresponding to the "previous" video frame. The initial state parameter values of the memory layer M_v and the hidden layers h^m, h^a and h^v can be obtained by pre-training the video memory model.
S2-2, inputting the at least one type of video feature information, the first historical feature information and the first historical memory content corresponding to the current video frame into the first submodel for processing, so that the first submodel determines the first memory content corresponding to the current video frame.
Specifically, after the first memory content corresponding to the current video frame is obtained, the state parameter values of the memory layer M_v in the video memory model may be updated with this first memory content, so as to store it in M_v.
S2-3, updating the first historical feature information and the first historical memory content with the first target feature information and the first memory content corresponding to the current video frame respectively, updating the current video frame with the remaining video frames, and then returning to step S2-2.
In this way, S2-2 and S2-3 form a loop; each iteration obtains the first memory content corresponding to the updated current video frame, until the first memory contents corresponding to all the video frames are obtained. A sketch of this recurrence is given below.
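A sketch of the S2-1 to S2-3 recurrence; `first_submodel` is assumed to be a callable implementing the memory-content computation described below (formulas (5) to (10)).

```python
def run_first_submodel(first_submodel, video_feats, M0, h0):
    """Walk the sampled frames in time order, feeding each frame's features
    plus the previous frame's states, and collect one memory state per frame."""
    memories, M_prev, h_prev = [], M0, h0
    for feats in video_feats:                    # one entry per frame, in time order
        M_t = first_submodel(feats, h_prev, M_prev)
        memories.append(M_t)
        M_prev = M_t                             # becomes the first historical memory content
        # h_prev is refreshed by the second submodel's output in the full model (S3-2)
    return memories
```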
S1042, determining, according to the at least one type of video feature information and the question feature information corresponding to the plurality of video frames, the first memory content and the second submodel, the first target feature information related to the question to be answered from the video feature information corresponding to each video frame.
Specifically, S1042 may include:
S3-1, determining a current video frame from the plurality of video frames according to the time order, and acquiring the first memory content and the first target feature information corresponding to the previous video frame as the first historical memory content and the first historical feature information, respectively.
The specific embodiment of step S3-1 can be found in that of step S2-1 and is not repeated here.
S3-2, inputting the at least one type of video feature information, the question feature information, the first historical memory content and the first historical feature information corresponding to the current video frame into the second submodel for processing, so that the second submodel determines the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame.
Specifically, after the first target feature information corresponding to the current video frame is obtained, the state parameter values of the hidden layers h^m / h^a / h^v in the video memory model may be updated with this first target feature information, so as to store it in the at least one hidden layer.
S3-3, updating the first historical memory content and the first historical feature information with the first memory content and the first target feature information corresponding to the current video frame respectively, updating the current video frame with the remaining video frames, and then returning to step S3-2.
In this way, S3-2 and S3-3 form a loop; each iteration determines the first target feature information related to the question to be answered from the video feature information corresponding to the updated current video frame, until the first target feature information corresponding to all the video frames is obtained.
Updating the current video frame with the remaining video frames can be understood as replacing the current video frame with the video frame immediately following it in time order among the remaining video frames.
In a specific embodiment, the at least one type of video feature information may include dynamic feature information and static feature information; accordingly, the first target feature information includes target dynamic feature information, target static feature information and target global feature information. The second submodel determining the first target feature information related to the question to be answered from the at least one type of video feature information corresponding to the current video frame may specifically include:
determining, according to the dynamic feature information, the first historical memory content, the question feature information and the first historical feature information corresponding to the current video frame, the target dynamic feature information related to the question to be answered from the dynamic feature information corresponding to the current video frame;
determining, according to the static feature information, the first historical memory content, the question feature information and the first historical feature information corresponding to the current video frame, the target static feature information related to the question to be answered from the static feature information corresponding to the current video frame;
and determining, according to the first historical memory content, the question feature information and the first historical feature information, the target global feature information corresponding to the current video frame and related to the question to be answered.
Specifically, after the target dynamic feature information, the target static feature information and the target global feature information corresponding to the current video frame are obtained, the state parameter values of the hidden layers h^m, h^a and h^v in the video memory model may be updated with them respectively, so that the target dynamic feature information is stored in the hidden layer h^m, the target static feature information in the hidden layer h^a, and the target global feature information in the hidden layer h^v. The target global feature information may be used to represent the fused semantic information of the dynamic and static feature information of the video to be analyzed.
Correspondingly, the first submodel determining the first memory content corresponding to the current video frame may specifically include:
determining the dynamic feature memory content corresponding to the current video frame according to the target dynamic feature information contained in the first historical feature information and the dynamic feature information corresponding to the current video frame;
determining the static feature memory content corresponding to the current video frame according to the target static feature information contained in the first historical feature information and the static feature information corresponding to the current video frame;
and determining the first memory content corresponding to the current video frame according to the dynamic feature memory content and the static feature memory content corresponding to the current video frame, the target global feature information contained in the first historical feature information, and the first historical memory content.
As a specific example, in the second submodel, the target dynamic feature information, the target static feature information and the target global feature information corresponding to the current video frame may be obtained through calculation formulas (1) to (4), of which formula (1) is:

r_t = β_t · M_{t-1}    (1)

where · denotes the inner product and the weights W are learnable. β_t is the read weight determined by formula (2) from the target dynamic feature information h^m_{t-1}, the target static feature information h^a_{t-1} and the target global feature information h^v_{t-1} corresponding to the previous video frame; FC denotes a fully-connected layer using tanh (the hyperbolic tangent function) as the nonlinear activation function, and in one embodiment the dimension d may be 512. r_t represents the content read from the memory layer M_v of the video memory model, specifically a weighted sum of the multiple memory states of M_v. Then, based on the currently read content r_t, the currently input dynamic feature information i^m_t, static feature information i^a_t and question feature information i^q, formulas (3) and (4) compute the target dynamic feature information h^m_t (i.e., the state parameter value of the hidden layer h^m at the current time t), the target static feature information h^a_t (the hidden layer h^a at time t) and the target global feature information h^v_t (the hidden layer h^v at time t); σ denotes the sigmoid function. In this embodiment, by including the question guidance in the state-parameter updates of the hidden layers at each time step, the question (text)-enhanced video memory module stores the video content most relevant to the question, thereby improving the storage efficiency of the video information. A sketch of the read step, under stated assumptions, follows.
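Only formula (1) survives extraction intact; the read-weight computation of formula (2) is therefore reconstructed here in a plausible but assumed form: FC plus tanh over the concatenated previous hidden states, dotted against each memory slot, then a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryRead(nn.Module):
    """r_t = beta_t . M_{t-1}  (formula (1)). The exact form of the read
    weight beta_t is an assumption; only its inputs are given in the text."""
    def __init__(self, d: int = 512):
        super().__init__()
        self.fc = nn.Linear(3 * d, d)  # learnable weights W; tanh activation as described

    def forward(self, h_m, h_a, h_v, M_prev):
        # h_m, h_a, h_v: (d,) previous hidden states; M_prev: (S, d) memory states
        key = torch.tanh(self.fc(torch.cat([h_m, h_a, h_v])))  # FC + tanh read key
        beta = F.softmax(M_prev @ key, dim=0)                  # (S,) read weights beta_t
        return beta @ M_prev                                   # r_t: weighted sum of slots
```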
Correspondingly, in the first submodel, the first memory content M_t corresponding to the current video frame (i.e., the state parameter value of the memory layer M_v at the current time t) can be obtained through calculation formulas (5) to (10). Here c_t is a content vector determined by formulas (5) and (6) from the target dynamic feature information h^m_{t-1} and the target static feature information h^a_{t-1} corresponding to the previous video frame, where W is a learnable parameter and b is a bias. The content vector c_t is used to compute the write weights corresponding to the current video frame. When updating the state parameter values of the memory layer M_v, it is necessary to consider the respective proportions of the dynamic feature information and the static feature information of the video to be analyzed, as given by formula (7): a weight between 0 and 1 obtained from c_t via a softmax function. It is also necessary to consider how much of the first memory content corresponding to the previous video frame (i.e., the state parameter value of M_v at the previous time t-1) should be preserved at the current time t; this ratio is μ, a weight between 0 and 1 derived through a softmax function from g, where g is determined by the target global feature information h^v_{t-1} corresponding to the previous video frame and the content vector c_t of formulas (5) and (6), as given by formulas (8) and (9). Finally, the first memory content M_t corresponding to the current video frame can be calculated by formula (10). A correspondingly hedged sketch of this write step follows.
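Since the bodies of formulas (5) to (10) did not survive extraction, the following sketch only mirrors the described structure (content vector c_t, write weights, retention ratio μ); the exact gating forms, including the use of sigmoid in place of the stated softmax for the scalar μ and the blending of old and new memory, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryWrite(nn.Module):
    """Assumed reconstruction of the write step (formulas (5) to (10)):
    blend the previous memory M_{t-1} with new content written per slot."""
    def __init__(self, d: int = 512, S: int = 8):
        super().__init__()
        self.to_content = nn.Linear(2 * d, d)  # c_t from [h^m_{t-1}, h^a_{t-1}]; W and bias b
        self.to_write = nn.Linear(d, S)        # per-slot write weights from c_t (formula (7))
        self.to_retain = nn.Linear(2 * d, 1)   # retention ratio mu from g = f(h^v_{t-1}, c_t)

    def forward(self, h_m, h_a, h_v, M_prev):
        c_t = torch.tanh(self.to_content(torch.cat([h_m, h_a])))   # content vector c_t
        w = F.softmax(self.to_write(c_t), dim=0)                   # (S,) write weights in (0, 1)
        mu = torch.sigmoid(self.to_retain(torch.cat([h_v, c_t])))  # retention in (0, 1); assumed sigmoid
        return mu * M_prev + (1.0 - mu) * torch.outer(w, c_t)      # M_t: (S, d)
```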
S105, determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
As shown in fig. 4, S105 may specifically include:
S1051, inputting the at least one type of video feature information and the question feature information into a trained question memory model for processing, so as to determine second target feature information related to the video to be analyzed from the question feature information.
The question feature information may include a plurality of word feature information, the trained question memory model may include a third submodel and a fourth submodel connected to the third submodel, and S1051 may specifically include:
S4-1, sequentially inputting the plurality of word feature information into the third submodel according to the word order of the question to be answered for processing, so as to obtain the second memory content corresponding to each word feature information.
Specifically, S4-1 may include:
S4-1-1, determining current word feature information from the plurality of word feature information according to the word order, and acquiring the second target feature information and the second memory content corresponding to the previous word feature information as the second historical feature information and the second historical memory content, respectively.
Specifically, as shown in fig. 7, the question memory model may further include a question memory layer M_q and a hidden layer h^q, where the memory layer M_q = (m_1, m_2, ..., m_S), i.e., the question memory layer M_q has S memory states. In this embodiment, when the current word feature information is the first in word order among the plurality of word feature information (that is, no earlier word feature information exists), the video analysis apparatus may take the initial state parameter value h^q_0 of the hidden layer h^q as the second target feature information corresponding to the "previous" word feature information, and take the initial state parameter value M^q_0 of the memory layer M_q as the second memory content corresponding to the "previous" word feature information. These initial state parameter values can be obtained by pre-training the question memory model.
S4-1-2, inputting the current word feature information, the second historical feature information and the second historical memory content into the third submodel for processing, so that the third submodel determines the second memory content corresponding to the current word feature information.
Specifically, after the second memory content corresponding to the current word feature information is obtained, the state parameter values of the memory layer M_q in the question memory model may be updated with it, so as to store the second memory content in M_q.
S4-1-3, updating the second historical feature information and the second historical memory content with the second target feature information and the second memory content corresponding to the current word feature information respectively, updating the current word feature information with the remaining word feature information, and then returning to step S4-1-2.
In this way, S4-1-2 and S4-1-3 form a loop; each iteration obtains the second memory content corresponding to the updated current word feature information, until the second memory contents corresponding to all the word feature information are obtained.
S4-2, determining, according to the plurality of word feature information, the at least one type of video feature information, the second memory content and the fourth submodel, the second target feature information related to the video to be analyzed from each word feature information.
Specifically, S4-2 may include:
S4-2-1, determining current word feature information from the plurality of word feature information according to the word order, and acquiring the second memory content and the second target feature information corresponding to the previous word feature information as the second historical memory content and the second historical feature information, respectively.
The specific embodiment of S4-2-1 can be found in that of S4-1-1 and is not repeated here.
S4-2-2, inputting the current word feature information, the at least one type of video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing, so that the fourth submodel determines the second target feature information related to the video to be analyzed from the current word feature information.
Specifically, after the second target feature information corresponding to the current word feature information is obtained, the state parameter value of the hidden layer h^q in the question memory model may be updated with this second target feature information, so as to store it in h^q.
S4-2-3, updating the second historical memory content and the second historical feature information with the second memory content and the second target feature information corresponding to the current word feature information respectively, updating the current word feature information with the remaining word feature information, and then returning to step S4-2-2.
In this way, S4-2-2 and S4-2-3 form a loop; each iteration determines the second target feature information related to the video to be analyzed from the updated current word feature information, until the second target feature information corresponding to all the word feature information is obtained.
Updating the current word feature information with the remaining word feature information can be understood as replacing it with the word feature information immediately following it in word order, where the word order is simply the order in which the words appear in the question to be answered. For example, for the question "What is this person doing", the word order is "What", "is", "this", "person", "doing". A sketch of this question-side recurrence follows.
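A combined sketch of the S4-1 and S4-2 recurrences, with `third_submodel` and `fourth_submodel` assumed callables implementing formulas (14) to (16) and (11) to (13) below, respectively.

```python
def run_question_memory(third_submodel, fourth_submodel, word_feats, video_feats, Mq0, hq0):
    """Walk the question's words in order, writing to the question memory
    (third submodel) and selecting the video-relevant part of each word
    feature (fourth submodel); both consume the previous word's states."""
    M_prev, h_prev, targets = Mq0, hq0, []
    for i_q in word_feats:                                        # word features in word order
        h_t = fourth_submodel(i_q, video_feats, M_prev, h_prev)   # S4-2-2: second target features
        M_t = third_submodel(i_q, h_prev, M_prev)                 # S4-1-2: second memory content
        targets.append(h_t)
        M_prev, h_prev = M_t, h_t                                 # become the second historical states
    return targets
```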
As a specific example, in the fourth submodel, the second target feature information h^q_t corresponding to the current word feature information (i.e., the state parameter value of the hidden layer h^q at the current time t) may be obtained through calculation formulas (11) to (13). Here · denotes the inner product and the weights W are learnable. The read weight of formula (12) is determined by the second target feature information h^q_{t-1} corresponding to the previous word feature information and the current word feature information i^q_t input at the current time t; r_t represents the content read from the memory layer M_q of the question memory model, specifically a weighted sum of the multiple memory states of M_q. Then, based on the second target feature information h^q_{t-1} corresponding to the previous word feature information, the currently read content r_t, the current word feature information i^q_t input at the current time t, the dynamic feature information i^m and the static feature information i^a, formula (13) computes the second target feature information h^q_t corresponding to the current word feature information.
Accordingly, in the third submodel, the second memory content corresponding to the current word feature information w_t (i.e., the state parameter value of the memory layer M_q at the current time t) may, for example, be obtained through calculation formulas (14) to (16):

c_t = tanh( W_c [w_t ; h_{t-1}^q] )    (14)

wherein c_t is the content vector determined by the current word feature information w_t input at the current time t and the second target feature information h_{t-1}^q corresponding to the previous word feature information, and W_c is a learnable weight. The content vector c_t is used to calculate the write weight α_{t,i} corresponding to the current word feature information, as shown in calculation formula (15):

α_{t,i} = softmax_i( (W_α [c_t ; h_{t-1}^q]) · M_q[i] )    (15)

That is, the write weights α_{t,i} of all the memory states in the memory layer M_q of the above question memory model depend on the content vector c_t at the current time t and on the second target feature information corresponding to the previous word feature information. Finally, the second memory content corresponding to the current word feature information w_t (i.e., the state parameter value of the memory layer M_q at the current time t) may be calculated through calculation formula (16):

M_q^t[i] = (1 − α_{t,i}) M_q^{t−1}[i] + α_{t,i} c_t ,  i = 1, …, S    (16)

wherein S is the number of memory states contained in the memory layer M_q.
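To make the above recurrence concrete, the following is a minimal NumPy sketch of one step of the question memory model under the example forms of formulas (11) to (16) given above; the function name question_memory_step, the weight names W_r, W_h, W_c, W_α (here W_a) and the concatenation-based parameterization are illustrative assumptions rather than the patented implementation.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_memory_step(w_t, h_prev, M, i_m, i_a, params):
    # One step of the question memory model (formulas (11)-(16)), sketched
    # under the example forms above. Shapes: w_t, h_prev, i_m, i_a: (d,);
    # M (memory layer M_q): (S, d). All weight names are assumptions.
    W_r, W_h, W_c, W_a = params["W_r"], params["W_h"], params["W_c"], params["W_a"]
    # (11) read weights beta_t from the previous hidden state and current word
    beta = softmax(M @ (W_r @ np.concatenate([w_t, h_prev])))          # (S,)
    # (12) content r_t read from M_q: weighted sum of its memory states
    r_t = beta @ M                                                     # (d,)
    # (13) new hidden state h_t^q (second target feature information),
    # conditioned on the cross-modal cues i_m (dynamic) and i_a (static)
    h_t = np.tanh(W_h @ np.concatenate([w_t, r_t, h_prev, i_m, i_a]))  # (d,)
    # (14) content vector c_t used for writing
    c_t = np.tanh(W_c @ np.concatenate([w_t, h_prev]))                 # (d,)
    # (15) write weights alpha_{t,i} over all S memory states
    alpha = softmax(M @ (W_a @ np.concatenate([c_t, h_prev])))         # (S,)
    # (16) erase/add-style update of the memory layer M_q
    M_new = (1.0 - alpha[:, None]) * M + alpha[:, None] * c_t          # (S, d)
    return h_t, M_new

Iterating this step over the word feature information in word order realizes the loop of S4-2-2 and S4-2-3; the video memory model iterates a symmetric step over the video frames in time order.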
It can be understood that the video memory model and the question memory model in the above embodiments are guided by cross-modal information and can respectively perform targeted memorization of a long video and of a long question sentence, which plays an active role in understanding both. For example, as shown in fig. 8, the video to be analyzed is a wrestling match video in which a man first lifts his opponent, then throws him to the ground and punches his face. The memory weight of the video shown in fig. 8 indicates that, under the guidance of the question to be solved "What does the man do after lifting his opponent and before punching his face?", the video memory model stores in a targeted manner the information in the video to be analyzed related to the question (such as "lifting his opponent" and "punches his face"); the deeper the color of a video frame on the video memory weight bar, the more memory content that video frame holds. Accordingly, the memory weight of the question shown in fig. 8 indicates that, under the guidance of the plurality of video frames of the video to be analyzed, the question memory model memorizes in a targeted manner the information in the question to be solved related to the video; the deeper the color of a word on the question memory weight bar, the more memory content that word holds.
Based on the above analysis, the video memory model and the question memory model provided by this embodiment are not limited to the field of video question answering. In some embodiments, they may also be applied in many fields related to video understanding and cross-modal information analysis (e.g., video retrieval, video understanding, and video-text matching). In other embodiments, they may also be applied in fields such as search and recommendation, so as to retrieve video content from a text query, or retrieve the corresponding text from a video.
And S1052, determining answer information corresponding to the question to be answered according to the first target characteristic information and the second target characteristic information.
Wherein, the S1052 may specifically include:
S5-1, obtaining a first target characteristic matrix according to first target characteristic information corresponding to a plurality of video frames, and obtaining a second target characteristic matrix according to second target characteristic information corresponding to a plurality of word characteristic information.
Specifically, when the first target feature information includes target dynamic feature information, target static feature information and target global feature information, the state parameter values of the hidden layer h_v in the video memory model at each time (i.e., the target global feature information corresponding to all video frames) may be concatenated to obtain the first target feature matrix V_id. Accordingly, the state parameter values of the hidden layer h_q in the question memory model at each time (i.e., the second target feature information corresponding to all word feature information) may be concatenated to obtain the second target feature matrix T_ex.
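Continuing the NumPy sketch above, this concatenation amounts to stacking the per-time hidden states row-wise (the list names h_v_states and h_q_states are assumptions):

# Per-time state parameter values collected from the two memory models
V_id = np.stack(h_v_states)  # (num_frames, d): hidden layer h_v over all video frames
T_ex = np.stack(h_q_states)  # (num_words, d): hidden layer h_q over all word features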
S5-2, inputting the first target characteristic matrix into the trained first self-attention model for processing to obtain first semantic remote dependence information of the first target characteristic information, and inputting the second target characteristic matrix into the trained second self-attention model for processing to obtain second semantic remote dependence information of the second target characteristic information.
Specifically, the first and second self-attention models described above may be self-attention models based on the scaled dot-product attention mechanism shown in formula (17):

Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V    (17)

wherein d_k is the dimension of the key vectors.
Where Q, K and V represent query (query), key (key), and value (value), respectively.
In one embodiment, the first self-attention model described above may be as shown in equation (18).
V_0 = Attention(V_id W_q, V_id W_k, V_id W_v)    (18)
The second self-attention model described above may be as shown in formula (19):

T_0 = Attention(T_ex W_q, T_ex W_k, T_ex W_v)    (19)
Where W_q, W_k and W_v are learnable parameters. Specifically, the purpose of the first self-attention model is to extract from the output of the video memory model the semantic remote dependency relationships among the video feature information, so as to obtain the corresponding first semantic remote dependency information; the purpose of the second self-attention model is to extract from the output of the question memory model the semantic remote dependency relationships among the question feature information, so as to obtain the corresponding second semantic remote dependency information. In this way, by using self-attention models based on the scaled dot-product attention mechanism as non-local network models, the global dependency relationships of the feature information output by the memory models can be better extracted. For example, as shown in fig. 8, the first self-attention model and the second self-attention model respectively attend to the global dependency relationships of the feature information output by the video memory model and the question memory model, so as to obtain the self-attention weights of the video and of the question and thereby determine the correct answer "throw"; the deeper the color of a video frame on the video self-attention weight bar, the more important that video frame is, and the deeper the color of a word of the question to be solved on the question self-attention weight bar, the more important that word is.
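As a brief sketch of this self-attention step (the single-head form and the projection-weight names are assumptions), reusing the softmax helper from the sketch above:

def scaled_dot_product_attention(Q, K, V):
    # Formula (17): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def self_attend(X, W_q, W_k, W_v):
    # Formulas (18)/(19): self-attention over a target feature matrix,
    # X = V_id (video) or X = T_ex (question), shape (n, d)
    return scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

# Usage (weight names assumed):
# V_0 = self_attend(V_id, W_q_v, W_k_v, W_v_v)  # first semantic remote dependency info
# T_0 = self_attend(T_ex, W_q_t, W_k_t, W_v_t)  # second semantic remote dependency info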
And S5-3, according to the first semantic remote dependency information and the second semantic remote dependency information, determining answer information corresponding to the question to be answered.
Wherein, the S5-3 may specifically include:
And S5-3-1, fusing the first semantic remote dependency information and the second semantic remote dependency information by using the trained mutual attention model to obtain final characteristic information.
Specifically, the attention A of the video to be analyzed to the question to be solved can be calculated by the following calculation formula (20):

A = softmax(S) T_0    (20)

Also, the attention B of the question to be solved to the video to be analyzed can be calculated by the following calculation formula (21):

B = softmax(S) softmax(Sᵀ) V_0    (21)

wherein S is a weight matrix measuring the similarity between the video feature information and the question feature information, for example S = V_0 W_s T_0ᵀ, where W_s is a learnable parameter. Then, the final feature information O for answer prediction may be generated using the trained mutual attention model as shown in calculation formula (22):

O = Concat(V_0, A, V_0 ⊙ A, V_0 ⊙ B)    (22)

wherein ⊙ denotes the element-wise product and Concat is the splicing (concatenation) function.
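Under the same assumptions (in particular, the bilinear form of S and the softmax axes chosen here are reconstructions, not confirmed by the source), the mutual-attention fusion of formulas (20) to (22) might be sketched as:

def mutual_attention_fusion(V0, T0, W_s):
    # V0: (n, d) video features; T0: (m, d) question features
    S = V0 @ W_s @ T0.T                                   # (n, m) weight matrix
    A = softmax(S, axis=1) @ T0                           # (20) video -> question
    B = softmax(S, axis=1) @ softmax(S.T, axis=1) @ V0    # (21) question -> video
    # (22) final feature information O for answer prediction
    return np.concatenate([V0, A, V0 * A, V0 * B], axis=-1)  # (n, 4d)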
And S5-3-2, inputting the final characteristic information into the trained classification model to obtain an answer corresponding to the question to be solved.
The classification model may be a softmax classifier. In addition, during implementation, the classification model may be optimized with a hinge loss function (when the question to be solved is a multiple-choice question) or a cross-entropy loss function (when the question to be solved is an open-ended question).
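For this classification step, a hedged sketch assuming mean-pooling over the rows of O and a single linear layer (both assumptions): cross-entropy for open-ended questions, while a multiple-choice variant would score each candidate answer and apply a hinge loss instead.

def predict_answer(O, W_cls, b_cls):
    # Pool the final feature information over its rows, then classify
    logits = O.mean(axis=0) @ W_cls + b_cls
    probs = softmax(logits)
    return int(np.argmax(probs)), probs

def cross_entropy_loss(probs, answer_idx):
    # Loss for open-ended questions; small epsilon avoids log(0)
    return -float(np.log(probs[answer_idx] + 1e-12))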
It should be noted that this embodiment illustrates an implementation in which the video is memorized in a targeted manner under the guidance of the question, but the guidance is not limited to the question; in some embodiments, other information related to the video question-answering task may also be used as guidance for memorizing the video. In addition, the video memory model and the question memory model provided in this embodiment can effectively learn global context-aware information from the video to be analyzed and the question to be solved, and both have a larger memory capacity, that is, a stronger storage capability, than existing memory models.
As can be seen from the above, the video analysis method provided by this embodiment obtains a video to be analyzed and a question to be solved related to that video, determines at least one piece of video feature information corresponding to the video and the question feature information corresponding to the question, inputs the at least one piece of video feature information and the question feature information into a trained video memory model for processing so as to determine, from the video feature information, first target feature information related to the question, and then determines the answer information corresponding to the question according to the first target feature information and the question feature information. In this way, when performing semantic understanding and analysis of a video, the video can be memorized in a targeted manner under the guidance of the question, thereby improving the memory effect for long videos.
Based on the method in the foregoing embodiment, the present embodiment will be further described from the perspective of a video analysis apparatus, please refer to fig. 9, where fig. 9 specifically describes the video analysis apparatus provided in the present embodiment, which may include: an obtaining module 610, a first determining module 620, a second determining module 630, a third determining module 640, and a fourth determining module 650, wherein:
(1) acquisition Module 610
The obtaining module 610 is configured to obtain a video to be analyzed and a question to be solved related to the video to be analyzed.
(2) First determination module 620
The first determining module 620 is configured to determine at least one type of video feature information corresponding to a video to be analyzed.
The first determining module 620 specifically includes:
an extraction unit for extracting a plurality of video frames from a video to be analyzed;
the first determining unit is used for determining at least one type of video characteristic information corresponding to each video frame.
(3) Second determination module 630
The second determining module 630 is configured to determine question feature information corresponding to the question to be solved.
(4) Third determining module 640
And a third determining module 640, configured to input at least one of the video feature information and the question feature information into a trained video memory model for processing, so as to determine, from the video feature information, first target feature information related to the question to be solved.
The trained video memory model may include a first sub-model and a second sub-model, and the third determining module 640 may specifically include:
the second determining unit is used for sequentially inputting the at least one type of video characteristic information corresponding to the plurality of video frames into the first submodel according to the time sequence for processing so as to obtain first memory content corresponding to each video frame;
and the third determining unit is used for determining first target characteristic information related to the problem to be solved from the video characteristic information corresponding to each video frame according to at least one video characteristic information, the problem characteristic information, the first memory content and the second submodel corresponding to the plurality of video frames.
Specifically, the third determining unit may be configured to perform:
determining a current video frame from a plurality of video frames according to a time sequence, and acquiring first memory content and first target characteristic information corresponding to a previous video frame as first historical memory content and first historical characteristic information respectively;
inputting at least one video characteristic information, question characteristic information, first historical memory content and first historical characteristic information corresponding to a current video frame into a second submodel for processing, so that the second submodel determines first target characteristic information related to a question to be answered from the at least one video characteristic information corresponding to the current video frame;
and then the third determining unit updates the first memory content and the first target characteristic information corresponding to the current video frame into the first historical memory content and the first historical characteristic information respectively, updates the current video frame with the remaining video frames, and returns to re-execute the inputting of the at least one video characteristic information, the question characteristic information, the first historical memory content and the first historical characteristic information corresponding to the current video frame into the second submodel for processing.
In a specific embodiment, the at least one video feature information may include dynamic feature information and static feature information, the first target feature information may include target dynamic feature information, target static feature information, and target global feature information, and the second sub-model may specifically perform, when determining the first target feature information related to the question to be solved from the at least one video feature information corresponding to the current video frame, the following steps:
according to the dynamic characteristic information, the first historical memory content, the first historical characteristic information and the question characteristic information corresponding to the current video frame, determining target dynamic characteristic information related to the question to be answered from the dynamic characteristic information corresponding to the current video frame;
according to the static characteristic information, the first historical memory content, the first historical characteristic information and the question characteristic information corresponding to the current video frame, determining target static characteristic information related to the question to be answered from the static characteristic information corresponding to the current video frame;
and determining target global feature information corresponding to the current video frame and related to the question to be solved according to the first historical memory content, the first historical feature information and the question feature information.
(5) Fourth determination Module 650
The fourth determining module 650 is configured to determine answer information corresponding to the question to be answered according to the first target feature information and the question feature information.
The fourth determining module 650 may specifically include:
the fourth determining unit is used for inputting at least one type of video characteristic information and problem characteristic information into the trained problem memory model for processing so as to determine second target characteristic information related to the video to be analyzed from the problem characteristic information;
and the fifth determining unit is used for determining answer information corresponding to the question to be answered according to the first target characteristic information and the second target characteristic information.
In an embodiment, the question feature information may contain a plurality of word feature information, the trained question memory model may include a third submodel and a fourth submodel, and the fourth determining unit may specifically include:
the first determining subunit is used for sequentially inputting the plurality of word characteristic information into the third submodel according to the word sequence of the question to be solved and processing the word characteristic information to obtain second memory content corresponding to each word characteristic information;
and the second determining subunit is used for determining second target characteristic information related to the video to be analyzed from each word characteristic information according to the plurality of word characteristic information, the at least one video characteristic information, the second memory content and the fourth submodel.
Specifically, the second determining subunit may be configured to perform:
determining current word characteristic information from the plurality of word characteristic information according to the word sequence, and acquiring second memory content and second target characteristic information corresponding to the previous word characteristic information as second historical memory content and second historical characteristic information respectively;
inputting the current word feature information, at least one video feature information, second historical memory content and second historical feature information into a fourth submodel for processing, so that the fourth submodel determines second target feature information related to the video to be analyzed from the current word feature information;
and respectively updating the second memory content and the second target characteristic information corresponding to the current word characteristic information into the second historical memory content and the second historical characteristic information, updating the current word characteristic information by using the remaining word characteristic information, and then returning to the second determining subunit to re-execute the inputting of the current word characteristic information, the at least one video characteristic information, the second historical memory content and the second historical characteristic information into the fourth submodel for processing.
In another embodiment, the fifth determining unit may specifically include:
the third determining subunit is configured to obtain a first target feature matrix according to the first target feature information corresponding to the multiple video frames, and obtain a second target feature matrix according to the second target feature information corresponding to the multiple word feature information;
the fourth determining subunit is configured to input the first target feature matrix into the trained first self-attention model for processing to obtain first semantic remote dependency information of the first target feature information, and input the second target feature matrix into the trained second self-attention model for processing to obtain second semantic remote dependency information of the second target feature information;
and the fifth determining subunit is used for determining answer information corresponding to the question to be solved according to the first semantic remote dependency information and the second semantic remote dependency information.
In specific implementation, each of the foregoing units, subunits and modules may be implemented as an independent entity, or may be arbitrarily combined and implemented as one or several entities; for the specific implementation of each of them, reference may be made to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, the video analysis apparatus provided by this embodiment includes an obtaining module for obtaining a video to be analyzed and a question to be solved related to the video to be analyzed; a first determining module for determining at least one piece of video feature information corresponding to the video to be analyzed; a second determining module for determining question feature information corresponding to the question to be solved; a third determining module for inputting the at least one piece of video feature information and the question feature information into a trained video memory model for processing, so as to determine first target feature information related to the question to be solved from the video feature information; and a fourth determining module for determining answer information corresponding to the question to be solved according to the first target feature information and the question feature information. In this way, when performing semantic understanding of a video, the video can be memorized in a targeted manner under the guidance of the question, thereby improving the memory effect for long videos.
Correspondingly, an embodiment of the present application further provides a server, where the server may be a single server, or may be a server cluster composed of multiple servers, as shown in fig. 10, which shows a schematic structural diagram of a server according to an embodiment of the present application, and specifically:
the server may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, Radio Frequency (RF) circuitry 403, a power supply 404, an input unit 405, and a display unit 406. Those skilled in the art will appreciate that the server architecture shown in FIG. 10 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the server. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the server, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The RF circuit 403 may be used for receiving and transmitting signals during information transmission and reception, and in particular, for receiving downlink information of a base station and then processing the received downlink information by the one or more processors 401; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 403 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 403 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The server also includes a power supply 404 (e.g., a battery) for powering the various components, and preferably, the power supply 404 is logically connected to the processor 401 via a power management system, so that functions such as managing charging, discharging, and power consumption are performed via the power management system. The power supply 404 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The server may further include an input unit 405, and the input unit 405 may be used to receive input numeric or character information and generate a keyboard, mouse, joystick, optical or trackball signal input in relation to user settings and function control. Specifically, in one particular embodiment, input unit 405 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 401, and can receive and execute commands sent by the processor 401. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 405 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The server may also include a display unit 406, and the display unit 406 may be used to display information input by or provided to the user as well as various graphical user interfaces of the server, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 406 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 401 to determine the type of the touch event, and then the processor 401 provides a corresponding visual output on the display panel according to the type of the touch event. Although in FIG. 10 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
Although not shown, the server may further include a camera, a bluetooth module, etc., which will not be described herein. Specifically, in this embodiment, the processor 401 in the server loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring a video to be analyzed and a problem to be solved related to the video to be analyzed;
determining at least one video characteristic information corresponding to a video to be analyzed;
determining problem characteristic information corresponding to a problem to be solved;
inputting at least one type of video characteristic information and question characteristic information into a trained video memory model for processing so as to determine first target characteristic information related to a question to be solved from the video characteristic information;
and determining answer information corresponding to the question to be answered according to the first target characteristic information and the question characteristic information.
The server can achieve the beneficial effects achievable by any of the video analysis apparatuses provided in the embodiments of the present application; for details, refer to the foregoing embodiments, which are not described herein again.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The video analysis method, apparatus and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and core ideas of the present application. Meanwhile, for those skilled in the art, there may be changes in the specific implementation and application scope according to the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of video analysis, comprising:
acquiring a video to be analyzed and a problem to be solved related to the video to be analyzed;
determining at least one video characteristic information corresponding to the video to be analyzed;
determining question characteristic information corresponding to the question to be solved;
inputting the at least one type of video characteristic information and the question characteristic information into a trained video memory model for processing so as to determine first target characteristic information related to the question to be solved from the video characteristic information;
and determining answer information corresponding to the question to be answered according to the first target characteristic information and the question characteristic information.
2. The video analysis method according to claim 1, wherein the trained video memory model includes a first submodel and a second submodel, and the determining at least one video feature information corresponding to the video to be analyzed specifically includes:
extracting a plurality of video frames from the video to be analyzed;
determining at least one video characteristic information corresponding to each video frame;
the inputting the at least one piece of video feature information and the question feature information into a trained video memory model for processing to determine first target feature information related to the question to be answered from the video feature information specifically includes:
sequentially inputting the at least one video characteristic information corresponding to the plurality of video frames into the first submodel according to a time sequence for processing so as to obtain a first memory content corresponding to each video frame;
and according to the at least one type of video characteristic information, the question characteristic information, the first memory content and the second submodel corresponding to the plurality of video frames, determining first target characteristic information related to the question to be solved from the video characteristic information corresponding to each video frame.
3. The video analysis method according to claim 2, wherein the determining, according to the at least one of the video feature information, the question feature information, the first memory content, and the second submodel corresponding to the plurality of video frames, first target feature information related to the question to be solved from the video feature information corresponding to each of the video frames specifically comprises:
determining a current video frame from the plurality of video frames according to the time sequence, and acquiring first memory content and first target characteristic information corresponding to a previous video frame as first historical memory content and first historical characteristic information respectively;
inputting the at least one video characteristic information, the question characteristic information, the first historical memory content and the first historical characteristic information corresponding to the current video frame into the second submodel for processing, so that the second submodel determines first target characteristic information related to the question to be solved from the at least one video characteristic information corresponding to the current video frame;
updating the first memory content and the first target characteristic information corresponding to the current video frame into the first historical memory content and the first historical characteristic information respectively, updating the current video frame by using the remaining video frames, and then returning to execute the step of inputting the at least one video characteristic information, the question characteristic information, the first historical memory content and the first historical characteristic information corresponding to the current video frame into the second submodel for processing.
4. The video analysis method according to claim 3, wherein the at least one type of video feature information includes dynamic feature information and static feature information, the first target feature information includes target dynamic feature information, target static feature information and target global feature information, and the determining, from the at least one type of video feature information corresponding to the current video frame, first target feature information related to the question to be solved specifically includes:
according to the dynamic characteristic information, the first historical memory content, the first historical characteristic information and the question characteristic information corresponding to the current video frame, determining target dynamic characteristic information related to the question to be answered from the dynamic characteristic information corresponding to the current video frame;
according to the static feature information, the first historical memory content, the first historical feature information and the question feature information corresponding to the current video frame, determining target static feature information related to the question to be answered from the static feature information corresponding to the current video frame;
and determining target global feature information corresponding to the current video frame and related to the question to be solved according to the first historical memory content, the first historical feature information and the question feature information.
5. The video analysis method according to claim 2, wherein the determining answer information corresponding to the question to be answered according to the first target feature information and the question feature information specifically comprises:
inputting the at least one video characteristic information and the problem characteristic information into a trained problem memory model for processing so as to determine second target characteristic information related to the video to be analyzed from the problem characteristic information;
and determining answer information corresponding to the question to be answered according to the first target characteristic information and the second target characteristic information.
6. The video analysis method according to claim 5, wherein the question feature information includes a plurality of word feature information, the trained question memory model includes a third sub-model and a fourth sub-model, and the inputting the at least one of the video feature information and the question feature information into the trained question memory model for processing to determine a second target feature information related to the video to be analyzed from the question feature information specifically includes:
sequentially inputting the plurality of word characteristic information into the third submodel according to the word sequence of the question to be solved for processing to obtain second memory content corresponding to each word characteristic information;
and determining second target characteristic information related to the video to be analyzed from each word characteristic information according to the plurality of word characteristic information, the at least one video characteristic information, the second memory content and the fourth submodel.
7. The video analysis method according to claim 6, wherein the determining, from each of the word feature information, second target feature information related to the video to be analyzed according to the word feature information, the at least one piece of video feature information, the second memory content, and the fourth submodel, specifically comprises:
determining current word feature information from the plurality of word feature information according to the word sequence, and acquiring second memory content and second target feature information corresponding to the previous word feature information as second historical memory content and second historical feature information respectively;
inputting the current word feature information, the at least one video feature information, the second historical memory content and the second historical feature information into the fourth submodel for processing, so that the fourth submodel determines second target feature information related to the video to be analyzed from the current word feature information;
and updating the second memory content and the second target characteristic information corresponding to the current word characteristic information into the second historical memory content and the second historical characteristic information respectively, updating the current word characteristic information by using the remaining word characteristic information, and returning to execute the step of inputting the current word characteristic information, the at least one video characteristic information, the second historical memory content and the second historical characteristic information into the fourth submodel for processing.
8. The video analysis method according to claim 6, wherein the determining answer information corresponding to the question to be answered according to the first target feature information and the second target feature information specifically comprises:
obtaining a first target characteristic matrix according to the first target characteristic information corresponding to the plurality of video frames, and obtaining a second target characteristic matrix according to the second target characteristic information corresponding to the plurality of word characteristic information;
inputting the first target characteristic matrix into a trained first self-attention model for processing to obtain first semantic remote dependency information of the first target characteristic information, and inputting the second target characteristic matrix into a trained second self-attention model for processing to obtain second semantic remote dependency information of the second target characteristic information;
and determining answer information corresponding to the question to be answered according to the first semantic remote dependency information and the second semantic remote dependency information.
9. A video analysis apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a video to be analyzed and a problem to be solved related to the video to be analyzed;
the first determination module is used for determining at least one video characteristic information corresponding to the video to be analyzed;
the second determination module is used for determining question characteristic information corresponding to the question to be solved;
a third determining module, configured to input the at least one piece of video feature information and the question feature information into a trained video memory model for processing, so as to determine, from the video feature information, first target feature information related to the question to be solved;
and the fourth determining module is used for determining answer information corresponding to the question to be answered according to the first target characteristic information and the question characteristic information.
10. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded by a processor for performing the steps of the video analysis method according to any one of claims 1 to 8.