CN111652202B - Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model - Google Patents


Info

Publication number
CN111652202B
CN111652202B (application CN202010795917.0A)
Authority
CN
China
Prior art keywords
space
video
time
target
model
Prior art date
Legal status
Active
Application number
CN202010795917.0A
Other languages
Chinese (zh)
Other versions
CN111652202A (en)
Inventor
赵洲 (Zhou Zhao)
何金铮 (Jinzheng He)
金韦克 (Weike Jin)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010795917.0A priority Critical patent/CN111652202B/en
Publication of CN111652202A publication Critical patent/CN111652202A/en
Application granted granted Critical
Publication of CN111652202B publication Critical patent/CN111652202B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The invention discloses a method and a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model, and belongs to the field of video question-answer text generation. First, for a training set of videos, questions and answers, an object detector extracts target-level information from each video frame. Second, an adaptive spatio-temporal graph model learns dynamic representations of the targets from this target-level information. Finally, a Transformer model learns the relations between the visual and textual information, enhancing visual question-answering performance. Compared with common video question-answering solutions, the invention uses the adaptive spatio-temporal graph model to better capture the spatio-temporal dynamics of targets, links the same objects across different video frames to better capture dynamic information, and improves the video-language model by pre-training on picture-language data, improving the effect on video question answering.

Description

Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model
Technical Field
The invention relates to the field of video question-answer text generation, and in particular to a method and a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model.
Background
A hot spot in visual-language research is the understanding of visual content, language semantics and their interrelationships, and video question answering is one of its typical tasks. Recently, several BERT-style visual-language pre-training methods have been proposed and shown to be effective on various tasks. In this work, the invention likewise addresses video question answering with a visual-language Transformer.
In existing technical solutions, for example, ViLBERT and LXMERT are pre-trained with masking based on intra-modality or cross-modality relations, a training method very similar to BERT's. However, labeled video data is very scarce while pre-training requires large amounts of data, so these methods are not ideal. To address the data problem, approaches such as VisualBERT and CBT attempt self-supervised pre-training on large amounts of unlabeled data from video websites. However, because the visual features of videos are more dynamic and diverse, not enough structured information is available, so the pre-training effect is unsatisfactory; moreover, such pre-training requires massive computational resources, which is difficult with only a few GPUs.
In addition, prior-art models usually attend only to spatial modeling or only to temporal modeling; lacking spatio-temporal relations, their modeling is insufficient. Temporal modeling also usually considers only the relation between the first and last frames, so it performs poorly on long videos.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides a method and a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model.
The method of the invention comprises the following steps.
1) For a given video, extract the target-level features in each video frame using object detection, and combine the target-level features to obtain the initial region features for the video frames.
2) Construct a spatio-temporal graph model consisting of multiple spatio-temporal graph layers, each layer comprising a spatial graph model and a temporal graph model; the spatial graph model performs a spatial update on the region features.
Construct an anchor tube for each target region in the video frames and update the anchor tubes frame by frame; the targets in each anchor tube, arranged in temporal order, form a space-time tube. The targets in the space-time tubes form a temporal graph, which performs a temporal update on the spatially updated region features.
Take the initial region features obtained in step 1) as the input of the first spatio-temporal graph layer and the temporally updated region features output by each layer as the input of the next layer, forming a spatio-temporal graph model composed of multiple layers; the output of the last layer serves as the final output of the spatio-temporal graph model and, after temporal GRU encoding, yields the video tube-level representation.
3) Construct a video-language Transformer model comprising the spatio-temporal graph model of step 2) and a Transformer model; take the question sentence and the video tube-level representation output in step 2) as the inputs of the Transformer model, and train the video-language Transformer model against the standard answers to the questions.
4) For a question sentence to be processed, directly obtain the answer with the trained video-language Transformer model.
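To make the flow of steps 1)-4) concrete, the following is a minimal runnable sketch of the inference pipeline. The dummy detector, the same-index tube matching (in place of the anchor-tube procedure detailed later), the dimensions and all names are illustrative assumptions, not the patent's reference implementation.

```python
# Minimal sketch of the inference pipeline of steps 1)-4), with dummy stand-ins
# for the detector and the learned modules.
import torch
import torch.nn as nn

T, N, D = 8, 5, 256              # frames, objects per frame, feature size (assumed)

def detect_objects(frames):      # step 1): stand-in for a real object detector
    return torch.randn(T, N, D)  # initial region features f[t, i]

class SpatioTemporalGraphLayer(nn.Module):  # one layer of step 2)
    def __init__(self, d):
        super().__init__()
        self.w_s = nn.Linear(d, d, bias=False)  # spatial update weights
        self.w_t = nn.Linear(d, d, bias=False)  # temporal update weights
    def forward(self, p):                       # p: (T, N, D)
        g_s = torch.softmax(p @ p.transpose(1, 2), dim=-1)  # per-frame similarity
        v_s = self.w_s(g_s @ p)                             # V_s = G_s P_s W_s
        tubes = v_s.transpose(0, 1)     # naive tubes: same index across frames
        g_t = torch.softmax(tubes @ tubes.transpose(1, 2), dim=-1)
        v_t = self.w_t(g_t @ tubes)                         # V_t = G_t P_t W_t
        return v_t.transpose(0, 1)

frames = None                    # a real system would decode the video here
p = detect_objects(frames)
for layer in [SpatioTemporalGraphLayer(D) for _ in range(2)]:  # M = 2 layers
    p = layer(p)
gru = nn.GRU(D, D, batch_first=True)
_, tube_repr = gru(p.transpose(0, 1))   # tube-level representation per tube
# steps 3)/4): tube_repr and the tokenized question would feed a Transformer
# that outputs the answer; omitted here.
print(tube_repr.shape)                  # (1, N, D)
```

A real system would replace detect_objects with an actual object detector and the naive same-index matching with the anchor-tube procedure described below.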
It is another object of the present invention to provide a system for implementing the above method.
The system specifically comprises the following modules:
A target detection module: detects the target-level features in each video frame, marks out candidate boxes, and outputs target position features, target label features and region geometric features.
An initial region feature combination module: combines the target-level features output by the target detection module to obtain the initial region features.
A multi-layer spatio-temporal graph module: configures the spatio-temporal graph model, which consists of multiple spatio-temporal graph layers; each layer has an input interface and an output interface, the output interface of each layer is connected to the input interface of the next layer, the input interface of the first layer reads the initial region features output by the initial region feature combination module, and each layer performs one temporal update.
A GRU encoding module: encodes, in temporal order, the output of the multi-layer spatio-temporal graph module to obtain the video tube-level representation.
A Transformer module: configures a Transformer model whose two input ports respectively read the question sequence and the video tube-level representation output by the GRU encoding module, and outputs the answer to the question.
A parameter updating module: updates the parameters of the multi-layer spatio-temporal graph module and the Transformer module during the training phase.
Compared with the prior art, the invention has the following beneficial effects.
(1) The invention utilizes an image-language pre-trained Transformer module to help video-language modeling. Image-language data and pre-trained models are abundant, whereas labeled, structured video-language data is very scarce and pre-training consumes very large computational resources; the invention thus overcomes the unsatisfactory pre-training effect and extreme resource consumption of the traditional approach of pre-training on video-language data and achieves a large improvement in effect.
(2) The invention models the spatio-temporal relations of targets with the adaptive spatio-temporal graph model, whereas traditional methods usually attend to spatial or temporal modeling alone and do not integrate the two.
(3) When updating anchor tubes and building space-time tubes, traditional methods usually focus mainly on the first and last frames and neglect the intermediate-frame information, which strongly affects longer videos. The invention adopts a frame-by-frame updating method and, by means such as a designed threshold, successfully models the target information of the intermediate frames, ensuring its effectiveness on long videos.
Drawings
FIG. 1 is a schematic overall flow chart of the method of the invention for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in FIG. 1, the method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model mainly includes the following steps.
1) For the input video, extract the target-level features of each video frame using object detection and derive the initial region features.
2) Update the initial region features with the adaptive spatio-temporal graph model to learn dynamic target features, obtaining the video tube-level features.
3) Construct a video-language Transformer model comprising the spatio-temporal graph model of step 2) and a Transformer model; take the question sentence and the video tube-level representation output in step 2) as the inputs of the Transformer model, tokenizing the question to be answered before inputting it, and train the video-language Transformer model against the standard answers to the questions.
4) For a question sentence to be processed, directly obtain the answer with the trained video-language Transformer model.
In one embodiment of the present invention, a video-language Transformer model is constructed in which the Transformer model is pre-trained on a picture-language dataset. The output of the Transformer model is passed through a multi-layer perceptron and compared against the standard answers to train the video-language Transformer model.
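A minimal sketch of this training step, assuming a Transformer encoder over the concatenated question tokens and tube representations followed by a multi-layer perceptron scored against the standard answer with cross-entropy; all dimensions, the pooling choice and the answer-vocabulary treatment are assumptions rather than the patent's exact configuration.

```python
# Sketch: joint question/tube encoding, MLP head, cross-entropy vs. the answer.
import torch
import torch.nn as nn

D, N_TOKENS, N_TUBES, N_ANSWERS = 256, 12, 5, 1000

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4)                          # stands in for the pre-trained Transformer
mlp = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, N_ANSWERS))

question_emb = torch.randn(1, N_TOKENS, D)     # embedded question tokens
tube_repr = torch.randn(1, N_TUBES, D)         # video tube-level representations
joint = torch.cat([question_emb, tube_repr], dim=1)   # one joint input sequence

logits = mlp(encoder(joint).mean(dim=1))       # pool, then MLP over the answer set
loss = nn.functional.cross_entropy(logits, torch.tensor([42]))  # standard answer id
loss.backward()                                # gradients for the training phase
```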
In one specific implementation of the present invention, the initial region features obtained in step 1) are produced by the target detector and the feature combination module. First, the target detector generates the following features: the object location features R^t = {r_i^t} ∈ R^(N_t × d), the object label features O^t = {o_i^t} ∈ R^(N_t × c), and the region geometric features G^t = {g_i^t} ∈ R^(N_t × l), where d denotes the feature size, c the number of object classes, and l the dimension of the geometric information. These features are combined by the formula

f_i^t = W_1 [W_2 r_i^t ; W_3 o_i^t ; W_4 g_i^t]

where W_1, W_2, W_3, W_4 are all weight matrices and [ ; ] denotes concatenation. N_t denotes the number of objects in the t-th frame; the subscript i indexes the objects and the superscript t indexes the frame in the video. Here r_i^t is the location feature of the i-th target in frame t, o_i^t is its label feature, g_i^t is its geometric feature, and f_i^t is the combined feature of the i-th target of frame t, i.e., the initial region feature.
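Under the combination formula reconstructed above, a minimal sketch of the feature fusion might look as follows; the concatenation-then-projection form, the dimensions and the module names are assumptions for illustration.

```python
# Sketch: project position, label and geometric features, then fuse them
# into one initial region feature per detected object.
import torch
import torch.nn as nn

d, c, l = 2048, 1600, 4                    # feature / class / geometry dims (assumed)
w1 = nn.Linear(3 * 512, 512, bias=False)   # W_1: fuses the projected parts
w2 = nn.Linear(d, 512, bias=False)         # W_2: projects location feature r
w3 = nn.Linear(c, 512, bias=False)         # W_3: projects label feature o
w4 = nn.Linear(l, 512, bias=False)         # W_4: projects geometric feature g

def combine(r, o, g):
    # f_i^t = W_1 [ W_2 r_i^t ; W_3 o_i^t ; W_4 g_i^t ]
    return w1(torch.cat([w2(r), w3(o), w4(g)], dim=-1))

n_t = 5                                    # objects detected in frame t
f = combine(torch.randn(n_t, d), torch.randn(n_t, c), torch.randn(n_t, l))
print(f.shape)                             # (5, 512) initial region features
```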
In one embodiment of the present invention, the video tube-level representation is obtained as follows:
2.1) Construct a spatial graph model, compute the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially update the region features.
2.2) Construct anchor tubes and space-time tubes:
The invention links the same object across different video frames; this structure is called a space-time tube. The invention constructs space-time tubes using anchor tubes, where an anchor tube is the sequence formed by the same target across different video frames, and an anchor tube set is a set of multiple different anchor tubes.
Extract the target regions in the first video frame to initialize the anchor tube set; compute similarity scores between the targets in the current video frame and the targets in the anchor tubes; if a similarity score exceeds a threshold, add the target in the current video frame to the corresponding anchor tube, otherwise add it to the anchor tube set as a new anchor tube; arrange the target sequence of each anchor tube in temporal order to form a space-time tube.
2.3) Construct the temporal graph model from the updated space-time tubes, building a temporal graph from the targets in each space-time tube.
2.4) Construct a graph convolutional network on the temporal graph obtained in step 2.3), and temporally update the spatially updated region features according to the spatially updated region features of step 2.1) and the similarity scores obtained in step 2.2).
2.5) Steps 2.1) to 2.4) form one spatio-temporal graph layer. Repeat them, taking the temporally updated region features output by each layer as the input of the next layer, to form a spatio-temporal graph model composed of multiple layers; the output of the last layer serves as the final output of the spatio-temporal graph model and, after temporal GRU encoding (see the sketch below), yields the video tube-level representation. The whole process 2.1)-2.4) is repeated M times, where M is the number of layers of the adaptive spatio-temporal graph model; M is preferably 2.
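A minimal sketch of the temporal GRU encoding of step 2.5), assuming each space-time tube is encoded independently and its last hidden state is kept as the tube-level representation; the tube lengths and the hidden size are illustrative.

```python
# Sketch: run each time-ordered tube through a GRU; keep the last hidden state.
import torch
import torch.nn as nn

D = 256
gru = nn.GRU(input_size=D, hidden_size=D, batch_first=True)

# three space-time tubes of different lengths (a tube need not span every frame)
tubes = [torch.randn(1, t_len, D) for t_len in (8, 5, 3)]
tube_reprs = torch.cat([gru(tube)[1].squeeze(0) for tube in tubes])
print(tube_reprs.shape)    # (3, D): one video tube-level vector per tube
```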
The similarity in step 2.1) is the similarity between different regions within the same video frame, computed with trainable matrices; the similarity score in step 2.2) is the similarity between regions in different video frames, obtained by averaging the visual-feature similarity, the graph-updated-feature similarity and the spatial-geometric-feature similarity.
Specifically, the method comprises the following steps:
for each sampled video frame, to obtain the association between each two objects, the present invention constructs a spatial map in which highly correlated instances are given a higher confidence value. To achieve this, the invention adopts
Figure GDA0002725040150000051
The similarity between each two regions is described. Where W ', W' is a trainable matrix, the indices i and j denote the ith, jth object, and the superscript t denotes the tth frame. The result of the similarity calculation constitutes a similarity matrix Gs,GsRegularization is performed by the softmax function. Then, the invention adopts a graph convolution neural network on the space graph and utilizes the space updating formula Vs=GsPsWsTo update the target feature. Wherein, VsFor spatially updated regional features, GsWhich is a similarity matrix, can be viewed as an adjacency matrix. PsAdopting the initial region characteristics in an initial state for the region characteristics output by the previous layer of space-time diagram; wsIs a learnable weight matrix.
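A minimal sketch of this spatial update for a single frame, under the bilinear-similarity reading above; sizes and initialization are assumptions.

```python
# Sketch: pairwise bilinear scores -> softmax G_s -> V_s = G_s P_s W_s.
import torch
import torch.nn as nn

n, d = 5, 256                              # objects in the frame, feature size
w_prime  = nn.Linear(d, d, bias=False)     # W'
w_dprime = nn.Linear(d, d, bias=False)     # W''
w_s      = nn.Linear(d, d, bias=False)     # W_s

p_s = torch.randn(n, d)                            # region features P_s
scores = w_prime(p_s) @ w_dprime(p_s).T            # s_ij = (W' p_i)^T (W'' p_j)
g_s = torch.softmax(scores, dim=-1)                # adjacency-like matrix G_s
v_s = w_s(g_s @ p_s)                               # spatial update V_s = G_s P_s W_s
print(v_s.shape)                                   # (5, 256)
```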
The anchor tube set is initialized with the target regions of the first frame and is then dynamically updated according to the objects within the video. To achieve this, the invention calculates the similarity of two target regions in different frames using the following formula.
sim(i,j) = (S_a(i,j) + S_r(i,j) + S_g(i,j)) / 3
where S_a(i,j) denotes the similarity of visual appearance, S_r(i,j) the similarity of the two regions in terms of their graph-updated features, and S_g(i,j) the similarity of the spatial geometry of the two target regions. Using these similarity scores, a space-time tube is constructed for each target: starting from the second frame, each region is compared with the regions inside the anchor tubes and a similarity score is calculated; a threshold is set, and if the score exceeds it, the current region is added to the anchor tube.
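A minimal sketch of the frame-by-frame anchor-tube update; cosine similarity stands in for the averaged score sim(i,j), and the comparison against a tube's most recent region and the threshold value are assumptions.

```python
# Sketch: extend the best-matching anchor tube, or start a new one.
import torch

THRESHOLD = 0.5                          # assumed value

def sim(a, b):
    # stand-in: cosine similarity instead of (S_a + S_r + S_g) / 3
    return torch.cosine_similarity(a, b, dim=0).item()

def build_anchor_tubes(frames):          # frames: list of (n_t, d) tensors
    tubes = [[region] for region in frames[0]]     # initialize from frame 1
    for frame in frames[1:]:                       # update frame by frame
        for region in frame:
            scores = [sim(region, tube[-1]) for tube in tubes]
            best = max(range(len(scores)), key=lambda k: scores[k])
            if scores[best] > THRESHOLD:
                tubes[best].append(region)         # same object: extend tube
            else:
                tubes.append([region])             # unseen object: new tube
    return tubes                  # each tube, time-ordered, is a space-time tube

tubes = build_anchor_tubes([torch.randn(3, 16) for _ in range(4)])
print(len(tubes), [len(t) for t in tubes])
```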
S_a(i,j) is calculated as

S_a(i,j) = exp(-L(f_i^t, f_j^t'))

where L denotes the Euclidean distance, f_i^t the region feature of the i-th region of the t-th frame, and f_j^t' the region feature of the j-th region of the t'-th frame.
S_r(i,j) is calculated analogously on the graph-updated features as

S_r(i,j) = exp(-L(f̂_i^t, f̂_j^t'))

where f̂_i^t is the feature of the i-th target of the t-th frame after graph convolution updating, and f̂_j^t' the updated feature of the j-th region of the t'-th frame.
S_g(i,j) is calculated as

S_g(i,j) = (S_iou(i,j) + S_area(i,j)) / 2
S_iou(i,j) = Ar(area_i^t ∩ area_j^t') / Ar(area_i^t ∪ area_j^t')
S_area(i,j) = min(Ar(area_i^t), Ar(area_j^t')) / max(Ar(area_i^t), Ar(area_j^t'))

where area denotes the spatial extent of a target region and Ar denotes the area of a target region.
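A minimal sketch of the geometric term under the reconstruction above; the corner-coordinate box format (x1, y1, x2, y2) is an assumption.

```python
# Sketch: S_g averages an IoU score and a min/max area-ratio score of two boxes.
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def s_iou(b1, b2):
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = area((ix1, iy1, ix2, iy2)) if ix1 < ix2 and iy1 < iy2 else 0.0
    return inter / (area(b1) + area(b2) - inter)   # intersection over union

def s_area(b1, b2):
    a1, a2 = area(b1), area(b2)
    return min(a1, a2) / max(a1, a2)               # ratio of smaller to larger area

def s_g(b1, b2):
    return (s_iou(b1, b2) + s_area(b1, b2)) / 2    # S_g = (S_iou + S_area) / 2

print(s_g((0, 0, 10, 10), (5, 5, 15, 15)))         # equal-area boxes, partial overlap
```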
In the temporal update process, the regions in each space-time tube are arranged in temporal order and connected into an undirected graph, forming the temporal graph. A graph convolutional network is constructed on the temporal graph, with the similarity scores obtained in step 2.2) serving as the adjacency matrix G_t. The graph convolutional network performs the temporal update through the formula

V_t = G_t P_t W_t

where V_t is the region features after the temporal update, G_t is the similarity-score matrix, P_t is the spatially updated region features of the current spatio-temporal graph layer, and W_t is a trainable matrix.
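A minimal sketch of this temporal update for one space-time tube; the row-softmax normalization of G_t and all sizes are assumptions.

```python
# Sketch: similarity scores as adjacency G_t, then V_t = G_t P_t W_t.
import torch
import torch.nn as nn

t_len, d = 6, 256                      # regions in the tube, feature size
w_t = nn.Linear(d, d, bias=False)      # trainable matrix W_t

p_t = torch.randn(t_len, d)            # spatially updated features P_t
scores = torch.randn(t_len, t_len)     # sim(i,j) between tube regions (step 2.2)
g_t = torch.softmax(scores, dim=-1)    # adjacency matrix G_t
v_t = w_t(g_t @ p_t)                   # temporal update V_t = G_t P_t W_t
print(v_t.shape)                       # (6, 256)
```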
In another embodiment of the present invention, a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model is presented.
The system comprises:
a target detection module: the method is used for detecting target level features in each video frame, marking out candidate frames and outputting target position features, target label features and region geometric features.
Initial region feature combination module: the system comprises a target detection module, a target level detection module and a target level detection module, wherein the target level detection module is used for detecting target level characteristics of a target; the combination mode can be set according to needs, and the preferable combination mode is as follows:
Figure GDA0002725040150000071
wherein
Figure GDA0002725040150000072
Showing the initial regional characteristics of the ith target of the combined tth frame,
Figure GDA0002725040150000073
respectively representing the target position feature, the target label feature and the region geometric feature output by the target detection module.
A multi-level space-time diagram module: and configuring a space-time diagram model, wherein the space-time diagram model consists of multiple layers of space-time diagrams, each layer of space-time diagram comprises an input interface and an output interface, the output interface of the previous layer of space-time diagram is connected with the input interface of the next layer of space-time diagram, the input interface of the first layer of space-time diagram is used for reading the initial region characteristics output by the initial region characteristic combination module, and each layer of space-time diagram is subjected to timing diagram updating processing once.
GRU coding module: and the method is used for coding the timing sequence outputted by the multi-level space-time diagram module to obtain the representation of the video tube level.
A Transformer module: and a Transformer model is configured, and two input ports of the Transformer model respectively read the question sequence and the representation of the video tube level output by the GRU coding module and output the answer of the question to be solved.
A parameter updating module: and the method is used for updating the parameters of the multi-level space-time diagram module and the Transformer module in the training phase.
The multi-layer spatio-temporal graph module further comprises:
A spatial graph model: computes the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially updates the region features.
An anchor tube storage module: stores the anchor tube corresponding to each target in the video frames.
An anchor tube updating module: computes similarity scores between the targets in the current video frame and the targets in the anchor tube storage module; specifically, if the similarity score exceeds the threshold, the target in the current video frame is added to the corresponding anchor tube, otherwise it is added to the anchor tube set as a new anchor tube.
A space-time tube storage module: arranges the target sequence of each anchor tube in temporal order and can output a temporal graph on instruction; in particular, only the last spatio-temporal graph layer in the spatio-temporal graph model needs to output the final temporal graph on instruction.
The anchor tube storage module, the anchor tube updating module and the space-time tube storage module are integrated into a unified temporal graph model.
The similarity calculator in the spatial graph model computes the similarity between different regions within the same video frame using trainable matrices; the weight matrices in the calculation formula may be initialized with arbitrary values and learn their optimal values during training.
The similarity score calculator in the temporal graph model computes the similarity between regions in different video frames; it can be obtained by averaging the visual-feature similarity, the graph-updated-feature similarity and the spatial-geometric-feature similarity, or by other weightings.
In the specific embodiments provided in this application, it should be understood that the above system embodiments are merely illustrative. For example, the multi-layer spatio-temporal graph module is a logical functional division; other divisions are possible in actual implementations, e.g., multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, electrical or otherwise.
Examples
In order to further demonstrate the implementation effect of the invention, this embodiment tests the overall algorithm of the adaptive spatio-temporal graph model on multiple training sets.
Table 1: algorithm of adaptive space-time diagram model entirety
Figure GDA0002725040150000081
For image-language pre-training, the model of the invention is pre-trained on the Conceptual Captions dataset, which has approximately 3 million image-caption pairs, an order of magnitude larger than the COCO captioning dataset, and covers a broader range of pictures and descriptive text; it is therefore well suited to the image-language pre-training task. For the video-language task, the invention adopts two widely used video question-answering datasets, MSVD-QA and MSRVTT-QA, derived from the video captioning datasets MSVD and MSRVTT respectively. The MSVD-QA dataset consists of 1,970 video clips and 50,505 question-answer pairs. The MSRVTT-QA dataset is much larger, consisting of 10k video clips and 243k question-answer pairs. As shown in Tables 2-3, both datasets contain five general question types: what, who, how, when and where, of which the "what" and "who" types account for the major proportion.
In order to objectively evaluate the performance of the algorithm, the invention provides two versions, a basic version and a full version. The basic version is trained only on the video question-answering dataset, while the full version is additionally pre-trained on the picture-language dataset. The results on the MSVD-QA and MSRVTT-QA datasets are presented below.
Table 2. Results of the invention on the MSVD-QA test set

Question type    What    Who     How     When    Where   All
Basic version    22.8    54.0    74.3    72.4    53.6    35.4
Full version     24.6    53.6    75.7    70.7    53.6    36.3

Table 3. Results of the invention on the MSRVTT-QA test set

Question type    What    Who     How     When    Where   All
Basic version    30.0    46.8    82.7    77.3    36.0    36.3
Full version     31.1    48.5    81.2    77.7    34.0    37.6
All figures denote answer accuracy (%). The full version improves over the basic version, indicating that pre-training on the picture-language dataset is very effective, and good results are achieved on all five question types: "what", "who", "how", "when" and "where".
The foregoing merely illustrates specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can derive or conceive from the disclosure of the present invention shall be considered within the scope of the invention.

Claims (7)

1. A method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model, characterized by comprising the following steps:
1) for a given video, extracting the target-level features in each video frame using object detection, and combining the target-level features to obtain the initial region features for the video frames;
2) constructing a spatio-temporal graph model consisting of multiple spatio-temporal graph layers, each layer comprising a spatial graph model and a temporal graph model, the spatial graph model performing a spatial update on the region features;
constructing an anchor tube for each target region in the video frames and updating the anchor tubes frame by frame, the targets in each anchor tube being arranged in temporal order to form a space-time tube; the targets in the space-time tubes form a temporal graph, which performs a temporal update on the spatially updated region features;
taking the initial region features obtained in step 1) as the input of the first spatio-temporal graph layer and the temporally updated region features output by each layer as the input of the next layer, forming a spatio-temporal graph model composed of multiple layers; the temporal graph output by the last layer serves as the final output of the spatio-temporal graph model and, after temporal GRU encoding, yields the video tube-level representation;
step 2) is specifically:
2.1) constructing a spatial graph model, computing the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially updating the region features;
2.2) constructing anchor tubes and space-time tubes:
extracting the target regions in the first video frame to initialize the anchor tube set; computing similarity scores between the targets in the current video frame and the targets in the anchor tubes; if a similarity score exceeds a threshold, adding the target in the current video frame to the corresponding anchor tube, otherwise adding it to the anchor tube set as a new anchor tube; arranging the target sequence of each anchor tube in temporal order to form a space-time tube;
2.3) constructing the temporal graph model from the updated space-time tubes, the targets in each space-time tube forming a temporal graph;
2.4) constructing a graph convolutional network on the temporal graph obtained in step 2.3), and temporally updating the spatially updated region features according to the spatially updated region features of step 2.1) and the similarity scores obtained in step 2.2);
2.5) steps 2.1) to 2.4) forming one spatio-temporal graph layer, repeating these steps with the temporally updated region features output by each layer as the input of the next layer to form a spatio-temporal graph model composed of multiple layers; after the last layer is processed, outputting the temporal graph as the final output of the spatio-temporal graph model and, after temporal GRU encoding, obtaining the video tube-level representation;
3) constructing a video-language Transformer model comprising the spatio-temporal graph model of step 2) and a Transformer model, taking a question sentence and the video tube-level representation output in step 2) as the inputs of the Transformer model, and training the video-language Transformer model against the standard answers to the questions; the Transformer model is pre-trained on a picture-language dataset;
4) for a question sentence to be processed, directly obtaining the answer with the trained video-language Transformer model.
2. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that the similarity in step 2.1) is the similarity between different regions within the same video frame, computed with trainable matrices; the similarity score in step 2.2) is the similarity between regions in different video frames, obtained by averaging the visual-feature similarity, the graph-updated-feature similarity and the spatial-geometric-feature similarity.
3. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that the spatial update is calculated by the formula:
V_s = G_s P_s W_s
where V_s is the spatially updated region features; G_s is the similarity matrix; P_s is the region features output by the previous spatio-temporal graph layer, the initial region features being used in the initial state; and W_s is a trainable matrix.
4. The method of claim 1, wherein the temporal update is calculated by the following formula:
V_t = G_t P_t W_t
where V_t is the region features after the temporal update; G_t is the similarity-score matrix; P_t is the spatially updated region features of the current spatio-temporal graph layer; and W_t is a trainable matrix.
5. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that the spatio-temporal graph model consists of 2 spatio-temporal graph layers.
6. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that step 1) is specifically:
for a given video, extracting the target-level features in each video frame using object detection, the target-level features in each video frame comprising a target position feature R, a target label feature O and a region geometric feature G, and combining R, O and G as the initial region features for the video frame.
7. A system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model, for implementing the method of claim 1, the system comprising:
a target detection module: detects the target-level features in each video frame, marks out candidate boxes, and outputs target position features, target label features and region geometric features;
an initial region feature combination module: combines the target-level features output by the target detection module to obtain the initial region features;
a multi-layer spatio-temporal graph module: configures the spatio-temporal graph model, which consists of multiple spatio-temporal graph layers; each layer has an input interface and an output interface, the output interface of each layer is connected to the input interface of the next layer, the input interface of the first layer reads the initial region features output by the initial region feature combination module, and each layer performs one temporal update;
a GRU encoding module: encodes, in temporal order, the output of the multi-layer spatio-temporal graph module to obtain the video tube-level representation;
a Transformer module: configures a Transformer model whose two input ports respectively read the question sequence and the video tube-level representation output by the GRU encoding module, and outputs the answer to the question;
a parameter updating module: updates the parameters of the multi-layer spatio-temporal graph module and the Transformer module during the training phase;
the multi-layer spatio-temporal graph module comprising:
a spatial graph model: computes the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially updates the region features;
an anchor tube storage module: stores the anchor tube corresponding to each target in the video frames;
an anchor tube updating module: computes similarity scores between the targets in the current video frame and the targets in the anchor tube storage module, adds a target to the corresponding anchor tube if its similarity score exceeds the threshold, and otherwise adds it to the anchor tube set as a new anchor tube;
a space-time tube storage module: arranges the target sequence of each anchor tube in temporal order and outputs a temporal graph on instruction;
the anchor tube storage module, the anchor tube updating module and the space-time tube storage module being integrated into a unified temporal graph model.
CN202010795917.0A 2020-08-10 2020-08-10 Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model Active CN111652202B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010795917.0A CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010795917.0A CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model

Publications (2)

Publication Number Publication Date
CN111652202A CN111652202A (en) 2020-09-11
CN111652202B (en) 2020-12-01

Family

ID=72350273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010795917.0A Active CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model

Country Status (1)

Country Link
CN (1) CN111652202B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559698B (en) * 2020-11-02 2022-12-09 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN112287122A (en) * 2020-11-11 2021-01-29 济南浪潮高新科技投资发展有限公司 Multi-mode-based cross-media knowledge extraction method
CN112465008B (en) * 2020-11-25 2021-09-24 电子科技大学 Voice and visual relevance enhancement method based on self-supervision course learning
CN112488055B (en) * 2020-12-18 2022-09-06 贵州大学 Video question-answering method based on progressive graph attention network
CN112860847B (en) * 2021-01-19 2022-08-19 中国科学院自动化研究所 Video question-answer interaction method and system
CN112733789B (en) * 2021-01-20 2023-04-18 清华大学 Video reasoning method, device, equipment and medium based on dynamic space-time diagram
CN114120045B (en) * 2022-01-25 2022-05-31 北京猫猫狗狗科技有限公司 Target detection method and device based on multi-gate control hybrid expert model
CN114707022B (en) * 2022-05-31 2022-09-06 浙江大学 Video question-answer data set labeling method and device, storage medium and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 (Zhejiang University) A method for solving open-ended long-video question-answering tasks using a hierarchical convolutional self-attention network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019749B (en) * 2018-09-28 2021-06-15 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.) Method, apparatus, device and computer readable medium for generating VQA training data

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 (Zhejiang University) A method for solving open-ended long-video question-answering tasks using a hierarchical convolutional self-attention network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Graph WaveNet for Deep Spatial-Temporal Graph Modeling; Zonghan Wu et al.; arxiv.org; 2019-05-31; pp. 1-7 *
Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning; Junchao Zhang et al.; arxiv.org; 2019-06-30; pp. 1-10 *
Video Question Answering Based on Spatio-Temporal Attention Networks (基于时空注意力网络的视频问答); Yang Qifan (杨启凡); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2019-01-31; pp. 1-40 *

Also Published As

Publication number Publication date
CN111652202A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN108664589B (en) Text information extraction method, device, system and medium based on domain self-adaptation
CN108829756B (en) Method for solving multi-turn video question and answer by using hierarchical attention context network
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN104463101A (en) Answer recognition method and system for textual test question
Zhou et al. SSDA-YOLO: Semi-supervised domain adaptive YOLO for cross-domain object detection
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112949929B (en) Knowledge tracking method and system based on collaborative embedded enhanced topic representation
CN116091886A (en) Semi-supervised target detection method and system based on teacher student model and strong and weak branches
CN112861718A (en) Lightweight feature fusion crowd counting method and system
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN113807214B (en) Small target face recognition method based on deit affiliated network knowledge distillation
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN110135435B (en) Saliency detection method and device based on breadth learning system
CN116597136A (en) Semi-supervised remote sensing image semantic segmentation method and system
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
CN111144407A (en) Target detection method, system, device and readable storage medium
CN112926323B (en) Chinese named entity recognition method based on multistage residual convolution and attention mechanism
CN114169408A (en) Emotion classification method based on multi-mode attention mechanism
CN112860847A (en) Video question-answer interaction method and system
CN115797952B (en) Deep learning-based handwriting English line recognition method and system
CN117373111A (en) AutoHOINet-based human-object interaction detection method
CN116543250A (en) Model compression method based on class attention transmission

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant