CN111652202B - Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model - Google Patents


Info

Publication number
CN111652202B
CN111652202B (application CN202010795917.0A)
Authority
CN
China
Prior art keywords
space
video
time
target
model
Prior art date
Legal status
Active
Application number
CN202010795917.0A
Other languages
Chinese (zh)
Other versions
CN111652202A (en)
Inventor
赵洲 (Zhou Zhao)
何金铮 (Jinzheng He)
金韦克 (Weike Jin)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010795917.0A priority Critical patent/CN111652202B/en
Publication of CN111652202A publication Critical patent/CN111652202A/en
Application granted granted Critical
Publication of CN111652202B publication Critical patent/CN111652202B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The invention discloses a method and a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model, and belongs to the field of video question-answer text generation. First, for a training set of videos, questions and answers, an object detector extracts target-level information from each video frame. Second, an adaptive spatio-temporal graph model learns dynamic representations of the targets from this target-level information. Finally, a Transformer model learns the relations between the visual and textual information, enhancing visual question-answering performance. Compared with common video question-answering solutions, the invention uses the adaptive spatio-temporal graph model to better capture the spatio-temporal dynamics of targets, links the same objects across different video frames to better capture dynamic information, and improves the video-language model by pre-training on picture-language data, improving the effect on video question answering.

Description

Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model
Technical Field
The invention relates to the field of video question-answer text generation, and in particular to a method and a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model.
Background
A hot spot in visual-language research is the understanding of visual content, language semantics and their interrelationships, and video question answering is one of its typical tasks. Recently, several BERT-style visual-language pre-training methods have been proposed and shown to be effective on various tasks. In this work, the invention likewise addresses video question answering with a visual-language Transformer.
In existing technical solutions, for example, ViLBERT and LXMERT are pre-trained with masking based on intra-modality or cross-modality relations, a training method very similar to BERT's. However, labeled video data is very scarce while pre-training requires large amounts of data, so these methods are not ideal. To address the data problem, approaches such as VisualBERT and CBT attempt self-supervised pre-training on large amounts of unlabeled data from video websites. However, because the visual features of videos are more dynamic and diverse, not enough structured information is available, so the pre-training effect is unsatisfactory; moreover, such pre-training requires massive computational resources, which is difficult with only a few GPUs.
In addition, prior-art models usually attend only to spatial modeling or only to temporal modeling; lacking spatio-temporal relations, their modeling is insufficient. Temporal modeling also usually considers only the relation between the first and last frames, so it performs poorly on long videos.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides a method and a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model.
The method of the invention comprises the following steps.
1) For a given video, extract the target-level features in each video frame using object detection, and combine the target-level features to obtain the initial region features for the video frames.
2) Construct a spatio-temporal graph model consisting of multiple spatio-temporal graph layers, each layer comprising a spatial graph model and a temporal graph model; the spatial graph model performs a spatial update on the region features.
Construct an anchor tube for each target region in the video frames and update the anchor tubes frame by frame; the targets in each anchor tube, arranged in temporal order, form a space-time tube. The targets in the space-time tubes form a temporal graph, which performs a temporal update on the spatially updated region features.
Take the initial region features obtained in step 1) as the input of the first spatio-temporal graph layer and the temporally updated region features output by each layer as the input of the next layer, forming a spatio-temporal graph model composed of multiple layers; the output of the last layer serves as the final output of the spatio-temporal graph model and, after temporal GRU encoding, yields the video tube-level representation.
3) Construct a video-language Transformer model comprising the spatio-temporal graph model of step 2) and a Transformer model; take the question sentence and the video tube-level representation output in step 2) as the inputs of the Transformer model, and train the video-language Transformer model against the standard answers to the questions.
4) For a question sentence to be processed, directly obtain the answer with the trained video-language Transformer model.
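To make the flow of steps 1)-4) concrete, the following is a minimal runnable sketch of the inference pipeline. The dummy detector, the same-index tube matching (in place of the anchor-tube procedure detailed later), the dimensions and all names are illustrative assumptions, not the patent's reference implementation.

```python
# Minimal sketch of the inference pipeline of steps 1)-4), with dummy stand-ins
# for the detector and the learned modules.
import torch
import torch.nn as nn

T, N, D = 8, 5, 256              # frames, objects per frame, feature size (assumed)

def detect_objects(frames):      # step 1): stand-in for a real object detector
    return torch.randn(T, N, D)  # initial region features f[t, i]

class SpatioTemporalGraphLayer(nn.Module):  # one layer of step 2)
    def __init__(self, d):
        super().__init__()
        self.w_s = nn.Linear(d, d, bias=False)  # spatial update weights
        self.w_t = nn.Linear(d, d, bias=False)  # temporal update weights
    def forward(self, p):                       # p: (T, N, D)
        g_s = torch.softmax(p @ p.transpose(1, 2), dim=-1)  # per-frame similarity
        v_s = self.w_s(g_s @ p)                             # V_s = G_s P_s W_s
        tubes = v_s.transpose(0, 1)     # naive tubes: same index across frames
        g_t = torch.softmax(tubes @ tubes.transpose(1, 2), dim=-1)
        v_t = self.w_t(g_t @ tubes)                         # V_t = G_t P_t W_t
        return v_t.transpose(0, 1)

frames = None                    # a real system would decode the video here
p = detect_objects(frames)
for layer in [SpatioTemporalGraphLayer(D) for _ in range(2)]:  # M = 2 layers
    p = layer(p)
gru = nn.GRU(D, D, batch_first=True)
_, tube_repr = gru(p.transpose(0, 1))   # tube-level representation per tube
# steps 3)/4): tube_repr and the tokenized question would feed a Transformer
# that outputs the answer; omitted here.
print(tube_repr.shape)                  # (1, N, D)
```

A real system would replace detect_objects with an actual object detector and the naive same-index matching with the anchor-tube procedure described below.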
It is another object of the present invention to provide a system for implementing the above method.
The system specifically comprises the following modules:
A target detection module: detects the target-level features in each video frame, marks out candidate boxes, and outputs target position features, target label features and region geometric features.
An initial region feature combination module: combines the target-level features output by the target detection module to obtain the initial region features.
A multi-layer spatio-temporal graph module: configures the spatio-temporal graph model, which consists of multiple spatio-temporal graph layers; each layer has an input interface and an output interface, the output interface of each layer is connected to the input interface of the next layer, the input interface of the first layer reads the initial region features output by the initial region feature combination module, and each layer performs one temporal update.
A GRU encoding module: encodes, in temporal order, the output of the multi-layer spatio-temporal graph module to obtain the video tube-level representation.
A Transformer module: configures a Transformer model whose two input ports respectively read the question sequence and the video tube-level representation output by the GRU encoding module, and outputs the answer to the question.
A parameter updating module: updates the parameters of the multi-layer spatio-temporal graph module and the Transformer module during the training phase.
Compared with the prior art, the invention has the following beneficial effects.
(1) The invention utilizes an image-language pre-trained Transformer module to help video-language modeling. Image-language data and pre-trained models are abundant, whereas labeled, structured video-language data is very scarce and pre-training consumes very large computational resources; the invention thus overcomes the unsatisfactory pre-training effect and extreme resource consumption of the traditional approach of pre-training on video-language data and achieves a large improvement in effect.
(2) The invention models the spatio-temporal relations of targets with the adaptive spatio-temporal graph model, whereas traditional methods usually attend to spatial or temporal modeling alone and do not integrate the two.
(3) When updating anchor tubes and building space-time tubes, traditional methods usually focus mainly on the first and last frames and neglect the intermediate-frame information, which strongly affects longer videos. The invention adopts a frame-by-frame updating method and, by means such as a designed threshold, successfully models the target information of the intermediate frames, ensuring its effectiveness on long videos.
Drawings
FIG. 1 is a schematic overall flow chart of the method of the invention for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in FIG. 1, the method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model mainly includes the following steps.
1) For the input video, extract the target-level features of each video frame using object detection and derive the initial region features.
2) Update the initial region features with the adaptive spatio-temporal graph model to learn dynamic target features, obtaining the video tube-level features.
3) Construct a video-language Transformer model comprising the spatio-temporal graph model of step 2) and a Transformer model; take the question sentence and the video tube-level representation output in step 2) as the inputs of the Transformer model, tokenizing the question to be answered before inputting it, and train the video-language Transformer model against the standard answers to the questions.
4) For a question sentence to be processed, directly obtain the answer with the trained video-language Transformer model.
In one embodiment of the present invention, a video-language Transformer model is constructed in which the Transformer model is pre-trained on a picture-language dataset. The output of the Transformer model is passed through a multi-layer perceptron and compared against the standard answers to train the video-language Transformer model.
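A minimal sketch of this training step, assuming a Transformer encoder over the concatenated question tokens and tube representations followed by a multi-layer perceptron scored against the standard answer with cross-entropy; all dimensions, the pooling choice and the answer-vocabulary treatment are assumptions rather than the patent's exact configuration.

```python
# Sketch: joint question/tube encoding, MLP head, cross-entropy vs. the answer.
import torch
import torch.nn as nn

D, N_TOKENS, N_TUBES, N_ANSWERS = 256, 12, 5, 1000

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4)                          # stands in for the pre-trained Transformer
mlp = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, N_ANSWERS))

question_emb = torch.randn(1, N_TOKENS, D)     # embedded question tokens
tube_repr = torch.randn(1, N_TUBES, D)         # video tube-level representations
joint = torch.cat([question_emb, tube_repr], dim=1)   # one joint input sequence

logits = mlp(encoder(joint).mean(dim=1))       # pool, then MLP over the answer set
loss = nn.functional.cross_entropy(logits, torch.tensor([42]))  # standard answer id
loss.backward()                                # gradients for the training phase
```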
In one specific implementation of the present invention, the initial region features obtained in step 1) are produced by the target detector and the feature combination module. First, the target detector generates the following features: the object location features R^t = {r_i^t} ∈ R^(N_t × d), the object label features O^t = {o_i^t} ∈ R^(N_t × c), and the region geometric features G^t = {g_i^t} ∈ R^(N_t × l), where d denotes the feature size, c the number of object classes, and l the dimension of the geometric information. These features are combined by the formula

f_i^t = W_1 [W_2 r_i^t ; W_3 o_i^t ; W_4 g_i^t]

where W_1, W_2, W_3, W_4 are all weight matrices and [ ; ] denotes concatenation. N_t denotes the number of objects in the t-th frame; the subscript i indexes the objects and the superscript t indexes the frame in the video. Here r_i^t is the location feature of the i-th target in frame t, o_i^t is its label feature, g_i^t is its geometric feature, and f_i^t is the combined feature of the i-th target of frame t, i.e., the initial region feature.
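Under the combination formula reconstructed above, a minimal sketch of the feature fusion might look as follows; the concatenation-then-projection form, the dimensions and the module names are assumptions for illustration.

```python
# Sketch: project position, label and geometric features, then fuse them
# into one initial region feature per detected object.
import torch
import torch.nn as nn

d, c, l = 2048, 1600, 4                    # feature / class / geometry dims (assumed)
w1 = nn.Linear(3 * 512, 512, bias=False)   # W_1: fuses the projected parts
w2 = nn.Linear(d, 512, bias=False)         # W_2: projects location feature r
w3 = nn.Linear(c, 512, bias=False)         # W_3: projects label feature o
w4 = nn.Linear(l, 512, bias=False)         # W_4: projects geometric feature g

def combine(r, o, g):
    # f_i^t = W_1 [ W_2 r_i^t ; W_3 o_i^t ; W_4 g_i^t ]
    return w1(torch.cat([w2(r), w3(o), w4(g)], dim=-1))

n_t = 5                                    # objects detected in frame t
f = combine(torch.randn(n_t, d), torch.randn(n_t, c), torch.randn(n_t, l))
print(f.shape)                             # (5, 512) initial region features
```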
In one embodiment of the present invention, the video tube-level representation is obtained as follows:
2.1) Construct a spatial graph model, compute the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially update the region features.
2.2) Construct anchor tubes and space-time tubes:
The invention links the same object across different video frames; this structure is called a space-time tube. The invention constructs space-time tubes using anchor tubes, where an anchor tube is the sequence formed by the same target across different video frames, and an anchor tube set is a set of multiple different anchor tubes.
Extract the target regions in the first video frame to initialize the anchor tube set; compute similarity scores between the targets in the current video frame and the targets in the anchor tubes; if a similarity score exceeds a threshold, add the target in the current video frame to the corresponding anchor tube, otherwise add it to the anchor tube set as a new anchor tube; arrange the target sequence of each anchor tube in temporal order to form a space-time tube.
2.3) Construct the temporal graph model from the updated space-time tubes, building a temporal graph from the targets in each space-time tube.
2.4) Construct a graph convolutional network on the temporal graph obtained in step 2.3), and temporally update the spatially updated region features according to the spatially updated region features of step 2.1) and the similarity scores obtained in step 2.2).
2.5) Steps 2.1) to 2.4) form one spatio-temporal graph layer. Repeat them, taking the temporally updated region features output by each layer as the input of the next layer, to form a spatio-temporal graph model composed of multiple layers; the output of the last layer serves as the final output of the spatio-temporal graph model and, after temporal GRU encoding (see the sketch below), yields the video tube-level representation. The whole process 2.1)-2.4) is repeated M times, where M is the number of layers of the adaptive spatio-temporal graph model; M is preferably 2.
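A minimal sketch of the temporal GRU encoding of step 2.5), assuming each space-time tube is encoded independently and its last hidden state is kept as the tube-level representation; the tube lengths and the hidden size are illustrative.

```python
# Sketch: run each time-ordered tube through a GRU; keep the last hidden state.
import torch
import torch.nn as nn

D = 256
gru = nn.GRU(input_size=D, hidden_size=D, batch_first=True)

# three space-time tubes of different lengths (a tube need not span every frame)
tubes = [torch.randn(1, t_len, D) for t_len in (8, 5, 3)]
tube_reprs = torch.cat([gru(tube)[1].squeeze(0) for tube in tubes])
print(tube_reprs.shape)    # (3, D): one video tube-level vector per tube
```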
The similarity in step 2.1) is the similarity between different regions within the same video frame, computed with trainable matrices; the similarity score in step 2.2) is the similarity between regions in different video frames, obtained by averaging the visual-feature similarity, the graph-updated-feature similarity and the spatial-geometric-feature similarity.
Specifically, the method comprises the following steps:
for each sampled video frame, to obtain the association between each two objects, the present invention constructs a spatial map in which highly correlated instances are given a higher confidence value. To achieve this, the invention adopts
Figure GDA0002725040150000051
The similarity between each two regions is described. Where W ', W' is a trainable matrix, the indices i and j denote the ith, jth object, and the superscript t denotes the tth frame. The result of the similarity calculation constitutes a similarity matrix Gs,GsRegularization is performed by the softmax function. Then, the invention adopts a graph convolution neural network on the space graph and utilizes the space updating formula Vs=GsPsWsTo update the target feature. Wherein, VsFor spatially updated regional features, GsWhich is a similarity matrix, can be viewed as an adjacency matrix. PsAdopting the initial region characteristics in an initial state for the region characteristics output by the previous layer of space-time diagram; wsIs a learnable weight matrix.
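A minimal sketch of this spatial update for a single frame, under the bilinear-similarity reading above; sizes and initialization are assumptions.

```python
# Sketch: pairwise bilinear scores -> softmax G_s -> V_s = G_s P_s W_s.
import torch
import torch.nn as nn

n, d = 5, 256                              # objects in the frame, feature size
w_prime  = nn.Linear(d, d, bias=False)     # W'
w_dprime = nn.Linear(d, d, bias=False)     # W''
w_s      = nn.Linear(d, d, bias=False)     # W_s

p_s = torch.randn(n, d)                            # region features P_s
scores = w_prime(p_s) @ w_dprime(p_s).T            # s_ij = (W' p_i)^T (W'' p_j)
g_s = torch.softmax(scores, dim=-1)                # adjacency-like matrix G_s
v_s = w_s(g_s @ p_s)                               # spatial update V_s = G_s P_s W_s
print(v_s.shape)                                   # (5, 256)
```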
The anchor tube set is initialized with the target regions of the first frame and is then dynamically updated according to the objects within the video. To achieve this, the invention calculates the similarity of two target regions in different frames using the following formula.
sim(i,j) = (S_a(i,j) + S_r(i,j) + S_g(i,j)) / 3
where S_a(i,j) denotes the similarity of visual appearance, S_r(i,j) the similarity of the two regions in terms of their graph-updated features, and S_g(i,j) the similarity of the spatial geometry of the two target regions. Using these similarity scores, a space-time tube is constructed for each target: starting from the second frame, each region is compared with the regions inside the anchor tubes and a similarity score is calculated; a threshold is set, and if the score exceeds it, the current region is added to the anchor tube.
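A minimal sketch of the frame-by-frame anchor-tube update; cosine similarity stands in for the averaged score sim(i,j), and the comparison against a tube's most recent region and the threshold value are assumptions.

```python
# Sketch: extend the best-matching anchor tube, or start a new one.
import torch

THRESHOLD = 0.5                          # assumed value

def sim(a, b):
    # stand-in: cosine similarity instead of (S_a + S_r + S_g) / 3
    return torch.cosine_similarity(a, b, dim=0).item()

def build_anchor_tubes(frames):          # frames: list of (n_t, d) tensors
    tubes = [[region] for region in frames[0]]     # initialize from frame 1
    for frame in frames[1:]:                       # update frame by frame
        for region in frame:
            scores = [sim(region, tube[-1]) for tube in tubes]
            best = max(range(len(scores)), key=lambda k: scores[k])
            if scores[best] > THRESHOLD:
                tubes[best].append(region)         # same object: extend tube
            else:
                tubes.append([region])             # unseen object: new tube
    return tubes                  # each tube, time-ordered, is a space-time tube

tubes = build_anchor_tubes([torch.randn(3, 16) for _ in range(4)])
print(len(tubes), [len(t) for t in tubes])
```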
S_a(i,j) is calculated as

S_a(i,j) = exp(-L(f_i^t, f_j^t'))

where L denotes the Euclidean distance, f_i^t the region feature of the i-th region of the t-th frame, and f_j^t' the region feature of the j-th region of the t'-th frame.
S_r(i,j) is calculated analogously on the graph-updated features as

S_r(i,j) = exp(-L(f̂_i^t, f̂_j^t'))

where f̂_i^t is the feature of the i-th target of the t-th frame after graph convolution updating, and f̂_j^t' the updated feature of the j-th region of the t'-th frame.
S_g(i,j) is calculated as

S_g(i,j) = (S_iou(i,j) + S_area(i,j)) / 2
S_iou(i,j) = Ar(area_i^t ∩ area_j^t') / Ar(area_i^t ∪ area_j^t')
S_area(i,j) = min(Ar(area_i^t), Ar(area_j^t')) / max(Ar(area_i^t), Ar(area_j^t'))

where area denotes the spatial extent of a target region and Ar denotes the area of a target region.
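A minimal sketch of the geometric term under the reconstruction above; the corner-coordinate box format (x1, y1, x2, y2) is an assumption.

```python
# Sketch: S_g averages an IoU score and a min/max area-ratio score of two boxes.
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def s_iou(b1, b2):
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = area((ix1, iy1, ix2, iy2)) if ix1 < ix2 and iy1 < iy2 else 0.0
    return inter / (area(b1) + area(b2) - inter)   # intersection over union

def s_area(b1, b2):
    a1, a2 = area(b1), area(b2)
    return min(a1, a2) / max(a1, a2)               # ratio of smaller to larger area

def s_g(b1, b2):
    return (s_iou(b1, b2) + s_area(b1, b2)) / 2    # S_g = (S_iou + S_area) / 2

print(s_g((0, 0, 10, 10), (5, 5, 15, 15)))         # equal-area boxes, partial overlap
```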
In the temporal update process, the regions in each space-time tube are arranged in temporal order and connected into an undirected graph, forming the temporal graph. A graph convolutional network is constructed on the temporal graph, with the similarity scores obtained in step 2.2) serving as the adjacency matrix G_t. The graph convolutional network performs the temporal update through the formula

V_t = G_t P_t W_t

where V_t is the region features after the temporal update, G_t is the similarity-score matrix, P_t is the spatially updated region features of the current spatio-temporal graph layer, and W_t is a trainable matrix.
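A minimal sketch of this temporal update for one space-time tube; the row-softmax normalization of G_t and all sizes are assumptions.

```python
# Sketch: similarity scores as adjacency G_t, then V_t = G_t P_t W_t.
import torch
import torch.nn as nn

t_len, d = 6, 256                      # regions in the tube, feature size
w_t = nn.Linear(d, d, bias=False)      # trainable matrix W_t

p_t = torch.randn(t_len, d)            # spatially updated features P_t
scores = torch.randn(t_len, t_len)     # sim(i,j) between tube regions (step 2.2)
g_t = torch.softmax(scores, dim=-1)    # adjacency matrix G_t
v_t = w_t(g_t @ p_t)                   # temporal update V_t = G_t P_t W_t
print(v_t.shape)                       # (6, 256)
```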
In another embodiment of the present invention, a system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model is presented.
The system comprises:
a target detection module: the method is used for detecting target level features in each video frame, marking out candidate frames and outputting target position features, target label features and region geometric features.
Initial region feature combination module: the system comprises a target detection module, a target level detection module and a target level detection module, wherein the target level detection module is used for detecting target level characteristics of a target; the combination mode can be set according to needs, and the preferable combination mode is as follows:
Figure GDA0002725040150000071
wherein
Figure GDA0002725040150000072
Showing the initial regional characteristics of the ith target of the combined tth frame,
Figure GDA0002725040150000073
respectively representing the target position feature, the target label feature and the region geometric feature output by the target detection module.
A multi-level space-time diagram module: and configuring a space-time diagram model, wherein the space-time diagram model consists of multiple layers of space-time diagrams, each layer of space-time diagram comprises an input interface and an output interface, the output interface of the previous layer of space-time diagram is connected with the input interface of the next layer of space-time diagram, the input interface of the first layer of space-time diagram is used for reading the initial region characteristics output by the initial region characteristic combination module, and each layer of space-time diagram is subjected to timing diagram updating processing once.
GRU coding module: and the method is used for coding the timing sequence outputted by the multi-level space-time diagram module to obtain the representation of the video tube level.
A Transformer module: and a Transformer model is configured, and two input ports of the Transformer model respectively read the question sequence and the representation of the video tube level output by the GRU coding module and output the answer of the question to be solved.
A parameter updating module: and the method is used for updating the parameters of the multi-level space-time diagram module and the Transformer module in the training phase.
The multi-layer spatio-temporal graph module further comprises:
A spatial graph model: computes the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially updates the region features.
An anchor tube storage module: stores the anchor tube corresponding to each target in the video frames.
An anchor tube updating module: computes similarity scores between the targets in the current video frame and the targets in the anchor tube storage module; specifically, if the similarity score exceeds the threshold, the target in the current video frame is added to the corresponding anchor tube, otherwise it is added to the anchor tube set as a new anchor tube.
A space-time tube storage module: arranges the target sequence of each anchor tube in temporal order and can output a temporal graph on instruction; in particular, only the last spatio-temporal graph layer in the spatio-temporal graph model needs to output the final temporal graph on instruction.
The anchor tube storage module, the anchor tube updating module and the space-time tube storage module are integrated into a unified temporal graph model.
The similarity calculator in the spatial graph model computes the similarity between different regions within the same video frame using trainable matrices; the weight matrices in the calculation formula may be initialized with arbitrary values and learn their optimal values during training.
The similarity score calculator in the temporal graph model computes the similarity between regions in different video frames; it can be obtained by averaging the visual-feature similarity, the graph-updated-feature similarity and the spatial-geometric-feature similarity, or by other weightings.
In the specific embodiments provided in this application, it should be understood that the above system embodiments are merely illustrative. For example, the multi-layer spatio-temporal graph module is a logical functional division; other divisions are possible in actual implementations, e.g., multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, electrical or otherwise.
Examples
In order to further demonstrate the implementation effect of the invention, this embodiment tests the overall algorithm of the adaptive spatio-temporal graph model on multiple training sets.
Table 1: algorithm of adaptive space-time diagram model entirety
Figure GDA0002725040150000081
For image-language pre-training, the model of the invention is pre-trained on the Conceptual Captions dataset, which has approximately 3 million image-caption pairs, an order of magnitude larger than the COCO captioning dataset, and covers a broader range of pictures and descriptive text; it is therefore well suited to the image-language pre-training task. For the video-language task, the invention adopts two widely used video question-answering datasets, MSVD-QA and MSRVTT-QA, derived from the video captioning datasets MSVD and MSRVTT respectively. The MSVD-QA dataset consists of 1,970 video clips and 50,505 question-answer pairs. The MSRVTT-QA dataset is much larger, consisting of 10k video clips and 243k question-answer pairs. As shown in Tables 2-3, both datasets contain five general question types: what, who, how, when and where, of which the "what" and "who" types account for the major proportion.
In order to objectively evaluate the performance of the algorithm, the invention provides two versions, a basic version and a full version. The basic version is trained only on the video question-answering dataset, while the full version is additionally pre-trained on the picture-language dataset. The results on the MSVD-QA and MSRVTT-QA datasets are presented below.
Table 2. Results of the invention on the MSVD-QA test set

Question type    What    Who     How     When    Where   All
Basic version    22.8    54.0    74.3    72.4    53.6    35.4
Full version     24.6    53.6    75.7    70.7    53.6    36.3

Table 3. Results of the invention on the MSRVTT-QA test set

Question type    What    Who     How     When    Where   All
Basic version    30.0    46.8    82.7    77.3    36.0    36.3
Full version     31.1    48.5    81.2    77.7    34.0    37.6
All figures denote answer accuracy (%). The full version improves over the basic version, indicating that pre-training on the picture-language dataset is very effective, and good results are achieved on all five question types: "what", "who", "how", "when" and "where".
The foregoing merely illustrates specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can derive or conceive from the disclosure of the present invention shall be considered within the scope of the invention.

Claims (7)

1. A method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model, characterized by comprising the following steps:
1) for a given video, extracting the target-level features in each video frame using object detection, and combining the target-level features to obtain the initial region features for the video frames;
2) constructing a spatio-temporal graph model consisting of multiple spatio-temporal graph layers, each layer comprising a spatial graph model and a temporal graph model, the spatial graph model performing a spatial update on the region features;
constructing an anchor tube for each target region in the video frames and updating the anchor tubes frame by frame, the targets in each anchor tube being arranged in temporal order to form a space-time tube; the targets in the space-time tubes form a temporal graph, which performs a temporal update on the spatially updated region features;
taking the initial region features obtained in step 1) as the input of the first spatio-temporal graph layer and the temporally updated region features output by each layer as the input of the next layer, forming a spatio-temporal graph model composed of multiple layers; the temporal graph output by the last layer serves as the final output of the spatio-temporal graph model and, after temporal GRU encoding, yields the video tube-level representation;
step 2) is specifically:
2.1) constructing a spatial graph model, computing the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially updating the region features;
2.2) constructing anchor tubes and space-time tubes:
extracting the target regions in the first video frame to initialize the anchor tube set; computing similarity scores between the targets in the current video frame and the targets in the anchor tubes; if a similarity score exceeds a threshold, adding the target in the current video frame to the corresponding anchor tube, otherwise adding it to the anchor tube set as a new anchor tube; arranging the target sequence of each anchor tube in temporal order to form a space-time tube;
2.3) constructing the temporal graph model from the updated space-time tubes, the targets in each space-time tube forming a temporal graph;
2.4) constructing a graph convolutional network on the temporal graph obtained in step 2.3), and temporally updating the spatially updated region features according to the spatially updated region features of step 2.1) and the similarity scores obtained in step 2.2);
2.5) steps 2.1) to 2.4) forming one spatio-temporal graph layer, repeating these steps with the temporally updated region features output by each layer as the input of the next layer to form a spatio-temporal graph model composed of multiple layers; after the last layer is processed, outputting the temporal graph as the final output of the spatio-temporal graph model and, after temporal GRU encoding, obtaining the video tube-level representation;
3) constructing a video-language Transformer model comprising the spatio-temporal graph model of step 2) and a Transformer model, taking a question sentence and the video tube-level representation output in step 2) as the inputs of the Transformer model, and training the video-language Transformer model against the standard answers to the questions; the Transformer model is pre-trained on a picture-language dataset;
4) for a question sentence to be processed, directly obtaining the answer with the trained video-language Transformer model.
2. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that the similarity in step 2.1) is the similarity between different regions within the same video frame, computed with trainable matrices; the similarity score in step 2.2) is the similarity between regions in different video frames, obtained by averaging the visual-feature similarity, the graph-updated-feature similarity and the spatial-geometric-feature similarity.
3. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that the spatial update is calculated by the formula:
V_s = G_s P_s W_s
where V_s is the spatially updated region features; G_s is the similarity matrix; P_s is the region features output by the previous spatio-temporal graph layer, the initial region features being used in the initial state; and W_s is a trainable matrix.
4. The method of claim 1, wherein the temporal update is calculated by the following formula:
V_t = G_t P_t W_t
where V_t is the region features after the temporal update; G_t is the similarity-score matrix; P_t is the spatially updated region features of the current spatio-temporal graph layer; and W_t is a trainable matrix.
5. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that the spatio-temporal graph model consists of 2 spatio-temporal graph layers.
6. The method for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model according to claim 1, characterized in that step 1) is specifically:
for a given video, extracting the target-level features in each video frame using object detection, the target-level features in each video frame comprising a target position feature R, a target label feature O and a region geometric feature G, and combining R, O and G as the initial region features for the video frame.
7. A system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model, for implementing the method of claim 1, the system comprising:
a target detection module: detects the target-level features in each video frame, marks out candidate boxes, and outputs target position features, target label features and region geometric features;
an initial region feature combination module: combines the target-level features output by the target detection module to obtain the initial region features;
a multi-layer spatio-temporal graph module: configures the spatio-temporal graph model, which consists of multiple spatio-temporal graph layers; each layer has an input interface and an output interface, the output interface of each layer is connected to the input interface of the next layer, the input interface of the first layer reads the initial region features output by the initial region feature combination module, and each layer performs one temporal update;
a GRU encoding module: encodes, in temporal order, the output of the multi-layer spatio-temporal graph module to obtain the video tube-level representation;
a Transformer module: configures a Transformer model whose two input ports respectively read the question sequence and the video tube-level representation output by the GRU encoding module, and outputs the answer to the question;
a parameter updating module: updates the parameters of the multi-layer spatio-temporal graph module and the Transformer module during the training phase;
the multi-layer spatio-temporal graph module comprising:
a spatial graph model: computes the pairwise similarity between regions in each video frame to obtain a similarity matrix for each video frame, and spatially updates the region features;
an anchor tube storage module: stores the anchor tube corresponding to each target in the video frames;
an anchor tube updating module: computes similarity scores between the targets in the current video frame and the targets in the anchor tube storage module, adds a target to the corresponding anchor tube if its similarity score exceeds the threshold, and otherwise adds it to the anchor tube set as a new anchor tube;
a space-time tube storage module: arranges the target sequence of each anchor tube in temporal order and outputs a temporal graph on instruction;
the anchor tube storage module, the anchor tube updating module and the space-time tube storage module being integrated into a unified temporal graph model.
CN202010795917.0A 2020-08-10 2020-08-10 Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model Active CN111652202B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010795917.0A CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010795917.0A CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model

Publications (2)

Publication Number Publication Date
CN111652202A CN111652202A (en) 2020-09-11
CN111652202B (en) 2020-12-01

Family

ID=72350273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010795917.0A Active CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model

Country Status (1)

Country Link
CN (1) CN111652202B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559698B (en) * 2020-11-02 2022-12-09 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN112287122A (en) * 2020-11-11 2021-01-29 济南浪潮高新科技投资发展有限公司 Multi-mode-based cross-media knowledge extraction method
CN112465008B (en) * 2020-11-25 2021-09-24 电子科技大学 Voice and visual relevance enhancement method based on self-supervision course learning
CN112488055B (en) * 2020-12-18 2022-09-06 贵州大学 Video question-answering method based on progressive graph attention network
CN112860847B (en) * 2021-01-19 2022-08-19 中国科学院自动化研究所 Video question-answer interaction method and system
CN112733789B (en) * 2021-01-20 2023-04-18 清华大学 Video reasoning method, device, equipment and medium based on dynamic space-time diagram
CN114120045B (en) * 2022-01-25 2022-05-31 北京猫猫狗狗科技有限公司 Target detection method and device based on multi-gate control hybrid expert model
CN114707022B (en) * 2022-05-31 2022-09-06 浙江大学 Video question-answer data set labeling method and device, storage medium and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 (Zhejiang University) A method for solving open-ended long-video question-answering tasks using a hierarchical convolutional self-attention network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019749B (en) * 2018-09-28 2021-06-15 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.) Method, apparatus, device and computer readable medium for generating VQA training data

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377711A (en) * 2019-07-01 2019-10-25 浙江大学 (Zhejiang University) A method for solving open-ended long-video question-answering tasks using a hierarchical convolutional self-attention network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Graph WaveNet for Deep Spatial-Temporal Graph Modeling; Zonghan Wu et al.; arxiv.org; 2019-05-31; pp. 1-7 *
Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning; Junchao Zhang et al.; arxiv.org; 2019-06-30; pp. 1-10 *
Video Question Answering Based on Spatio-Temporal Attention Networks (基于时空注意力网络的视频问答); Yang Qifan (杨启凡); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2019-01-31; pp. 1-40 *

Also Published As

Publication number Publication date
CN111652202A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111652202B (en) Method and system for solving video question answering by enhancing video-language representation learning with an adaptive spatio-temporal graph model
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN108664589B (en) Text information extraction method, device, system and medium based on domain self-adaptation
CN108829756B (en) Method for solving multi-turn video question and answer by using hierarchical attention context network
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN104463101A (en) Answer recognition method and system for textual test question
Zhou et al. SSDA-YOLO: Semi-supervised domain adaptive YOLO for cross-domain object detection
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112949929B (en) Knowledge tracking method and system based on collaborative embedded enhanced topic representation
CN116091886A (en) Semi-supervised target detection method and system based on teacher student model and strong and weak branches
CN112861718A (en) Lightweight feature fusion crowd counting method and system
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN113807214B (en) Small target face recognition method based on deit affiliated network knowledge distillation
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN110135435B (en) Saliency detection method and device based on breadth learning system
CN116597136A (en) Semi-supervised remote sensing image semantic segmentation method and system
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
CN111144407A (en) Target detection method, system, device and readable storage medium
CN112926323B (en) Chinese named entity recognition method based on multistage residual convolution and attention mechanism
CN114169408A (en) Emotion classification method based on multi-mode attention mechanism
CN112860847A (en) Video question-answer interaction method and system
CN115797952B (en) Deep learning-based handwriting English line recognition method and system
CN117373111A (en) AutoHOINet-based human-object interaction detection method
CN116543250A (en) Model compression method based on class attention transmission

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant