CN111414845A - Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network - Google Patents

Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network

Info

Publication number
CN111414845A
CN111414845A (application CN202010191264.5A; granted as CN111414845B)
Authority
CN
China
Prior art keywords
video
frame
region
time
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010191264.5A
Other languages
Chinese (zh)
Other versions
CN111414845B (en)
Inventor
赵洲
张品涵
张竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010191264.5A priority Critical patent/CN111414845B/en
Publication of CN111414845A publication Critical patent/CN111414845A/en
Application granted granted Critical
Publication of CN111414845B publication Critical patent/CN111414845B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Operations Research (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for solving the polymorphic sentence video positioning task with a space-time graph reasoning network, belonging to the field of natural language visual positioning. The invention first analyzes the video into a space-time region graph, which contains not only implicit and explicit spatial sub-graphs within each frame but also a cross-frame temporal dynamic sub-graph. Next, text cues are added to the spatio-temporal region graph and multi-step cross-modal graph reasoning is established; the multi-step process supports multi-order relation modeling. Thereafter, the temporal boundaries of the pipeline are determined using a temporal locator, and an object is then located in each frame using a spatial locator with a dynamic selection method, resulting in a smooth pipeline. The invention does not require the video to be trimmed when grounding natural language, which reduces the cost of video positioning; it can effectively process question sentences and statement sentences, provides technical support for research combining higher-level natural language processing and computer vision (such as video question answering), and has broad application prospects.

Description

Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network
Technical Field
The invention relates to the field of natural language visual positioning, in particular to a method for solving a polymorphic sentence video positioning task by using a space-time graph reasoning network.
Background
Visual localization of natural language is a fundamental and crucial task in the field of visual understanding. The goal of this task is to locate, from a temporal and spatial perspective, the object described by a given natural language expression in the visual content. In recent years, researchers have begun to focus on natural language (sentence) localization in video, including temporal localization and spatio-temporal localization. Temporal localization obtains the time segment in which the object appears in the video; spatio-temporal localization further obtains, on the basis of temporal localization, the regions where the object appears, and the set of regions occupied by such a series of objects is also called a spatio-temporal tube because of its temporal and spatial continuity.
At present, methods for this task are few and have strong limitations. Existing video localization methods usually extract a set of spatio-temporal pipelines from a trimmed video and then identify the target pipeline that matches the sentence. However, this framework cannot accomplish Spatio-Temporal Video Grounding (STVG) for polymorphic sentences. On the one hand, the performance of the framework depends largely on the quality of the candidate pipelines, but it is difficult to generate high-quality pipelines in advance without textual clues, because a sentence may describe only a short-term state of the object within a very small segment, whereas the existing pipeline pre-generation frameworks can only produce complete object pipelines in trimmed videos. On the other hand, these methods only consider single-pipeline modeling and ignore the relationships between objects, so they can only process traditional statement sentences and cannot process question sentences that query unknown objects. However, object relationships are an important clue for the STVG task, especially for question sentences, which may only provide the interactions of an unknown object with other objects: lacking explicit appearance attributes of the object, locating a question sentence can only depend on the relationships (e.g., action relationships and spatial relationships) between the unknown object and other objects, so relationship modeling and cross-modal relationship reasoning are crucial. Therefore, the existing methods cannot handle the STVG task.
In addition, existing visual graph modeling methods often construct a spatial graph within a single image and cannot exploit the temporal dynamic information in a video to distinguish subtle differences between object actions, such as opening a door versus closing a door. Therefore, there is a need for a method that can solve the polymorphic sentence video positioning task: given an untrimmed video and a statement sentence or question sentence describing an object, locate the spatio-temporal pipeline of the queried object.
Disclosure of Invention
Aiming at the defect that the prior art cannot solve the polymorphic sentence video positioning task, the invention provides a method for solving the polymorphic sentence video positioning task with a space-time graph inference network. The invention first analyzes the video into a spatio-temporal region graph that contains not only implicit and explicit spatial subgraphs within each frame but also a cross-frame temporal dynamic subgraph. The spatial subgraphs capture region-level relationships through an implicit or explicit attention mechanism, and the temporal dynamic subgraph takes the dynamics of objects and their transformation across frames into consideration, further improving the network's understanding of the relationships between objects. Next, text cues are added to the space-time region graph and multi-step cross-modal graph reasoning is established; the multi-step process supports multi-order relation modeling. Thereafter, the temporal boundaries of the pipeline are determined with a temporal locator, and an object is then located in each frame with a spatial locator using a dynamic selection method, resulting in a smooth pipeline.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for solving the polymorphic sentence video positioning task by using the space-time graph reasoning network comprises the following steps:
s1: aiming at a section of video, extracting the visual features of each frame in the video by using a Faster-RCNN network to form a visual feature set of video frames; extracting K regions from each video frame to obtain region feature vectors and region bounding-box vectors, forming a frame-level region set of the video;
s2: aiming at a query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally further obtaining the query feature of the query statement by adopting an attention method;
s3: establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing a video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain a cross-modal regional feature; then, performing T-step convolution operation on the space-time region graph through T space-time convolution layers according to the cross-modal region characteristics to finally obtain relationship sensitive region characteristics;
s4: establishing a space-time locator comprising a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; integrating the query features of the query statement and the final frame features through a space locator to obtain a matching score of each region in each video frame;
s5: the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
s6: screening out the frame t and the region i corresponding to the highest matching score obtained in step S5, calculating the link scores between the regions of frame t and frame t+1 by adopting a dynamic selection method, calculating the energy of the pipeline according to the link scores, and obtaining the spatio-temporal pipeline with the maximum energy by utilizing the Viterbi algorithm to complete the video positioning.
Further, the step S1 specifically includes:
For a piece of video, K regions are extracted from each video frame with the pre-trained Faster-RCNN, giving the frame-level region set {r_i^t | i = 1, …, K; t = 1, …, N} of the video. Each region r_i^t has two attributes. One is the region feature vector r_i^t ∈ R^{d_r}, the visual feature vector of the i-th region of the t-th frame in the video, where d_r denotes the dimension of the region feature vectors. The other is the region bounding-box vector b_i^t = (x_i^t, y_i^t, w_i^t, h_i^t), where x_i^t and y_i^t respectively denote the abscissa and the ordinate of the center point of the bounding box of the i-th region of the t-th frame in the video, and w_i^t and h_i^t respectively denote the width and the height of that bounding box.
In addition, the visual features of each frame in the video are extracted by the Faster-RCNN to form the visual feature set {f_t}_{t=1}^N of the video frames, where f_t denotes the visual feature of the t-th frame in the video and N denotes the number of frames of the video.
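Faster-RCNN style detectors typically return corner-format boxes; the following sketch (illustrative only, with assumed names and formats, not part of the patented method) shows one way to convert them into the centre/width/height bounding-box vectors b_i^t described above:

```python
import numpy as np

def to_center_format(boxes_xyxy: np.ndarray) -> np.ndarray:
    """Convert detector boxes from (x1, y1, x2, y2) corners to the
    (x, y, w, h) bounding-box vectors described above: centre coordinates
    plus width and height."""
    x1, y1, x2, y2 = np.split(boxes_xyxy.astype(np.float32), 4, axis=-1)
    w = x2 - x1
    h = y2 - y1
    x = x1 + 0.5 * w
    y = y1 + 0.5 * h
    return np.concatenate([x, y, w, h], axis=-1)

# Example: one detected box for one frame (K such boxes per frame in practice).
boxes = np.array([[10.0, 20.0, 110.0, 220.0]])
print(to_center_format(boxes))  # [[ 60. 120. 100. 200.]]
```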
Further, the step S2 specifically includes:
For a query sentence, a GloVe network is first used to obtain the word embedding vector of each word in the query sentence, and a BiGRU network is then used to obtain the word semantic features of the query sentence, forming the word semantic feature set {s_i}_{i=1}^L, s_i ∈ R^{d_s}, where s_i is the semantic feature of the i-th word, formed by concatenating the forward and backward hidden states of the i-th node of the BiGRU network, L denotes the number of words in the query sentence, and d_s denotes the dimension of the word semantic feature vectors.
From the word semantic feature set, the semantic feature s^e of the query object is selected, and an entity-sensitive feature s^a is obtained with an attention method; the entity-sensitive feature s^a and the semantic feature s^e of the query object together compose the query feature s^q. The normalized attention weights γ_i over the word semantic features are computed from s^e and s_i with two parameter matrices, the entity-sensitive feature s^a is obtained as the γ-weighted aggregation of the word semantic features, and
s^q = [s^e; s^a]
where s^a is the entity-sensitive feature, s^q is the query feature, and γ_i denotes the normalized weights.
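A minimal sketch of the attention pooling described above; the exact weighting formula is not reproduced here, so a standard additive attention form is assumed, and the module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class QueryFeature(nn.Module):
    """Pool word features {s_i} into an entity-sensitive feature s^a guided by
    the query-object feature s^e, then form s^q = [s^e; s^a].
    An additive attention form is assumed for illustration."""
    def __init__(self, d_s: int):
        super().__init__()
        self.w1 = nn.Linear(d_s, d_s, bias=False)
        self.w2 = nn.Linear(d_s, d_s, bias=False)
        self.v = nn.Linear(d_s, 1, bias=False)

    def forward(self, words: torch.Tensor, s_e: torch.Tensor) -> torch.Tensor:
        # words: (L, d_s) word semantic features; s_e: (d_s,) query-object feature
        scores = self.v(torch.tanh(self.w1(s_e) + self.w2(words))).squeeze(-1)  # (L,)
        gamma = torch.softmax(scores, dim=0)                                    # normalized weights
        s_a = (gamma.unsqueeze(-1) * words).sum(dim=0)                          # entity-sensitive feature
        return torch.cat([s_e, s_a], dim=-1)                                    # s^q = [s^e; s^a]

s_q = QueryFeature(d_s=256)(torch.randn(12, 256), torch.randn(256))
print(s_q.shape)  # torch.Size([512])
```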
Further, the step S3 specifically includes:
A space-time graph encoder is established, comprising a video analysis layer, a cross-modal fusion layer and T spatio-temporal convolution layers. The space-time graph encoder works as follows:
3.1) The video is analyzed into a space-time region graph through the video analysis layer. The space-time region graph comprises three sub-graphs: an implicit spatial sub-graph G^imp = (V, E^imp) within each frame, an explicit spatial sub-graph G^exp = (V, E^exp) within each frame, and a cross-frame temporal dynamic sub-graph G^tem = (V, E^tem). Here V is the vertex set of each sub-graph: all three sub-graphs take the regions of the corresponding video frames as their vertices. E^imp, E^exp and E^tem respectively denote the edges of the implicit spatial sub-graph, the explicit spatial sub-graph and the temporal dynamic sub-graph.
3.2) The region feature vectors obtained in step S1 and the word semantic features obtained in step S2 are fused through the cross-modal fusion layer to obtain the cross-modal region features, as follows:
For the i-th region r_i^t of the t-th frame in the video, the attention weights over the word semantic features s_j are computed from the similarity between r_i^t and s_j (using two parameter matrices, a bias b_m and a parameter row vector w^T), and the region-sensitive text feature of the i-th region of the t-th frame is obtained as the attention-weighted sum of the word semantic features.
A text gate guided by the language information is then built from the region-sensitive text feature to attenuate text-irrelevant regions: the text gate of region r_i^t is obtained by applying a linear layer followed by the sigmoid function σ to the region-sensitive text feature, and it lies in R^{d_r}, where d_r denotes the dimension of the region feature vectors.
The gated region feature, i.e. the element-wise product ⊙ of the text gate and the region feature r_i^t, is concatenated with the region-sensitive text feature to obtain the cross-modal region feature of the i-th region of the t-th frame in the video.
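A hedged sketch of this cross-modal fusion step: attention over the words to get a region-sensitive text feature, a sigmoid text gate, and concatenation of the gated region feature with the text feature. The attention form and layer names are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse one region feature with the word features: attend over words to get a
    region-sensitive text feature q, gate the region feature with sigmoid(W q + b),
    and concatenate. The exact attention form and the layer names are assumptions."""
    def __init__(self, d_r: int, d_s: int):
        super().__init__()
        self.wm1 = nn.Linear(d_r, d_s, bias=False)
        self.wm2 = nn.Linear(d_s, d_s, bias=True)
        self.score = nn.Linear(d_s, 1, bias=False)
        self.gate = nn.Linear(d_s, d_r)

    def forward(self, region: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # region: (d_r,) feature of one region; words: (L, d_s) word semantic features
        sim = self.score(torch.tanh(self.wm1(region) + self.wm2(words))).squeeze(-1)  # similarity to each word
        attn = torch.softmax(sim, dim=0)
        q = (attn.unsqueeze(-1) * words).sum(dim=0)      # region-sensitive text feature
        g = torch.sigmoid(self.gate(q))                  # text gate in R^{d_r}
        return torch.cat([g * region, q], dim=-1)        # cross-modal region feature

x = CrossModalFusion(d_r=256, d_s=256)(torch.randn(256), torch.randn(12, 256))
print(x.shape)  # torch.Size([512])
```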
3.3) Each spatio-temporal convolution layer comprises a spatial graph convolution layer and a temporal graph convolution layer. The spatial graph convolution layer is used to obtain the visual relationships between the regions of each frame, as follows:
For the cross-modal region features, implicit graph convolution is first applied on the implicit spatial sub-graph G^imp: for each region, attention weights over the regions connected to it in G^imp are computed with the parameter matrices W^imp and U^imp, and the output of the implicit spatial graph convolution layer is the weighted aggregation of the features of the connected regions.
Explicit graph convolution is then applied on the explicit spatial sub-graph G^exp: the features of the regions connected to a region in G^exp are transformed with parameter matrices selected by the direction dir(i, j) of each edge (i, j) and biases selected by the label lab(i, j) of the edge, and are aggregated with the relation coefficients
α^exp = Softmax(W_r s^q + b_m)
where dir(i, j) is the direction of edge (i, j), lab(i, j) is the label of edge (i, j), W_r is a parameter matrix, b_m is a bias, α^exp is the relation coefficient vector whose entries correspond to the weights of the 51 labels, and the entry of α^exp selected by the label of edge (i, j) is the relation weight of that edge; the result is the output of the explicit spatial graph convolution layer.
The temporal graph convolution layer is used to obtain the dynamics and transformations of objects across frames, as follows:
Temporal graph convolution is applied on the temporal dynamic sub-graph G^tem: for each region r_i^t, semantic coefficients over its neighbours in G^tem are computed with direction-dependent parameter matrices selected by dir(i, j), and the output of the temporal graph convolution layer is the aggregation of the neighbour features transformed by the parameter matrix U^tem and weighted by these coefficients.
The outputs of the spatial graph convolution layer and the temporal graph convolution layer are combined to give the output of the l-th spatio-temporal convolution layer. Multi-step encoding by the space-time graph encoder with T spatio-temporal convolution layers (the output of the (l−1)-th layer is taken as the input of the l-th layer) yields the final relation-sensitive region features, i.e. the relation-sensitive region feature of each region of each frame in the video.
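As an illustration of what a single implicit spatial graph convolution can look like over the K fully connected regions of one frame, here is a minimal sketch assuming a scaled dot-product form for the attention coefficients (the patent's exact formula is not reproduced):

```python
import torch
import torch.nn as nn

class ImplicitSpatialConv(nn.Module):
    """One implicit spatial graph convolution over the K fully connected regions of
    a frame: pairwise attention coefficients followed by a weighted aggregation of
    neighbour features. The dot-product attention form is an assumption."""
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(d, d, bias=False)   # plays the role of W^imp
        self.u = nn.Linear(d, d, bias=False)   # plays the role of U^imp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (K, d) cross-modal region features of one frame
        a = torch.softmax(self.w(x) @ x.t() / x.shape[-1] ** 0.5, dim=-1)  # (K, K) attention
        return a @ self.u(x)                                               # aggregated neighbour features

out = ImplicitSpatialConv(d=512)(torch.randn(20, 512))
print(out.shape)  # torch.Size([20, 512])
```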
The implicit spatial sub-graph G^imp is constructed by fully connecting the K regions within each video frame, so that it contains K × K undirected, unlabeled edges.
The explicit spatial sub-graph G^exp is constructed as follows:
In each video frame, region triplets <r_i^t, rel, r_j^t> are extracted as the directed, labeled edges from r_i^t to r_j^t, where r_i^t and r_j^t are the i-th and j-th regions of the t-th frame in the video and rel is the relationship predicate between them, i.e. the label of the edge.
Given the feature of region i, the feature of region j and the joint feature of the union region of the two regions (the joint feature of the union region is also obtained by the Faster-RCNN), the three features are input into a classifier pre-trained on the Visual Genome dataset, and the relationship predicate between r_i^t and r_j^t is obtained by prediction.
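A small sketch of how the labeled, directed edges of the explicit spatial sub-graph can be assembled once a relationship classifier is available; `predict_rel` is a hypothetical stand-in for the classifier pre-trained on Visual Genome:

```python
from typing import Callable, List, Tuple
import numpy as np

def build_explicit_edges(
    feats: np.ndarray,        # (K, d_r) region features of one frame
    union_feats: np.ndarray,  # (K, K, d_u) joint features of the union regions
    predict_rel: Callable[[np.ndarray, np.ndarray, np.ndarray], int],
) -> List[Tuple[int, int, int]]:
    """Build the directed, labelled edges of the explicit spatial sub-graph for one
    frame. `predict_rel` stands in for the pre-trained relationship classifier and
    is assumed to return a predicate label (0 meaning "no relation")."""
    edges = []
    K = feats.shape[0]
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            label = predict_rel(feats[i], feats[j], union_feats[i, j])
            if label > 0:
                edges.append((i, j, label))  # edge i -> j with relationship predicate `label`
    return edges

dummy = lambda a, b, u: 1  # dummy classifier for illustration only
print(len(build_explicit_edges(np.zeros((3, 8)), np.zeros((3, 3, 8)), dummy)))  # 6
```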
The temporal dynamic sub-graph G^tem is constructed as follows:
For each video frame, the linking scores between its regions and the regions in the M adjacent forward frames and M adjacent backward frames are calculated. The linking score between the i-th region of the t-th frame and the j-th region of the k-th frame in the video combines the cosine similarity cos(·) of the two region features with the intersection-over-union IoU(·) of the two regions, balanced by the scalar ε.
For r_i^t, the region with the highest linking score in the k-th frame of the video is selected to build an edge, so that each region obtains 2M + 1 edges including a self-loop. The edges E^tem of the temporal dynamic sub-graph are unlabeled and have three directions: forward, backward and self-loop.
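A minimal sketch of this temporal linking step, assuming an additive combination of cosine similarity and IoU (the exact balancing form is an assumption) and centre-format boxes:

```python
import numpy as np

def iou(b1: np.ndarray, b2: np.ndarray) -> float:
    """IoU of two (x, y, w, h) centre-format boxes."""
    x11, y11 = b1[0] - b1[2] / 2, b1[1] - b1[3] / 2
    x12, y12 = b1[0] + b1[2] / 2, b1[1] + b1[3] / 2
    x21, y21 = b2[0] - b2[2] / 2, b2[1] - b2[3] / 2
    x22, y22 = b2[0] + b2[2] / 2, b2[1] + b2[3] / 2
    iw = max(0.0, min(x12, x22) - max(x11, x21))
    ih = max(0.0, min(y12, y22) - max(y11, y21))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def best_link(feat_i, box_i, feats_k, boxes_k, eps=0.8):
    """Pick the region in a neighbouring frame with the highest linking score.
    The additive form cos + eps * IoU is an assumption for illustration."""
    scores = []
    for feat_j, box_j in zip(feats_k, boxes_k):
        cos = float(feat_i @ feat_j / (np.linalg.norm(feat_i) * np.linalg.norm(feat_j) + 1e-8))
        scores.append(cos + eps * iou(box_i, box_j))
    return int(np.argmax(scores)), max(scores)
```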
Further, the step S4 specifically includes:
4.1) The temporal locator is established. The relation-sensitive region features are first aggregated to the frame level with an attention mechanism: for the query feature s^q, the frame-level relation-sensitive feature m_t of the t-th frame in the video is obtained as the attention-weighted sum of the relation-sensitive region features of that frame, where w^T denotes a parameter row vector, W_f denotes a parameter matrix and b_f denotes a bias.
The frame-level relation-sensitive feature set {m_t}_{t=1}^N of the video is concatenated with the corresponding visual feature set {f_t}_{t=1}^N of the video frames, and another BiGRU is used to learn the final frame feature set {h_t}_{t=1}^N.
Next, at each frame t, the multi-scale candidate clips are defined as {(s_t^k, e_t^k)}_{k=1}^P, where s_t^k and e_t^k are the start and end boundaries of the k-th clip of the t-th frame in the video, w_k is the width of the k-th clip, and P is the number of clips. All candidate clips are then estimated through a linear layer with the sigmoid function, and the offsets of the boundaries are produced at the same time, computed as:
C_t = σ(W_c[h_t; s^q] + b_c)
δ_t = W_o[h_t; s^q] + b_o
where C_t ∈ R^P is the vector of confidence scores of the P candidate clips at frame t, δ_t contains the boundary offsets of the P clips, W_c and W_o are parameter matrices, b_c and b_o are biases, and σ(·) is the sigmoid function.
The temporal locator has two losses: the alignment loss of clip selection and the regression loss of boundary adjustment. The alignment loss L_align is computed from the temporal intersection-over-union of each candidate clip with the ground-truth clip at frame t and the corresponding confidence score, i.e. the k-th element of C_t, the confidence score of the k-th candidate clip at frame t.
Next, the boundary (s, e) of the best candidate clip, i.e. the one with the highest score, is fine-tuned with its offsets (δ_s, δ_e): the ground-truth offsets of the clip are first calculated from the ground-truth boundary, and the regression loss L_reg applies the smooth L1 function R to the differences between the predicted offsets and the ground-truth offsets.
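A hedged sketch of the temporal locator heads and losses: the confidence/offset heads follow the two formulas above, while the IoU-supervised binary cross-entropy and smooth-L1-on-offsets forms are assumed concrete instantiations of the alignment and regression losses described in prose:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalHead(nn.Module):
    """Score P candidate clips per frame and regress their boundary offsets from
    [h_t; s^q], following C_t = sigma(W_c[h_t; s^q] + b_c) and
    delta_t = W_o[h_t; s^q] + b_o. The loss forms below are assumptions."""
    def __init__(self, d: int, P: int):
        super().__init__()
        self.conf = nn.Linear(d, P)
        self.offs = nn.Linear(d, 2 * P)

    def forward(self, h_t: torch.Tensor, s_q: torch.Tensor):
        z = torch.cat([h_t, s_q], dim=-1)
        return torch.sigmoid(self.conf(z)), self.offs(z)

head = TemporalHead(d=512 + 512, P=8)
conf, offs = head(torch.randn(512), torch.randn(512))
target_iou = torch.rand(8)                    # temporal IoU of each candidate with the ground truth
align_loss = F.binary_cross_entropy(conf, target_iou)                # assumed alignment loss
reg_loss = F.smooth_l1_loss(offs[:2], torch.tensor([0.5, -0.3]))     # offsets of the best clip only
```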
4.2) The spatial locator is established to locate the target region in each frame. The matching score of the i-th region of the t-th frame in the video is estimated by fusing the relation-sensitive region feature of that region with the query feature s^q and the final frame feature h_t, and applying a linear layer (with parameter matrix W_c and bias b_c) followed by the sigmoid function σ(·).
The spatial loss L_space is computed over S_t, the set of ground-truth frames, from the matching scores and the intersection-over-union between each region r_i^t and the ground-truth region.
The multi-task loss function is
L = λ1 · L_align + λ2 · L_reg + λ3 · L_space
where λ1, λ2 and λ3 are hyper-parameters that control the balance between the three losses.
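A trivial illustration of the multi-task combination, using the λ values given in the embodiment below; the individual loss values are placeholders:

```python
# Hypothetical loss values for illustration; the lambda values follow the embodiment (1.0, 0.001, 1.0).
lam1, lam2, lam3 = 1.0, 0.001, 1.0
align_loss, reg_loss, space_loss = 0.7, 0.2, 0.5
total_loss = lam1 * align_loss + lam2 * reg_loss + lam3 * space_loss
print(total_loss)  # 0.7 + 0.0002 + 0.5 = 1.2002
```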
Further, the step S5 specifically includes:
the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
further, the step S6 specifically includes:
The link score s(·) between a region of frame t and a region of frame t+1 combines the matching scores of the two regions with the intersection-over-union IoU(·) of their bounding boxes, balanced by the scalar θ, which is set to 0.2.
The energy E(Y) of a pipeline Y over the temporal boundary (T_s, T_e) is computed by accumulating the link scores between the regions selected in consecutive frames, and the Viterbi algorithm is directly used to obtain the region set that maximizes E(Y) as the final pipeline Y.
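A minimal dynamic-programming sketch of the Viterbi-style tube linking described above; the additive link-score form (matching scores plus θ-weighted IoU) is an assumption for illustration:

```python
import numpy as np

def link_tube(match_scores: np.ndarray, iou: np.ndarray, theta: float = 0.2):
    """Viterbi-style linking: pick one region per frame so that the summed link
    scores are maximal. match_scores has shape (T, K); iou[t, i, j] is the IoU
    between region i of frame t and region j of frame t+1 (shape (T-1, K, K)).
    The additive link-score form used here is an assumption."""
    T, K = match_scores.shape
    dp = match_scores[0].copy()           # best energy of a tube ending at each region of frame 0
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate energies when linking region i of frame t-1 to region j of frame t
        link = dp[:, None] + match_scores[t][None, :] + theta * iou[t - 1]
        back[t] = link.argmax(axis=0)
        dp = link.max(axis=0)
    # backtrack the region index chosen in each frame
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(dp.max())

path, energy = link_tube(np.random.rand(5, 4), np.random.rand(4, 4, 4))
print(path, energy)
```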
The invention has the following beneficial effects:
(1) The invention does not need to trim the video when grounding natural language, can directly process long videos, and reduces the cost of video positioning;
(2) The space-time region graph obtained in the visual graph modeling process contains not only implicit and explicit spatial subgraphs of each frame but also a cross-frame temporal dynamic subgraph; the spatial subgraphs capture region-level relationships through an implicit or explicit attention mechanism, and the temporal dynamic subgraph takes the dynamics of objects and their transformation across frames into consideration, effectively utilizing the temporal dynamic information in the video to distinguish subtle differences between object actions and further improving the network's understanding of the relationships between objects;
(3) The invention introduces a spatio-temporal locator to retrieve the spatio-temporal pipeline of objects directly from the regions. Specifically, a temporal locator is used to determine the temporal boundary of the pipeline, and a spatial locator with a dynamic selection method is then used to locate the object in each frame and generate a smooth pipeline. Question sentences and statement sentences can both be processed effectively, realizing video positioning for polymorphic sentences; a large number of experiments prove the effectiveness of the method, and it provides technical support for research combining higher-level natural language processing and computer vision (such as video question answering);
(4) The invention has broad application prospects, such as directly searching video content and classifying videos with text.
Drawings
FIG. 1 is a schematic diagram of the STGRN structure of the present invention;
fig. 2 shows the experimental results on the criteria m_tIoU and m_vIoU for statement sentences and question sentences.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, for a section of video and an input sentence, the present invention locates the object described in the sentence in each frame by using the method for solving the polymorphic sentence video positioning task with a space-time graph inference network, and generates a smooth pipeline. The specific steps are as follows:
the method comprises the steps that firstly, visual features of each frame in a video are extracted by utilizing a Faster-RCNN network aiming at a section of video, and a visual feature set of video frames is formed; and extracting K regions from each video frame to obtain region feature vectors and region frame vectors, and forming a region set of a frame level in the video.
And secondly, aiming at the query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then, obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally, further obtaining the query feature of the query statement by adopting an attention method.
Step three, establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing the video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain cross-modal regional features; and performing T-step convolution operation on the space-time region diagram through T space-time convolution layers according to the cross-modal region characteristics to finally obtain the relation sensitive region characteristics.
Step four, establishing a space-time locator which comprises a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; and then integrating the query features of the query statement and the final frame features through a spatial locator to obtain the matching score of each region in each video frame.
Step five, the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; and for the section of video processed in the step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN.
And step six, screening out the frame t and the region i corresponding to the highest matching score obtained in step five, calculating the link scores between the regions of frame t and frame t+1 by adopting a dynamic selection method, calculating the energy of the pipeline according to the link scores, and obtaining the spatio-temporal pipeline with the maximum energy by utilizing the Viterbi algorithm to complete the video positioning.
Examples
The invention establishes a large-scale spatio-temporal video positioning dataset VidSTG by adding sentence annotations on VidOR (Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xin Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In ICMR, pages 279-287. ACM, 2019.), and carries out verification on the VidSTG dataset. VidOR is the largest existing video dataset containing object relationships, containing 10,000 videos together with fine-grained annotation of the objects and relationships therein. VidOR annotates 80 object classes with dense bounding boxes and 50 relationship predicate classes (8 spatial relationships and 42 action relationships) between objects, representing each relationship as a triplet <subject, predicate, object>, each associated with a time boundary and the spatio-temporal tubes to which the subject and object belong. Appropriate triplets are selected based on VidOR, and the subject or object is described with sentences of various forms. There are many advantages to using VidOR as the underlying dataset. On the one hand, laborious annotation of bounding boxes can be avoided. On the other hand, the relationships in the triplets can simply be incorporated into the annotated sentences. For each video triplet, the subject or object is selected as the queried object, and its appearance, relationships to other objects and visual environment are then described. For the query annotations, the appearance of the queried object is ignored. A video triplet may correspond to multiple sentences.
After annotation, 4,808 video triples were obtained, corresponding to 80 query objects and 99,943 sentence descriptions. The average duration of the video is 28.01 seconds, and the average length of the object pipe is 9.68 seconds. The average number of words contained in the statement sentence and the question sentence was 11.12 and 8.98, respectively. Table 1 gives the statistics of these sentences.
TABLE 1 data set statistics on the number of statement sentences and question sentences
In a specific implementation of the invention, for the video, 5 frames per second are first sampled, and the frame count of over-long videos is down-sampled to 200 frames. A Faster R-CNN pre-trained on MSCOCO (Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755. Springer, 2014.) is then used to extract 20 region proposals (i.e., K = 20) for each frame; the region feature dimension d_r is 1024, and it is mapped to 256 before graph modeling. For the query sentence, 300-dimensional word embedding vectors are extracted using pre-trained GloVe word2vec (Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543, 2014.). As for the hyper-parameters, M is set to 5, ε is set to 0.8, θ is set to 0.2, and λ1, λ2 and λ3 are set to 1.0, 0.001 and 1.0, respectively.
The number of layers T of the space-time graph encoder is set to 2. For the temporal locator, P is set to 8 and 8 window widths [8, 16, 32, 64, 96, 128, 164, 196] are defined. The dimensions of the parameter matrices and biases are set to 256, including the parameter matrices in the explicit graph convolution layer and W_f and b_f in the temporal locator. The BiGRU network has 128-dimensional hidden states in each direction. In the training process, the Adam optimizer is applied to minimize the multi-task loss; the initial learning rate of the model is set to 0.001 and the batch size is set to 16.
The verification results are evaluated using m_tIoU, m_vIoU and vIoU@R as evaluation criteria. m_tIoU is the average temporal intersection-over-union (IoU) between the selected segment and the ground-truth segment. S_U is defined as the set of frames contained in the selected segment or the ground-truth segment, and S_I as the set of frames contained in both the selected segment and the ground-truth segment. The invention computes
vIoU = (1/|S_U|) Σ_{t∈S_I} IoU(r^t, r̂^t)
where r^t and r̂^t are respectively the selected region and the ground-truth region in the t-th frame of the video. m_vIoU is the average vIoU over samples, and vIoU@R is the ratio of samples with vIoU > R.
To verify the effectiveness of the present invention, the image-level visual grounding method Grounder (Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817-834. Springer, 2016.), and the video grounding methods STPR (Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Spatio-temporal person retrieval via natural language queries. In ICCV, pages 1453-1462, 2017.) and WSSTG (Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee K. Wong. Weakly-supervised spatio-temporally grounding natural sentence in video. In ACL, 2019.) are selected as comparison methods. Since these methods locate objects in single images or in trimmed videos, they are extended with the temporal localization methods TALL and L-Net so that they can be applied to untrimmed videos, giving the combined baselines Grounder+{·}, STPR+{·} and WSSTG+{·} compared in Tables 2 and 3.
Table 2 shows the test results for statement sentences and Table 3 the test results for question sentences, where STGRN (greedy) generates the pipeline using greedy region selection instead of the dynamic method, and Random randomly selects a time segment and a spatial region.
TABLE 2 Test results for statement sentences on the VidSTG dataset

TABLE 3 Test results for question sentences on the VidSTG dataset
As can be seen from the test results in tables 2-3:
(1) The Grounder+{·} methods locate the sentence independently in each frame, and their performance is worse than that of the STPR+{·} and WSSTG+{·} methods, which verifies that cross-frame temporal dynamics are crucial for spatio-temporal localization; moreover, the STGRN with the dynamic selection method is superior to STGRN (greedy) with the greedy method, which shows that dynamic smoothing is beneficial for generating high-quality pipelines.
(2) In terms of temporal localization, the STGRN of the present invention performs better than the frame-level localization methods TALL and L-Net, proving that spatio-temporal region modeling is effective for determining the temporal boundaries of object pipelines.
(3) For spatio-temporal localization, the STGRN of the present invention performs better on both statement sentences and question sentences than all compared methods, with or without ground-truth temporal segments, which shows that the cross-modal spatio-temporal graph reasoning of the present invention can effectively capture object relationships with spatio-temporal dynamics, and that the spatio-temporal locator can accurately retrieve objects.
Next, ablation experiments were conducted on the spatio-temporal region graph, which is a key component of STGRN. Specifically, the spatio-temporal graph includes the implicit spatial subgraph G^imp, the explicit spatial subgraph G^exp and the temporal dynamic subgraph G^tem; they are selectively discarded in this implementation to generate ablation models, and the ablation results are given in Table 4 (statement sentences and question sentences are not distinguished here). From the results in Table 4, the complete model of the present invention outperforms all ablation models, verifying that each subgraph is very helpful for spatio-temporal video localization. If only one subgraph is applied, the model using G^exp achieves the best performance, which indicates that explicit modeling is most important for capturing object relationships. Likewise, if two subgraphs are used, the model with G^exp and G^tem is superior to the other models, which suggests that spatio-temporal modeling plays a crucial role in relation understanding and high-quality video localization.
TABLE 4 ablation experimental results on VidSTG dataset
Furthermore, the number of layers T is an important hyper-parameter of the space-time graph. This example investigates the effect of T by changing its value from 1 to 5. Fig. 2 shows the experimental results on m_tIoU and m_vIoU for statement sentences and question sentences. From the results, the performance of STGRN is best when T is set to 2. A single-layer graph does not adequately capture object relationships and temporal dynamics, while too many layers may make the regions overly smooth, i.e., the features of all regions tend to become the same. The performance variation across different criteria and sentence types is essentially consistent, which illustrates that the effect of T is stable.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. The method for solving the polymorphic sentence video positioning task by using the space-time graph reasoning network is characterized by comprising the following steps of:
s1: aiming at a section of video, extracting the visual features of each frame in the video by using a Faster-RCNN network to form a visual feature set of video frames; extracting K regions from each video frame to obtain region feature vectors and region bounding-box vectors, forming a frame-level region set of the video;
s2: aiming at a query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally further obtaining the query feature of the query statement by adopting an attention method;
s3: establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing a video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain a cross-modal regional feature; then, performing T-step convolution operation on the space-time region graph through T space-time convolution layers according to the cross-modal region characteristics to finally obtain relationship sensitive region characteristics;
s4: establishing a space-time locator comprising a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; integrating the query features of the query statement and the final frame features through a space locator to obtain a matching score of each region in each video frame;
s5: the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
s6: screening out the frame t and the region i corresponding to the highest matching score obtained in step S5, calculating the link scores between the regions of frame t and frame t+1 by adopting a dynamic selection method, calculating the energy of the pipeline according to the link scores, and obtaining the spatio-temporal pipeline with the maximum energy by utilizing the Viterbi algorithm to complete the video positioning.
2. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S1 specifically comprises:
for a piece of video, K regions are extracted from each video frame with the pre-trained Faster-RCNN, giving the frame-level region set {r_i^t | i = 1, …, K; t = 1, …, N} of the video; each region r_i^t has two attributes: one is the region feature vector r_i^t ∈ R^{d_r}, the visual feature vector of the i-th region of the t-th frame in the video, where d_r denotes the dimension of the region feature vectors; the other is the region bounding-box vector b_i^t = (x_i^t, y_i^t, w_i^t, h_i^t), where x_i^t and y_i^t respectively denote the abscissa and the ordinate of the center point of the bounding box of the i-th region of the t-th frame in the video, and w_i^t and h_i^t respectively denote the width and the height of that bounding box;
in addition, the visual features of each frame in the video are extracted by the Faster-RCNN to form the visual feature set {f_t}_{t=1}^N of the video frames, where f_t denotes the visual feature of the t-th frame in the video and N denotes the number of frames of the video.
3. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S2 specifically comprises:
for a query sentence, a GloVe network is first used to obtain the word embedding vector of each word in the query sentence, and a BiGRU network is then used to obtain the word semantic features of the query sentence, forming the word semantic feature set {s_i}_{i=1}^L, s_i ∈ R^{d_s}, where s_i is the semantic feature of the i-th word, L denotes the number of words in the query sentence, and d_s denotes the dimension of the word semantic feature vectors;
from the word semantic feature set, the semantic feature s^e of the query object is selected, and the entity-sensitive feature s^a is obtained with an attention method, composing the query feature s^q: the normalized attention weights γ_i over the word semantic features are computed from s^e and s_i with two parameter matrices, the entity-sensitive feature s^a is obtained as the γ-weighted aggregation of the word semantic features, and
s^q = [s^e; s^a]
where s^a is the entity-sensitive feature, s^q is the query feature, and γ_i denotes the normalized weights.
4. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S3 specifically comprises:
establishing a space-time graph encoder comprising a video analysis layer, a cross-modal fusion layer and T spatio-temporal convolution layers, wherein the space-time graph encoder works as follows:
3.1) the video is analyzed into a space-time region graph through the video analysis layer, wherein the space-time region graph comprises three sub-graphs: an implicit spatial sub-graph G^imp = (V, E^imp) within each frame, an explicit spatial sub-graph G^exp = (V, E^exp) within each frame, and a cross-frame temporal dynamic sub-graph G^tem = (V, E^tem), where V is the vertex set of each sub-graph and all three sub-graphs take the regions of the corresponding video frames as their vertices v; E^imp, E^exp and E^tem respectively denote the edges of the implicit spatial sub-graph, the explicit spatial sub-graph and the temporal dynamic sub-graph;
3.2) the region feature vectors obtained in step S1 and the word semantic features obtained in step S2 are fused through the cross-modal fusion layer to obtain the cross-modal region features, as follows:
for the i-th region r_i^t of the t-th frame in the video, the region-sensitive text feature is calculated: the attention weights over the word semantic features s_j are computed from the similarity between r_i^t and s_j (using two parameter matrices, a bias b_m and a parameter row vector), and the region-sensitive text feature of the i-th region of the t-th frame is obtained as the attention-weighted sum of the word semantic features;
a text gate guided by the language information is established: the text gate of region r_i^t is obtained by applying a linear layer followed by the sigmoid function σ to the region-sensitive text feature, and it lies in R^{d_r}, where d_r denotes the dimension of the region feature vectors;
the gated region feature, i.e. the element-wise product ⊙ of the text gate and the region feature r_i^t, is concatenated with the region-sensitive text feature to obtain the cross-modal region feature of the i-th region of the t-th frame in the video;
3.3) each spatio-temporal convolution layer comprises a spatial graph convolution layer and a temporal graph convolution layer;
the spatial graph convolution layer works as follows:
for the cross-modal region features, implicit graph convolution is first applied on the implicit spatial sub-graph G^imp: for each region, attention weights over the regions connected to it in G^imp are computed with the parameter matrices W^imp and U^imp, and the output of the implicit spatial graph convolution layer is the weighted aggregation of the features of the connected regions;
explicit graph convolution is then applied on the explicit spatial sub-graph G^exp: the features of the regions connected to a region in G^exp are transformed with parameter matrices selected by the direction dir(i, j) of each edge (i, j) and biases selected by the label lab(i, j) of the edge, and are aggregated with the relation coefficients
α^exp = Softmax(W_r s^q + b_m)
where dir(i, j) is the direction of edge (i, j), lab(i, j) is the label of edge (i, j), W_r is a parameter matrix, b_m is a bias, α^exp is the relation coefficient vector whose entries correspond to the weights of the 51 labels, and the entry of α^exp selected by the label of edge (i, j) is the relation weight of that edge; the result is the output of the explicit spatial graph convolution layer;
the temporal graph convolution layer works as follows:
temporal graph convolution is applied on the temporal dynamic sub-graph G^tem: for each region r_i^t, semantic coefficients over its neighbours in G^tem are computed with direction-dependent parameter matrices selected by dir(i, j), and the output of the temporal graph convolution layer is the aggregation of the neighbour features transformed by the parameter matrix U^tem and weighted by these coefficients;
the outputs of the spatial graph convolution layer and the temporal graph convolution layer are combined to give the output of the l-th spatio-temporal convolution layer, and multi-step encoding by the space-time graph encoder with T spatio-temporal convolution layers yields the final relation-sensitive region features, i.e. the relation-sensitive region feature of each region of each frame in the video.
5. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the implicit spatial subgraph G^imp is constructed by fully connecting the K regions within each video frame, so that it contains K × K undirected, unlabeled edges.
6. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the explicit spatial subgraph G^exp is constructed as follows:
region triplets <r_i^t, rel, r_j^t> are extracted in each video frame as the directed, labeled edges from r_i^t to r_j^t, where r_i^t and r_j^t are the i-th and j-th regions of the t-th frame in the video and rel is the relationship predicate between them, i.e. the label of the edge;
given the feature of region i, the feature of region j and the joint feature of the union region of the two regions (the joint feature of the union region is also obtained by the Faster-RCNN), the three features are input into a classifier pre-trained on the Visual Genome dataset, and the relationship predicate between r_i^t and r_j^t is obtained by prediction.
7. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the temporal dynamics subgraph is a spatiotemporal graph inference networktemThe construction method comprises the following steps:
the connection scores between the regions of each video frame and the regions in its M adjacent forward frames and M adjacent backward frames are calculated as: Figure FDA00024159923200000514, where cos(·) is the cosine similarity of two features, IoU(·) is the intersection-over-union of two regions, ε is a balance scalar, and Figure FDA00024159923200000515 denotes the connection score between the i-th region of the t-th frame and the j-th region of the k-th frame in the video;
for Figure FDA00024159923200000516, the region Figure FDA00024159923200000517 with the highest connection score is selected from the k-th frame of the video to construct an edge, so that each region obtains 2M + 1 edges including a self-loop; the edges of the temporal dynamic subgraph are unlabeled and have three directions: forward, backward, and self-loop.
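For illustration only: a sketch of the temporal dynamic subgraph construction of claim 7, linking each region to its best-matching region in each of the M forward and M backward neighbouring frames plus a self-loop. The connection score here is read as cosine similarity plus ε times IoU; the helper box_iou, the tensor shapes, and the default values of M and eps are assumptions, and regions near the video boundary receive fewer than 2M + 1 edges in this sketch.

import torch
import torch.nn.functional as F

def box_iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2); standard intersection-over-union
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def build_temporal_edges(feats, boxes, M=2, eps=0.5):
    # feats: (T, K, D) region features, boxes: (T, K, 4) region boxes
    T, K, _ = feats.shape
    edges = []
    for t in range(T):
        for i in range(K):
            edges.append(((t, i), (t, i), 'self'))  # self-loop
            for k in range(max(0, t - M), min(T, t + M + 1)):
                if k == t:
                    continue
                # connection score: cos(feature_t_i, feature_k_j) + eps * IoU(box_t_i, box_k_j)
                scores = [F.cosine_similarity(feats[t, i], feats[k, j], dim=0)
                          + eps * box_iou(boxes[t, i].tolist(), boxes[k, j].tolist())
                          for j in range(K)]
                j_best = int(torch.stack(scores).argmax())
                direction = 'forward' if k > t else 'backward'
                edges.append(((t, i), (k, j_best), direction))  # best-linked region in frame k
    return edges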
8. The method for solving the polymorphic sentence video localization task according to claim 1, wherein the spatio-temporal localizer of step S4 comprises a temporal localizer and a spatial localizer, specifically as follows:
4.1) setting up the temporal localizer: the relation-sensitive region features are first aggregated to the frame level by an attention mechanism; for the query feature s_q, the frame-level relation-sensitive feature m_t in the video is expressed as: Figure FDA00024159923200000518, where m_t denotes the relation-sensitive feature of the t-th frame in the video, Figure FDA00024159923200000525 denotes a parameter row vector, Figure FDA00024159923200000519 denotes a parameter matrix, and b_f denotes a bias;
the set of frame-level relation-sensitive features in the video, Figure FDA00024159923200000520, is concatenated with the corresponding set of video-frame visual features, Figure FDA00024159923200000521, and another BiGRU is used to learn the final frame feature set Figure FDA00024159923200000522; next, at each frame t, multi-scale candidate clips are defined as Figure FDA00024159923200000523, where Figure FDA00024159923200000524 are the start and end boundaries of the k-th clip at the t-th frame in the video, w_k is the width of the k-th clip, and P is the number of clips; then all candidate clips are scored by a linear layer with a sigmoid function, and boundary offsets are generated at the same time, computed as follows:
C_t = σ(W_c[h_t; s_q] + b_c)
δ_t = W_o[h_t; s_q] + b_o
where Figure FDA0002415992320000061 contains the confidence scores of the P candidate clips at frame t, Figure FDA0002415992320000062 contains the offsets of the P clips, W_c and W_o are parameter matrices, b_c and b_o are biases, and σ(·) is the sigmoid function;
the temporal localizer has two losses: an alignment loss for clip selection and a regression loss for boundary adjustment; the alignment loss formula is as follows: Figure FDA0002415992320000063, where Figure FDA0002415992320000064 denotes the temporal intersection-over-union between the k-th candidate clip and the ground-truth clip at frame t, and Figure FDA0002415992320000065 denotes the k-th element of C_t, i.e., the confidence score of the k-th candidate clip at frame t;
only the boundaries of the best clip, i.e., the clip with the highest Figure FDA0002415992320000066, are adjusted; its boundaries are (s, e) and its predicted offsets are (δ_s, δ_e); first, the ground-truth offsets of the clip, Figure FDA0002415992320000068 and Figure FDA0002415992320000069, are computed according to the ground-truth boundaries Figure FDA0002415992320000067; the regression loss formula is as follows: Figure FDA00024159923200000610, where R denotes the smooth L1 function;
4.2) establishing the spatial localizer to locate the target region in each frame: the relation-sensitive region features Figure FDA00024159923200000611 are fused with the query feature s_q and the final frame feature h_t to estimate the matching score Figure FDA00024159923200000612 of each region, with the formula: Figure FDA00024159923200000613, where Figure FDA00024159923200000614 is the matching score of the i-th region of the t-th frame in the video, σ(·) is the sigmoid function, w_c is a parameter matrix, and b_c is a bias;
the spatial loss formula is as follows: Figure FDA00024159923200000615, where S_t is the set of ground-truth frames, and Figure FDA00024159923200000616 is the spatial intersection-over-union between region Figure FDA00024159923200000617 and the ground-truth region.
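For illustration only: a sketch of the two localizer heads of claim 8: attention aggregation of relation-sensitive region features to a frame-level feature m_t, a BiGRU over the concatenated frame features to obtain h_t, a sigmoid-activated linear head producing the P candidate-clip confidences and their boundary offsets, and a per-region matching-score head. The attention form, the BiGRU configuration, the concatenation choices, and all dimensions are assumptions of this sketch.

import torch
import torch.nn as nn

class SpatioTemporalLocalizer(nn.Module):
    def __init__(self, dim, num_clips):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(3 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.bigru = nn.GRU(2 * dim, dim, bidirectional=True, batch_first=True)
        self.clip_scores = nn.Linear(3 * dim, num_clips)       # C_t (after sigmoid)
        self.clip_offsets = nn.Linear(3 * dim, 2 * num_clips)  # start/end offsets per clip
        self.region_score = nn.Linear(4 * dim, 1)              # spatial matching score

    def forward(self, region_feats, frame_feats, s_q):
        # region_feats: (T, K, dim) relation-sensitive region features
        # frame_feats:  (T, dim) visual features of the video frames
        # s_q:          (dim,) query feature
        T, K, dim = region_feats.shape
        q = s_q.expand(T, K, dim)
        att_in = torch.cat([region_feats, q, frame_feats.unsqueeze(1).expand(T, K, dim)], dim=-1)
        att = torch.softmax(self.att(att_in), dim=1)            # attention over the K regions of a frame
        m = (att * region_feats).sum(dim=1)                     # (T, dim) frame-level features m_t
        h, _ = self.bigru(torch.cat([m, frame_feats], dim=-1).unsqueeze(0))
        h = h.squeeze(0)                                        # (T, 2*dim) final frame features h_t
        hs = torch.cat([h, s_q.expand(T, dim)], dim=-1)         # [h_t ; s_q]
        C = torch.sigmoid(self.clip_scores(hs))                 # (T, P) candidate-clip confidences
        delta = self.clip_offsets(hs).view(T, -1, 2)            # (T, P, 2) boundary offsets
        r = torch.cat([region_feats, h.unsqueeze(1).expand(T, K, 2 * dim), s_q.expand(T, K, dim)], dim=-1)
        match = torch.sigmoid(self.region_score(r)).squeeze(-1)  # (T, K) spatial matching scores
        return C, delta, match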
9. The method for solving the polymorphic sentence video localization task using a spatio-temporal graph inference network according to claim 8, wherein the multi-task loss function is as follows: Figure FDA00024159923200000618, where λ_1, λ_2 and λ_3 are hyper-parameters that control the balance among the three losses.
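For illustration only: a sketch of a λ-weighted multi-task objective in the spirit of claim 9. The exact alignment, regression, and spatial loss formulas in claims 8 and 9 are embedded as images and are not reproduced here; this sketch assumes an IoU-weighted binary cross-entropy for the alignment and spatial terms and a smooth L1 term for the boundary regression, which is only one plausible reading.

import torch.nn.functional as F

def multitask_loss(clip_conf, clip_iou, pred_offsets, gt_offsets,
                   match_scores, region_iou, gt_frame_mask,
                   lambda1=1.0, lambda2=1.0, lambda3=1.0):
    # clip_conf, clip_iou: (T, P) candidate-clip confidences and temporal IoU targets
    # pred_offsets, gt_offsets: offsets of the best clip
    # match_scores, region_iou: (T, K) region matching scores and spatial IoU targets
    # gt_frame_mask: (T,) boolean mask of ground-truth frames
    align = F.binary_cross_entropy(clip_conf, clip_iou)                 # assumed alignment loss
    regress = F.smooth_l1_loss(pred_offsets, gt_offsets)                # assumed regression loss
    spatial = F.binary_cross_entropy(match_scores[gt_frame_mask],
                                     region_iou[gt_frame_mask])         # assumed spatial loss
    return lambda1 * align + lambda2 * regress + lambda3 * spatial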
10. The method for solving the polymorphic sentence video localization task using a spatio-temporal graph inference network as claimed in claim 1, wherein in step S6 the linking score is computed as follows: Figure FDA0002415992320000071, where s(·) denotes the linking score, Figure FDA0002415992320000072 and Figure FDA0002415992320000073 are the matching scores of regions Figure FDA0002415992320000074 and Figure FDA0002415992320000075, θ is a balance scalar, and IoU(·) is the intersection-over-union function;
the energy is computed as follows: Figure FDA0002415992320000076, where E(·) denotes the energy, (T_e, T_s) is the temporal boundary, and Y denotes a tube; the Viterbi algorithm is used directly to obtain the region set Figure FDA0002415992320000077 that maximizes E(Y) as the final tube Y.
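For illustration only: a Viterbi-style dynamic programming sketch of the tube linking in claim 10, choosing one region per frame inside the temporal boundary so that the accumulated matching scores plus θ-weighted IoU between consecutive regions is maximal. The helper _iou, the score accumulation, and the default value of theta follow our reading of the claim and are assumptions, since the claimed formulas are images.

import numpy as np

def _iou(a, b):
    # a, b: (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def link_tube(match_scores, boxes, t_start, t_end, theta=1.0):
    # match_scores: (T, K) region matching scores, boxes: (T, K, 4) region boxes
    # returns the region index chosen in each frame of [t_start, t_end]
    K = match_scores.shape[1]
    dp = match_scores[t_start].copy()   # best accumulated energy ending in each region of the current frame
    back = []
    for t in range(t_start + 1, t_end + 1):
        prev_dp = dp
        dp = np.empty(K)
        ptr = np.empty(K, dtype=int)
        for j in range(K):
            cand = [prev_dp[i] + theta * _iou(boxes[t - 1][i], boxes[t][j]) for i in range(K)]
            ptr[j] = int(np.argmax(cand))
            dp[j] = cand[ptr[j]] + match_scores[t][j]
        back.append(ptr)
    # trace back the region indices of the maximum-energy tube Y
    tube = [int(np.argmax(dp))]
    for ptr in reversed(back):
        tube.append(int(ptr[tube[-1]]))
    return list(reversed(tube))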
CN202010191264.5A 2020-03-18 2020-03-18 Multi-form sentence video positioning method based on space-time diagram inference network Active CN111414845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010191264.5A CN111414845B (en) 2020-03-18 2020-03-18 Multi-form sentence video positioning method based on space-time diagram inference network

Publications (2)

Publication Number Publication Date
CN111414845A true CN111414845A (en) 2020-07-14
CN111414845B CN111414845B (en) 2023-06-16

Family

ID=71491198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010191264.5A Active CN111414845B (en) 2020-03-18 2020-03-18 Multi-form sentence video positioning method based on space-time diagram inference network

Country Status (1)

Country Link
CN (1) CN111414845B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140324864A1 (en) * 2013-04-12 2014-10-30 Objectvideo, Inc. Graph matching by sub-graph grouping and indexing
US20190171954A1 (en) * 2016-05-13 2019-06-06 Numenta, Inc. Inferencing and learning based on sensorimotor input data
CN110377792A (en) * 2019-06-14 2019-10-25 浙江大学 A method of task is extracted using the video clip that the cross-module type Internet solves Problem based learning
CN110704601A (en) * 2019-10-11 2020-01-17 浙江大学 Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
褚一平; 叶修梓; 张引; 张三元: "Anti-jitter video segmentation algorithm based on hierarchical MRF model" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022088238A1 (en) * 2020-10-27 2022-05-05 浙江工商大学 Progressive positioning method for text-to-video clip positioning
US11941872B2 (en) 2020-10-27 2024-03-26 Zhejiang Gongshang University Progressive localization method for text-to-video clip localization
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN113204675A (en) * 2021-07-07 2021-08-03 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN113204675B (en) * 2021-07-07 2021-09-21 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN113688296A (en) * 2021-08-10 2021-11-23 哈尔滨理工大学 Method for solving video question-answering task based on multi-mode progressive attention model

Also Published As

Publication number Publication date
CN111414845B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
Zhao et al. Jsnet: Joint instance and semantic segmentation of 3d point clouds
Yang et al. Pipeline magnetic flux leakage image detection algorithm based on multiscale SSD network
CN111414845A (en) Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network
CN111914778B (en) Video behavior positioning method based on weak supervision learning
CN109684912A (en) A kind of video presentation method and system based on information loss function
CN111666406B (en) Short text classification prediction method based on word and label combination of self-attention
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN115391570A (en) Method and device for constructing emotion knowledge graph based on aspects
CN115687760A (en) User learning interest label prediction method based on graph neural network
CN115311465A (en) Image description method based on double attention models
CN114399661A (en) Instance awareness backbone network training method
Qi et al. Dgrnet: A dual-level graph relation network for video object detection
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
CN114120367B (en) Pedestrian re-recognition method and system based on circle loss measurement under meta-learning framework
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
CN115661542A (en) Small sample target detection method based on feature relation migration
CN111858999B (en) Retrieval method and device based on segmentation difficult sample generation
Zheng Multiple-level alignment for cross-domain scene text detection
CN114004233A (en) Remote supervision named entity recognition method based on semi-training and sentence selection
Qu et al. Illation of video visual relation detection based on graph neural network
CN117237984B (en) MT leg identification method, system, medium and equipment based on label consistency
CN116150038B (en) Neuron sensitivity-based white-box test sample generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant