CN111414845A - Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network - Google Patents

Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network

Info

Publication number
CN111414845A
CN111414845A (application CN202010191264.5A; granted as CN111414845B)
Authority
CN
China
Prior art keywords
video
frame
region
time
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010191264.5A
Other languages
Chinese (zh)
Other versions
CN111414845B (en)
Inventor
赵洲
张品涵
张竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010191264.5A priority Critical patent/CN111414845B/en
Publication of CN111414845A publication Critical patent/CN111414845A/en
Application granted granted Critical
Publication of CN111414845B publication Critical patent/CN111414845B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Operations Research (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for solving the polymorphic sentence video positioning task with a space-time graph reasoning network, belonging to the field of natural language visual positioning. The invention first analyzes the video into a space-time region graph, which contains not only implicit and explicit spatial sub-graphs within each frame but also a cross-frame temporal dynamic sub-graph. Next, text cues are added to the spatio-temporal region graph and multi-step cross-modal graph reasoning is established; the multi-step process supports multi-order relation modeling. Thereafter, the temporal boundaries of the pipeline are determined using a temporal locator, and an object is then located in each frame using a spatial locator with a dynamic selection method, resulting in a smooth pipeline. The invention does not require the video to be trimmed when grounding natural language, which reduces the cost of video positioning; it can effectively process question sentences and statement sentences, provides technical support for research combining higher-level natural language processing and computer vision (such as video question answering), and has broad application prospects.

Description

Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network
Technical Field
The invention relates to the field of natural language visual positioning, in particular to a method for solving a polymorphic sentence video positioning task by using a space-time graph reasoning network.
Background
Visual localization of natural language is a fundamental and crucial task in the field of visual understanding. The goal of this task is to locate, from a temporal and spatial perspective, the object described by a given natural language expression in the visual content. In recent years, researchers have begun to focus on natural language (sentence) localization in video, including temporal localization and spatio-temporal localization. Temporal localization obtains the time segment in which the object appears in the video; spatio-temporal localization further obtains, on the basis of temporal localization, the regions where the object appears, and the set of regions occupied by such a series of objects is also called a spatio-temporal tube because of its temporal and spatial continuity.
At present, methods for this task are few and have strong limitations. Existing video localization methods usually extract a set of spatio-temporal pipelines from a trimmed video and then identify the target pipeline that matches the sentence. However, this framework cannot accomplish Spatio-Temporal Video Grounding (STVG) for polymorphic sentences. On the one hand, the performance of the framework depends largely on the quality of the candidate pipelines, but it is difficult to generate high-quality pipelines in advance without textual clues, because a sentence may describe only a short-term state of the object within a very small segment, whereas the existing pipeline pre-generation frameworks can only produce complete object pipelines in trimmed videos. On the other hand, these methods only consider single-pipeline modeling and ignore the relationships between objects, so they can only process traditional statement sentences and cannot process question sentences that query unknown objects. However, object relationships are an important clue for the STVG task, especially for question sentences, which may only provide the interactions of an unknown object with other objects: lacking explicit appearance attributes of the object, locating a question sentence can only depend on the relationships (e.g., action relationships and spatial relationships) between the unknown object and other objects, so relationship modeling and cross-modal relationship reasoning are crucial. Therefore, the existing methods cannot handle the STVG task.
In addition, existing visual graph modeling methods often construct a spatial graph within a single image and cannot exploit the temporal dynamic information in a video to distinguish subtle differences between object actions, such as opening a door versus closing a door. Therefore, there is a need for a method that can solve the polymorphic sentence video positioning task: given an untrimmed video and a statement sentence or question sentence describing an object, locate the spatio-temporal pipeline of the queried object.
Disclosure of Invention
Aiming at the defect that the prior art cannot solve the polymorphic sentence video positioning task, the invention provides a method for solving the polymorphic sentence video positioning task with a space-time graph inference network. The invention first analyzes the video into a spatio-temporal region graph that contains not only implicit and explicit spatial subgraphs within each frame but also a cross-frame temporal dynamic subgraph. The spatial subgraphs capture region-level relationships through an implicit or explicit attention mechanism, and the temporal dynamic subgraph takes the dynamics of objects and their transformation across frames into consideration, further improving the network's understanding of the relationships between objects. Next, text cues are added to the space-time region graph and multi-step cross-modal graph reasoning is established; the multi-step process supports multi-order relation modeling. Thereafter, the temporal boundaries of the pipeline are determined with a temporal locator, and an object is then located in each frame with a spatial locator using a dynamic selection method, resulting in a smooth pipeline.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for solving the polymorphic sentence video positioning task by using the space-time graph reasoning network comprises the following steps:
s1: aiming at a section of video, extracting the visual features of each frame in the video by using a Faster-RCNN network to form a visual feature set of video frames; extracting K regions from each video frame to obtain region feature vectors and region bounding-box vectors, forming a frame-level region set of the video;
s2: aiming at a query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally further obtaining the query feature of the query statement by adopting an attention method;
s3: establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing a video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain a cross-modal regional feature; then, performing T-step convolution operation on the space-time region graph through T space-time convolution layers according to the cross-modal region characteristics to finally obtain relationship sensitive region characteristics;
s4: establishing a space-time locator comprising a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; integrating the query features of the query statement and the final frame features through a space locator to obtain a matching score of each region in each video frame;
s5: the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
s6: screening out the frame t and the region i corresponding to the highest matching score obtained in step S5, calculating the link scores between the regions of frame t and frame t+1 by adopting a dynamic selection method, calculating the energy of the pipeline according to the link scores, and obtaining the spatio-temporal pipeline with the maximum energy by utilizing the Viterbi algorithm to complete the video positioning.
Further, the step S1 specifically includes:
For a piece of video, K regions are extracted from each video frame with the pre-trained Faster-RCNN, giving the frame-level region set {r_i^t | i = 1, …, K; t = 1, …, N} of the video. Each region r_i^t has two attributes. One is the region feature vector r_i^t ∈ R^{d_r}, the visual feature vector of the i-th region of the t-th frame in the video, where d_r denotes the dimension of the region feature vectors. The other is the region bounding-box vector b_i^t = (x_i^t, y_i^t, w_i^t, h_i^t), where x_i^t and y_i^t respectively denote the abscissa and the ordinate of the center point of the bounding box of the i-th region of the t-th frame in the video, and w_i^t and h_i^t respectively denote the width and the height of that bounding box.
In addition, the visual features of each frame in the video are extracted by the Faster-RCNN to form the visual feature set {f_t}_{t=1}^N of the video frames, where f_t denotes the visual feature of the t-th frame in the video and N denotes the number of frames of the video.
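Faster-RCNN style detectors typically return corner-format boxes; the following sketch (illustrative only, with assumed names and formats, not part of the patented method) shows one way to convert them into the centre/width/height bounding-box vectors b_i^t described above:

```python
import numpy as np

def to_center_format(boxes_xyxy: np.ndarray) -> np.ndarray:
    """Convert detector boxes from (x1, y1, x2, y2) corners to the
    (x, y, w, h) bounding-box vectors described above: centre coordinates
    plus width and height."""
    x1, y1, x2, y2 = np.split(boxes_xyxy.astype(np.float32), 4, axis=-1)
    w = x2 - x1
    h = y2 - y1
    x = x1 + 0.5 * w
    y = y1 + 0.5 * h
    return np.concatenate([x, y, w, h], axis=-1)

# Example: one detected box for one frame (K such boxes per frame in practice).
boxes = np.array([[10.0, 20.0, 110.0, 220.0]])
print(to_center_format(boxes))  # [[ 60. 120. 100. 200.]]
```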
Further, the step S2 specifically includes:
For a query sentence, a GloVe network is first used to obtain the word embedding vector of each word in the query sentence, and a BiGRU network is then used to obtain the word semantic features of the query sentence, forming the word semantic feature set {s_i}_{i=1}^L, s_i ∈ R^{d_s}, where s_i is the semantic feature of the i-th word, formed by concatenating the forward and backward hidden states of the i-th node of the BiGRU network, L denotes the number of words in the query sentence, and d_s denotes the dimension of the word semantic feature vectors.
From the word semantic feature set, the semantic feature s^e of the query object is selected, and an entity-sensitive feature s^a is obtained with an attention method; the entity-sensitive feature s^a and the semantic feature s^e of the query object together compose the query feature s^q. The normalized attention weights γ_i over the word semantic features are computed from s^e and s_i with two parameter matrices, the entity-sensitive feature s^a is obtained as the γ-weighted aggregation of the word semantic features, and
s^q = [s^e; s^a]
where s^a is the entity-sensitive feature, s^q is the query feature, and γ_i denotes the normalized weights.
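A minimal sketch of the attention pooling described above; the exact weighting formula is not reproduced here, so a standard additive attention form is assumed, and the module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class QueryFeature(nn.Module):
    """Pool word features {s_i} into an entity-sensitive feature s^a guided by
    the query-object feature s^e, then form s^q = [s^e; s^a].
    An additive attention form is assumed for illustration."""
    def __init__(self, d_s: int):
        super().__init__()
        self.w1 = nn.Linear(d_s, d_s, bias=False)
        self.w2 = nn.Linear(d_s, d_s, bias=False)
        self.v = nn.Linear(d_s, 1, bias=False)

    def forward(self, words: torch.Tensor, s_e: torch.Tensor) -> torch.Tensor:
        # words: (L, d_s) word semantic features; s_e: (d_s,) query-object feature
        scores = self.v(torch.tanh(self.w1(s_e) + self.w2(words))).squeeze(-1)  # (L,)
        gamma = torch.softmax(scores, dim=0)                                    # normalized weights
        s_a = (gamma.unsqueeze(-1) * words).sum(dim=0)                          # entity-sensitive feature
        return torch.cat([s_e, s_a], dim=-1)                                    # s^q = [s^e; s^a]

s_q = QueryFeature(d_s=256)(torch.randn(12, 256), torch.randn(256))
print(s_q.shape)  # torch.Size([512])
```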
Further, the step S3 specifically includes:
A space-time graph encoder is established, comprising a video analysis layer, a cross-modal fusion layer and T spatio-temporal convolution layers. The space-time graph encoder works as follows:
3.1) The video is analyzed into a space-time region graph through the video analysis layer. The space-time region graph comprises three sub-graphs: an implicit spatial sub-graph G^imp = (V, E^imp) within each frame, an explicit spatial sub-graph G^exp = (V, E^exp) within each frame, and a cross-frame temporal dynamic sub-graph G^tem = (V, E^tem). Here V is the vertex set of each sub-graph: all three sub-graphs take the regions of the corresponding video frames as their vertices. E^imp, E^exp and E^tem respectively denote the edges of the implicit spatial sub-graph, the explicit spatial sub-graph and the temporal dynamic sub-graph.
3.2) The region feature vectors obtained in step S1 and the word semantic features obtained in step S2 are fused through the cross-modal fusion layer to obtain the cross-modal region features, as follows:
For the i-th region r_i^t of the t-th frame in the video, the attention weights over the word semantic features s_j are computed from the similarity between r_i^t and s_j (using two parameter matrices, a bias b_m and a parameter row vector w^T), and the region-sensitive text feature of the i-th region of the t-th frame is obtained as the attention-weighted sum of the word semantic features.
A text gate guided by the language information is then built from the region-sensitive text feature to attenuate text-irrelevant regions: the text gate of region r_i^t is obtained by applying a linear layer followed by the sigmoid function σ to the region-sensitive text feature, and it lies in R^{d_r}, where d_r denotes the dimension of the region feature vectors.
The gated region feature, i.e. the element-wise product ⊙ of the text gate and the region feature r_i^t, is concatenated with the region-sensitive text feature to obtain the cross-modal region feature of the i-th region of the t-th frame in the video.
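A hedged sketch of this cross-modal fusion step: attention over the words to get a region-sensitive text feature, a sigmoid text gate, and concatenation of the gated region feature with the text feature. The attention form and layer names are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse one region feature with the word features: attend over words to get a
    region-sensitive text feature q, gate the region feature with sigmoid(W q + b),
    and concatenate. The exact attention form and the layer names are assumptions."""
    def __init__(self, d_r: int, d_s: int):
        super().__init__()
        self.wm1 = nn.Linear(d_r, d_s, bias=False)
        self.wm2 = nn.Linear(d_s, d_s, bias=True)
        self.score = nn.Linear(d_s, 1, bias=False)
        self.gate = nn.Linear(d_s, d_r)

    def forward(self, region: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # region: (d_r,) feature of one region; words: (L, d_s) word semantic features
        sim = self.score(torch.tanh(self.wm1(region) + self.wm2(words))).squeeze(-1)  # similarity to each word
        attn = torch.softmax(sim, dim=0)
        q = (attn.unsqueeze(-1) * words).sum(dim=0)      # region-sensitive text feature
        g = torch.sigmoid(self.gate(q))                  # text gate in R^{d_r}
        return torch.cat([g * region, q], dim=-1)        # cross-modal region feature

x = CrossModalFusion(d_r=256, d_s=256)(torch.randn(256), torch.randn(12, 256))
print(x.shape)  # torch.Size([512])
```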
3.3) Each spatio-temporal convolution layer comprises a spatial graph convolution layer and a temporal graph convolution layer. The spatial graph convolution layer is used to obtain the visual relationships between the regions of each frame, as follows:
For the cross-modal region features, implicit graph convolution is first applied on the implicit spatial sub-graph G^imp: for each region, attention weights over the regions connected to it in G^imp are computed with the parameter matrices W^imp and U^imp, and the output of the implicit spatial graph convolution layer is the weighted aggregation of the features of the connected regions.
Explicit graph convolution is then applied on the explicit spatial sub-graph G^exp: the features of the regions connected to a region in G^exp are transformed with parameter matrices selected by the direction dir(i, j) of each edge (i, j) and biases selected by the label lab(i, j) of the edge, and are aggregated with the relation coefficients
α^exp = Softmax(W_r s^q + b_m)
where dir(i, j) is the direction of edge (i, j), lab(i, j) is the label of edge (i, j), W_r is a parameter matrix, b_m is a bias, α^exp is the relation coefficient vector whose entries correspond to the weights of the 51 labels, and the entry of α^exp selected by the label of edge (i, j) is the relation weight of that edge; the result is the output of the explicit spatial graph convolution layer.
The temporal graph convolution layer is used to obtain the dynamics and transformations of objects across frames, as follows:
Temporal graph convolution is applied on the temporal dynamic sub-graph G^tem: for each region r_i^t, semantic coefficients over its neighbours in G^tem are computed with direction-dependent parameter matrices selected by dir(i, j), and the output of the temporal graph convolution layer is the aggregation of the neighbour features transformed by the parameter matrix U^tem and weighted by these coefficients.
The outputs of the spatial graph convolution layer and the temporal graph convolution layer are combined to give the output of the l-th spatio-temporal convolution layer. Multi-step encoding by the space-time graph encoder with T spatio-temporal convolution layers (the output of the (l−1)-th layer is taken as the input of the l-th layer) yields the final relation-sensitive region features, i.e. the relation-sensitive region feature of each region of each frame in the video.
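As an illustration of what a single implicit spatial graph convolution can look like over the K fully connected regions of one frame, here is a minimal sketch assuming a scaled dot-product form for the attention coefficients (the patent's exact formula is not reproduced):

```python
import torch
import torch.nn as nn

class ImplicitSpatialConv(nn.Module):
    """One implicit spatial graph convolution over the K fully connected regions of
    a frame: pairwise attention coefficients followed by a weighted aggregation of
    neighbour features. The dot-product attention form is an assumption."""
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(d, d, bias=False)   # plays the role of W^imp
        self.u = nn.Linear(d, d, bias=False)   # plays the role of U^imp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (K, d) cross-modal region features of one frame
        a = torch.softmax(self.w(x) @ x.t() / x.shape[-1] ** 0.5, dim=-1)  # (K, K) attention
        return a @ self.u(x)                                               # aggregated neighbour features

out = ImplicitSpatialConv(d=512)(torch.randn(20, 512))
print(out.shape)  # torch.Size([20, 512])
```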
The implicit spatial sub-graph G^imp is constructed by fully connecting the K regions within each video frame, so that it contains K × K undirected, unlabeled edges.
The explicit spatial sub-graph G^exp is constructed as follows:
In each video frame, region triplets <r_i^t, rel, r_j^t> are extracted as the directed, labeled edges from r_i^t to r_j^t, where r_i^t and r_j^t are the i-th and j-th regions of the t-th frame in the video and rel is the relationship predicate between them, i.e. the label of the edge.
Given the feature of region i, the feature of region j and the joint feature of the union region of the two regions (the joint feature of the union region is also obtained by the Faster-RCNN), the three features are input into a classifier pre-trained on the Visual Genome dataset, and the relationship predicate between r_i^t and r_j^t is obtained by prediction.
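A small sketch of how the labeled, directed edges of the explicit spatial sub-graph can be assembled once a relationship classifier is available; `predict_rel` is a hypothetical stand-in for the classifier pre-trained on Visual Genome:

```python
from typing import Callable, List, Tuple
import numpy as np

def build_explicit_edges(
    feats: np.ndarray,        # (K, d_r) region features of one frame
    union_feats: np.ndarray,  # (K, K, d_u) joint features of the union regions
    predict_rel: Callable[[np.ndarray, np.ndarray, np.ndarray], int],
) -> List[Tuple[int, int, int]]:
    """Build the directed, labelled edges of the explicit spatial sub-graph for one
    frame. `predict_rel` stands in for the pre-trained relationship classifier and
    is assumed to return a predicate label (0 meaning "no relation")."""
    edges = []
    K = feats.shape[0]
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            label = predict_rel(feats[i], feats[j], union_feats[i, j])
            if label > 0:
                edges.append((i, j, label))  # edge i -> j with relationship predicate `label`
    return edges

dummy = lambda a, b, u: 1  # dummy classifier for illustration only
print(len(build_explicit_edges(np.zeros((3, 8)), np.zeros((3, 3, 8)), dummy)))  # 6
```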
The temporal dynamic sub-graph G^tem is constructed as follows:
For each video frame, the linking scores between its regions and the regions in the M adjacent forward frames and M adjacent backward frames are calculated. The linking score between the i-th region of the t-th frame and the j-th region of the k-th frame in the video combines the cosine similarity cos(·) of the two region features with the intersection-over-union IoU(·) of the two regions, balanced by the scalar ε.
For r_i^t, the region with the highest linking score in the k-th frame of the video is selected to build an edge, so that each region obtains 2M + 1 edges including a self-loop. The edges E^tem of the temporal dynamic sub-graph are unlabeled and have three directions: forward, backward and self-loop.
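A minimal sketch of this temporal linking step, assuming an additive combination of cosine similarity and IoU (the exact balancing form is an assumption) and centre-format boxes:

```python
import numpy as np

def iou(b1: np.ndarray, b2: np.ndarray) -> float:
    """IoU of two (x, y, w, h) centre-format boxes."""
    x11, y11 = b1[0] - b1[2] / 2, b1[1] - b1[3] / 2
    x12, y12 = b1[0] + b1[2] / 2, b1[1] + b1[3] / 2
    x21, y21 = b2[0] - b2[2] / 2, b2[1] - b2[3] / 2
    x22, y22 = b2[0] + b2[2] / 2, b2[1] + b2[3] / 2
    iw = max(0.0, min(x12, x22) - max(x11, x21))
    ih = max(0.0, min(y12, y22) - max(y11, y21))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def best_link(feat_i, box_i, feats_k, boxes_k, eps=0.8):
    """Pick the region in a neighbouring frame with the highest linking score.
    The additive form cos + eps * IoU is an assumption for illustration."""
    scores = []
    for feat_j, box_j in zip(feats_k, boxes_k):
        cos = float(feat_i @ feat_j / (np.linalg.norm(feat_i) * np.linalg.norm(feat_j) + 1e-8))
        scores.append(cos + eps * iou(box_i, box_j))
    return int(np.argmax(scores)), max(scores)
```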
Further, the step S4 specifically includes:
4.1) The temporal locator is established. The relation-sensitive region features are first aggregated to the frame level with an attention mechanism: for the query feature s^q, the frame-level relation-sensitive feature m_t of the t-th frame in the video is obtained as the attention-weighted sum of the relation-sensitive region features of that frame, where w^T denotes a parameter row vector, W_f denotes a parameter matrix and b_f denotes a bias.
The frame-level relation-sensitive feature set {m_t}_{t=1}^N of the video is concatenated with the corresponding visual feature set {f_t}_{t=1}^N of the video frames, and another BiGRU is used to learn the final frame feature set {h_t}_{t=1}^N.
Next, at each frame t, the multi-scale candidate clips are defined as {(s_t^k, e_t^k)}_{k=1}^P, where s_t^k and e_t^k are the start and end boundaries of the k-th clip of the t-th frame in the video, w_k is the width of the k-th clip, and P is the number of clips. All candidate clips are then estimated through a linear layer with the sigmoid function, and the offsets of the boundaries are produced at the same time, computed as:
C_t = σ(W_c[h_t; s^q] + b_c)
δ_t = W_o[h_t; s^q] + b_o
where C_t ∈ R^P is the vector of confidence scores of the P candidate clips at frame t, δ_t contains the boundary offsets of the P clips, W_c and W_o are parameter matrices, b_c and b_o are biases, and σ(·) is the sigmoid function.
The temporal locator has two losses: the alignment loss of clip selection and the regression loss of boundary adjustment. The alignment loss L_align is computed from the temporal intersection-over-union of each candidate clip with the ground-truth clip at frame t and the corresponding confidence score, i.e. the k-th element of C_t, the confidence score of the k-th candidate clip at frame t.
Next, the boundary (s, e) of the best candidate clip, i.e. the one with the highest score, is fine-tuned with its offsets (δ_s, δ_e): the ground-truth offsets of the clip are first calculated from the ground-truth boundary, and the regression loss L_reg applies the smooth L1 function R to the differences between the predicted offsets and the ground-truth offsets.
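A hedged sketch of the temporal locator heads and losses: the confidence/offset heads follow the two formulas above, while the IoU-supervised binary cross-entropy and smooth-L1-on-offsets forms are assumed concrete instantiations of the alignment and regression losses described in prose:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalHead(nn.Module):
    """Score P candidate clips per frame and regress their boundary offsets from
    [h_t; s^q], following C_t = sigma(W_c[h_t; s^q] + b_c) and
    delta_t = W_o[h_t; s^q] + b_o. The loss forms below are assumptions."""
    def __init__(self, d: int, P: int):
        super().__init__()
        self.conf = nn.Linear(d, P)
        self.offs = nn.Linear(d, 2 * P)

    def forward(self, h_t: torch.Tensor, s_q: torch.Tensor):
        z = torch.cat([h_t, s_q], dim=-1)
        return torch.sigmoid(self.conf(z)), self.offs(z)

head = TemporalHead(d=512 + 512, P=8)
conf, offs = head(torch.randn(512), torch.randn(512))
target_iou = torch.rand(8)                    # temporal IoU of each candidate with the ground truth
align_loss = F.binary_cross_entropy(conf, target_iou)                # assumed alignment loss
reg_loss = F.smooth_l1_loss(offs[:2], torch.tensor([0.5, -0.3]))     # offsets of the best clip only
```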
4.2) The spatial locator is established to locate the target region in each frame. The matching score of the i-th region of the t-th frame in the video is estimated by fusing the relation-sensitive region feature of that region with the query feature s^q and the final frame feature h_t, and applying a linear layer (with parameter matrix W_c and bias b_c) followed by the sigmoid function σ(·).
The spatial loss L_space is computed over S_t, the set of ground-truth frames, from the matching scores and the intersection-over-union between each region r_i^t and the ground-truth region.
The multi-task loss function is
L = λ1 · L_align + λ2 · L_reg + λ3 · L_space
where λ1, λ2 and λ3 are hyper-parameters that control the balance between the three losses.
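A trivial illustration of the multi-task combination, using the λ values given in the embodiment below; the individual loss values are placeholders:

```python
# Hypothetical loss values for illustration; the lambda values follow the embodiment (1.0, 0.001, 1.0).
lam1, lam2, lam3 = 1.0, 0.001, 1.0
align_loss, reg_loss, space_loss = 0.7, 0.2, 0.5
total_loss = lam1 * align_loss + lam2 * reg_loss + lam3 * space_loss
print(total_loss)  # 0.7 + 0.0002 + 0.5 = 1.2002
```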
Further, the step S5 specifically includes:
the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
further, the step S6 specifically includes:
The link score s(·) between a region of frame t and a region of frame t+1 combines the matching scores of the two regions with the intersection-over-union IoU(·) of their bounding boxes, balanced by the scalar θ, which is set to 0.2.
The energy E(Y) of a pipeline Y over the temporal boundary (T_s, T_e) is computed by accumulating the link scores between the regions selected in consecutive frames, and the Viterbi algorithm is directly used to obtain the region set that maximizes E(Y) as the final pipeline Y.
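A minimal dynamic-programming sketch of the Viterbi-style tube linking described above; the additive link-score form (matching scores plus θ-weighted IoU) is an assumption for illustration:

```python
import numpy as np

def link_tube(match_scores: np.ndarray, iou: np.ndarray, theta: float = 0.2):
    """Viterbi-style linking: pick one region per frame so that the summed link
    scores are maximal. match_scores has shape (T, K); iou[t, i, j] is the IoU
    between region i of frame t and region j of frame t+1 (shape (T-1, K, K)).
    The additive link-score form used here is an assumption."""
    T, K = match_scores.shape
    dp = match_scores[0].copy()           # best energy of a tube ending at each region of frame 0
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate energies when linking region i of frame t-1 to region j of frame t
        link = dp[:, None] + match_scores[t][None, :] + theta * iou[t - 1]
        back[t] = link.argmax(axis=0)
        dp = link.max(axis=0)
    # backtrack the region index chosen in each frame
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(dp.max())

path, energy = link_tube(np.random.rand(5, 4), np.random.rand(4, 4, 4))
print(path, energy)
```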
The invention has the following beneficial effects:
(1) The invention does not need to trim the video when grounding natural language, can directly process long videos, and reduces the cost of video positioning;
(2) The space-time region graph obtained in the visual graph modeling process contains not only implicit and explicit spatial subgraphs of each frame but also a cross-frame temporal dynamic subgraph; the spatial subgraphs capture region-level relationships through an implicit or explicit attention mechanism, and the temporal dynamic subgraph takes the dynamics of objects and their transformation across frames into consideration, effectively utilizing the temporal dynamic information in the video to distinguish subtle differences between object actions and further improving the network's understanding of the relationships between objects;
(3) The invention introduces a spatio-temporal locator to retrieve the spatio-temporal pipeline of objects directly from the regions. Specifically, a temporal locator is used to determine the temporal boundary of the pipeline, and a spatial locator with a dynamic selection method is then used to locate the object in each frame and generate a smooth pipeline. Question sentences and statement sentences can both be processed effectively, realizing video positioning for polymorphic sentences; a large number of experiments prove the effectiveness of the method, and it provides technical support for research combining higher-level natural language processing and computer vision (such as video question answering);
(4) The invention has broad application prospects, such as directly searching video content and classifying videos with text.
Drawings
FIG. 1 is a schematic diagram of the STGRN structure of the present invention;
fig. 2 shows the experimental results on the criteria m_tIoU and m_vIoU for statement sentences and question sentences.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, for a section of video and an input sentence, the present invention locates the object described in the sentence in each frame by using the method for solving the polymorphic sentence video positioning task with a space-time graph inference network, and generates a smooth pipeline. The specific steps are as follows:
the method comprises the steps that firstly, visual features of each frame in a video are extracted by utilizing a Faster-RCNN network aiming at a section of video, and a visual feature set of video frames is formed; and extracting K regions from each video frame to obtain region feature vectors and region frame vectors, and forming a region set of a frame level in the video.
And secondly, aiming at the query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then, obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally, further obtaining the query feature of the query statement by adopting an attention method.
Step three, establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing the video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain cross-modal regional features; and performing T-step convolution operation on the space-time region diagram through T space-time convolution layers according to the cross-modal region characteristics to finally obtain the relation sensitive region characteristics.
Step four, establishing a space-time locator which comprises a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; and then integrating the query features of the query statement and the final frame features through a spatial locator to obtain the matching score of each region in each video frame.
Step five, the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; and for the section of video processed in the step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN.
And step six, screening out the frame t and the region i corresponding to the highest matching score obtained in step five, calculating the link scores between the regions of frame t and frame t+1 by adopting a dynamic selection method, calculating the energy of the pipeline according to the link scores, and obtaining the spatio-temporal pipeline with the maximum energy by utilizing the Viterbi algorithm to complete the video positioning.
Examples
The invention establishes a large-scale spatio-temporal video positioning dataset VidSTG by adding sentence annotations on VidOR (Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xin Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In ICMR, pages 279-287. ACM, 2019.), and carries out verification on the VidSTG dataset. VidOR is the largest existing video dataset containing object relationships, containing 10,000 videos together with fine-grained annotation of the objects and relationships therein. VidOR annotates 80 object classes with dense bounding boxes and 50 relationship predicate classes (8 spatial relationships and 42 action relationships) between objects, representing each relationship as a triplet <subject, predicate, object>, each associated with a time boundary and the spatio-temporal tubes to which the subject and object belong. Appropriate triplets are selected based on VidOR, and the subject or object is described with sentences of various forms. There are many advantages to using VidOR as the underlying dataset. On the one hand, laborious annotation of bounding boxes can be avoided. On the other hand, the relationships in the triplets can simply be incorporated into the annotated sentences. For each video triplet, the subject or object is selected as the queried object, and its appearance, relationships to other objects and visual environment are then described. For the query annotations, the appearance of the queried object is ignored. A video triplet may correspond to multiple sentences.
After annotation, 4,808 video triples were obtained, corresponding to 80 query objects and 99,943 sentence descriptions. The average duration of the video is 28.01 seconds, and the average length of the object pipe is 9.68 seconds. The average number of words contained in the statement sentence and the question sentence was 11.12 and 8.98, respectively. Table 1 gives the statistics of these sentences.
TABLE 1 data set statistics on the number of statement sentences and question sentences
In a specific implementation of the invention, for the video, 5 frames per second are first sampled, and the frame count of over-long videos is down-sampled to 200 frames. A Faster R-CNN pre-trained on MSCOCO (Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755. Springer, 2014.) is then used to extract 20 region proposals (i.e., K = 20) for each frame; the region feature dimension d_r is 1024, and it is mapped to 256 before graph modeling. For the query sentence, 300-dimensional word embedding vectors are extracted using pre-trained GloVe word2vec (Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543, 2014.). As for the hyper-parameters, M is set to 5, ε is set to 0.8, θ is set to 0.2, and λ1, λ2 and λ3 are set to 1.0, 0.001 and 1.0, respectively.
The number of layers T of the space-time graph encoder is set to 2. For the temporal locator, P is set to 8 and 8 window widths [8, 16, 32, 64, 96, 128, 164, 196] are defined. The dimensions of the parameter matrices and biases are set to 256, including the parameter matrices in the explicit graph convolution layer and W_f and b_f in the temporal locator. The BiGRU network has 128-dimensional hidden states in each direction. In the training process, the Adam optimizer is applied to minimize the multi-task loss; the initial learning rate of the model is set to 0.001 and the batch size is set to 16.
The verification results are evaluated using m_tIoU, m_vIoU and vIoU@R as evaluation criteria. m_tIoU is the average temporal intersection-over-union (IoU) between the selected segment and the ground-truth segment. S_U is defined as the set of frames contained in the selected segment or the ground-truth segment, and S_I as the set of frames contained in both the selected segment and the ground-truth segment. The invention computes
vIoU = (1/|S_U|) Σ_{t∈S_I} IoU(r^t, r̂^t)
where r^t and r̂^t are respectively the selected region and the ground-truth region in the t-th frame of the video. m_vIoU is the average vIoU over samples, and vIoU@R is the ratio of samples with vIoU > R.
To verify the effectiveness of the present invention, the image-level visual grounding method Grounder (Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817-834. Springer, 2016.), and the video grounding methods STPR (Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Spatio-temporal person retrieval via natural language queries. In ICCV, pages 1453-1462, 2017.) and WSSTG (Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee K. Wong. Weakly-supervised spatio-temporally grounding natural sentence in video. In ACL, 2019.) are selected as comparison methods. Since these methods locate objects in single images or in trimmed videos, they are extended with the temporal localization methods TALL and L-Net so that they can be applied to untrimmed videos, giving the combined baselines Grounder+{·}, STPR+{·} and WSSTG+{·} compared in Tables 2 and 3.
Table 2 shows the test results for statement sentences and Table 3 the test results for question sentences, where STGRN (greedy) generates the pipeline using greedy region selection instead of the dynamic method, and Random randomly selects a time segment and a spatial region.
TABLE 2 Test results for statement sentences on the VidSTG dataset

TABLE 3 Test results for question sentences on the VidSTG dataset
As can be seen from the test results in tables 2-3:
(1) The Grounder+{·} methods locate the sentence independently in each frame, and their performance is worse than that of the STPR+{·} and WSSTG+{·} methods, which verifies that cross-frame temporal dynamics are crucial for spatio-temporal localization; moreover, the STGRN with the dynamic selection method is superior to STGRN (greedy) with the greedy method, which shows that dynamic smoothing is beneficial for generating high-quality pipelines.
(2) In terms of temporal localization, the STGRN of the present invention performs better than the frame-level localization methods TALL and L-Net, proving that spatio-temporal region modeling is effective for determining the temporal boundaries of object pipelines.
(3) For spatio-temporal localization, the STGRN of the present invention performs better on both statement sentences and question sentences than all compared methods, with or without ground-truth temporal segments, which shows that the cross-modal spatio-temporal graph reasoning of the present invention can effectively capture object relationships with spatio-temporal dynamics, and that the spatio-temporal locator can accurately retrieve objects.
Next, ablation experiments were conducted on the spatio-temporal region graph, which is a key component of STGRN. Specifically, the spatio-temporal graph includes the implicit spatial subgraph G^imp, the explicit spatial subgraph G^exp and the temporal dynamic subgraph G^tem; they are selectively discarded in this implementation to generate ablation models, and the ablation results are given in Table 4 (statement sentences and question sentences are not distinguished here). From the results in Table 4, the complete model of the present invention outperforms all ablation models, verifying that each subgraph is very helpful for spatio-temporal video localization. If only one subgraph is applied, the model using G^exp achieves the best performance, which indicates that explicit modeling is most important for capturing object relationships. Likewise, if two subgraphs are used, the model with G^exp and G^tem is superior to the other models, which suggests that spatio-temporal modeling plays a crucial role in relation understanding and high-quality video localization.
TABLE 4 ablation experimental results on VidSTG dataset
Furthermore, the number of layers T is an important hyper-parameter of the space-time graph. This example investigates the effect of T by changing its value from 1 to 5. Fig. 2 shows the experimental results on m_tIoU and m_vIoU for statement sentences and question sentences. From the results, the performance of STGRN is best when T is set to 2. A single-layer graph does not adequately capture object relationships and temporal dynamics, while too many layers may make the regions overly smooth, i.e., the features of all regions tend to become the same. The performance variation across different criteria and sentence types is essentially consistent, which illustrates that the effect of T is stable.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. The method for solving the polymorphic sentence video positioning task by using the space-time graph reasoning network is characterized by comprising the following steps of:
s1: aiming at a section of video, extracting the visual features of each frame in the video by using a Faster-RCNN network to form a visual feature set of video frames; extracting K regions from each video frame to obtain region feature vectors and region bounding-box vectors, forming a frame-level region set of the video;
s2: aiming at a query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally further obtaining the query feature of the query statement by adopting an attention method;
s3: establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing a video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain a cross-modal regional feature; then, performing T-step convolution operation on the space-time region graph through T space-time convolution layers according to the cross-modal region characteristics to finally obtain relationship sensitive region characteristics;
s4: establishing a space-time locator comprising a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; integrating the query features of the query statement and the final frame features through a space locator to obtain a matching score of each region in each video frame;
s5: the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
s6: screening out the frame t and the region i corresponding to the highest matching score obtained in step S5, calculating the link scores between the regions of frame t and frame t+1 by adopting a dynamic selection method, calculating the energy of the pipeline according to the link scores, and obtaining the spatio-temporal pipeline with the maximum energy by utilizing the Viterbi algorithm to complete the video positioning.
2. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S1 specifically comprises:
for a piece of video, K regions are extracted from each video frame with the pre-trained Faster-RCNN, giving the frame-level region set {r_i^t | i = 1, …, K; t = 1, …, N} of the video; each region r_i^t has two attributes: one is the region feature vector r_i^t ∈ R^{d_r}, the visual feature vector of the i-th region of the t-th frame in the video, where d_r denotes the dimension of the region feature vectors; the other is the region bounding-box vector b_i^t = (x_i^t, y_i^t, w_i^t, h_i^t), where x_i^t and y_i^t respectively denote the abscissa and the ordinate of the center point of the bounding box of the i-th region of the t-th frame in the video, and w_i^t and h_i^t respectively denote the width and the height of that bounding box;
in addition, the visual features of each frame in the video are extracted by the Faster-RCNN to form the visual feature set {f_t}_{t=1}^N of the video frames, where f_t denotes the visual feature of the t-th frame in the video and N denotes the number of frames of the video.
3. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S2 specifically comprises:
for a query sentence, a GloVe network is first used to obtain the word embedding vector of each word in the query sentence, and a BiGRU network is then used to obtain the word semantic features of the query sentence, forming the word semantic feature set {s_i}_{i=1}^L, s_i ∈ R^{d_s}, where s_i is the semantic feature of the i-th word, L denotes the number of words in the query sentence, and d_s denotes the dimension of the word semantic feature vectors;
from the word semantic feature set, the semantic feature s^e of the query object is selected, and the entity-sensitive feature s^a is obtained with an attention method, composing the query feature s^q: the normalized attention weights γ_i over the word semantic features are computed from s^e and s_i with two parameter matrices, the entity-sensitive feature s^a is obtained as the γ-weighted aggregation of the word semantic features, and
s^q = [s^e; s^a]
where s^a is the entity-sensitive feature, s^q is the query feature, and γ_i denotes the normalized weights.
4. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S3 specifically comprises:
establishing a space-time graph encoder comprising a video analysis layer, a cross-modal fusion layer and T spatio-temporal convolution layers, wherein the space-time graph encoder works as follows:
3.1) the video is analyzed into a space-time region graph through the video analysis layer, wherein the space-time region graph comprises three sub-graphs: an implicit spatial sub-graph G^imp = (V, E^imp) within each frame, an explicit spatial sub-graph G^exp = (V, E^exp) within each frame, and a cross-frame temporal dynamic sub-graph G^tem = (V, E^tem), where V is the vertex set of each sub-graph and all three sub-graphs take the regions of the corresponding video frames as their vertices v; E^imp, E^exp and E^tem respectively denote the edges of the implicit spatial sub-graph, the explicit spatial sub-graph and the temporal dynamic sub-graph;
3.2) the region feature vectors obtained in step S1 and the word semantic features obtained in step S2 are fused through the cross-modal fusion layer to obtain the cross-modal region features, as follows:
for the i-th region r_i^t of the t-th frame in the video, the region-sensitive text feature is calculated: the attention weights over the word semantic features s_j are computed from the similarity between r_i^t and s_j (using two parameter matrices, a bias b_m and a parameter row vector), and the region-sensitive text feature of the i-th region of the t-th frame is obtained as the attention-weighted sum of the word semantic features;
a text gate guided by the language information is established: the text gate of region r_i^t is obtained by applying a linear layer followed by the sigmoid function σ to the region-sensitive text feature, and it lies in R^{d_r}, where d_r denotes the dimension of the region feature vectors;
the gated region feature, i.e. the element-wise product ⊙ of the text gate and the region feature r_i^t, is concatenated with the region-sensitive text feature to obtain the cross-modal region feature of the i-th region of the t-th frame in the video;
3.3) each spatio-temporal convolution layer comprises a spatial graph convolution layer and a temporal graph convolution layer;
the spatial graph convolution layer works as follows:
for the cross-modal region features, implicit graph convolution is first applied on the implicit spatial sub-graph G^imp: for each region, attention weights over the regions connected to it in G^imp are computed with the parameter matrices W^imp and U^imp, and the output of the implicit spatial graph convolution layer is the weighted aggregation of the features of the connected regions;
explicit graph convolution is then applied on the explicit spatial sub-graph G^exp: the features of the regions connected to a region in G^exp are transformed with parameter matrices selected by the direction dir(i, j) of each edge (i, j) and biases selected by the label lab(i, j) of the edge, and are aggregated with the relation coefficients
α^exp = Softmax(W_r s^q + b_m)
where dir(i, j) is the direction of edge (i, j), lab(i, j) is the label of edge (i, j), W_r is a parameter matrix, b_m is a bias, α^exp is the relation coefficient vector whose entries correspond to the weights of the 51 labels, and the entry of α^exp selected by the label of edge (i, j) is the relation weight of that edge; the result is the output of the explicit spatial graph convolution layer;
the temporal graph convolution layer works as follows:
temporal graph convolution is applied on the temporal dynamic sub-graph G^tem: for each region r_i^t, semantic coefficients over its neighbours in G^tem are computed with direction-dependent parameter matrices selected by dir(i, j), and the output of the temporal graph convolution layer is the aggregation of the neighbour features transformed by the parameter matrix U^tem and weighted by these coefficients;
the outputs of the spatial graph convolution layer and the temporal graph convolution layer are combined to give the output of the l-th spatio-temporal convolution layer, and multi-step encoding by the space-time graph encoder with T spatio-temporal convolution layers yields the final relation-sensitive region features, i.e. the relation-sensitive region feature of each region of each frame in the video.
5. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the implicit spatial subgraph G^imp is constructed by fully connecting the K regions within each video frame, so that it contains K × K undirected, unlabeled edges.
6. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the explicit spatial subgraph G^exp is constructed as follows:
region triplets <r_i^t, rel, r_j^t> are extracted in each video frame as the directed, labeled edges from r_i^t to r_j^t, where r_i^t and r_j^t are the i-th and j-th regions of the t-th frame in the video and rel is the relationship predicate between them, i.e. the label of the edge;
given the feature of region i, the feature of region j and the joint feature of the union region of the two regions (the joint feature of the union region is also obtained by the Faster-RCNN), the three features are input into a classifier pre-trained on the Visual Genome dataset, and the relationship predicate between r_i^t and r_j^t is obtained by prediction.
7. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the temporal dynamics subgraph is a spatiotemporal graph inference networktemThe construction method comprises the following steps:
the connection scores between the regions of each video frame and the regions in its M adjacent forward frames and M adjacent backward frames are calculated as: Figure FDA00024159923200000514, where cos(·) is the cosine similarity of two features, IoU(·) is the intersection-over-union of two regions, ε is a balance scalar, and Figure FDA00024159923200000515 denotes the connection score between the i-th region of the t-th frame and the j-th region of the k-th frame in the video;
for Figure FDA00024159923200000516, the region Figure FDA00024159923200000517 with the highest connection score is selected from the k-th frame of the video to construct an edge, so that each region obtains 2M + 1 edges including a self-loop; the edges of the temporal dynamic subgraph are unlabeled and have three directions: forward, backward, and self-loop.
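For illustration only: a sketch of the temporal dynamic subgraph construction of claim 7, linking each region to its best-matching region in each of the M forward and M backward neighbouring frames plus a self-loop. The connection score here is read as cosine similarity plus ε times IoU; the helper box_iou, the tensor shapes, and the default values of M and eps are assumptions, and regions near the video boundary receive fewer than 2M + 1 edges in this sketch.

import torch
import torch.nn.functional as F

def box_iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2); standard intersection-over-union
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def build_temporal_edges(feats, boxes, M=2, eps=0.5):
    # feats: (T, K, D) region features, boxes: (T, K, 4) region boxes
    T, K, _ = feats.shape
    edges = []
    for t in range(T):
        for i in range(K):
            edges.append(((t, i), (t, i), 'self'))  # self-loop
            for k in range(max(0, t - M), min(T, t + M + 1)):
                if k == t:
                    continue
                # connection score: cos(feature_t_i, feature_k_j) + eps * IoU(box_t_i, box_k_j)
                scores = [F.cosine_similarity(feats[t, i], feats[k, j], dim=0)
                          + eps * box_iou(boxes[t, i].tolist(), boxes[k, j].tolist())
                          for j in range(K)]
                j_best = int(torch.stack(scores).argmax())
                direction = 'forward' if k > t else 'backward'
                edges.append(((t, i), (k, j_best), direction))  # best-linked region in frame k
    return edges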
8. The method for solving the polymorphic sentence video localization task according to claim 1, wherein the spatio-temporal localizer of step S4 comprises a temporal localizer and a spatial localizer, specifically as follows:
4.1) setting up the temporal localizer: the relation-sensitive region features are first aggregated to the frame level by an attention mechanism; for the query feature s_q, the frame-level relation-sensitive feature m_t in the video is expressed as: Figure FDA00024159923200000518, where m_t denotes the relation-sensitive feature of the t-th frame in the video, Figure FDA00024159923200000525 denotes a parameter row vector, Figure FDA00024159923200000519 denotes a parameter matrix, and b_f denotes a bias;
the set of frame-level relation-sensitive features in the video, Figure FDA00024159923200000520, is concatenated with the corresponding set of video-frame visual features, Figure FDA00024159923200000521, and another BiGRU is used to learn the final frame feature set Figure FDA00024159923200000522; next, at each frame t, multi-scale candidate clips are defined as Figure FDA00024159923200000523, where Figure FDA00024159923200000524 are the start and end boundaries of the k-th clip at the t-th frame in the video, w_k is the width of the k-th clip, and P is the number of clips; then all candidate clips are scored by a linear layer with a sigmoid function, and boundary offsets are generated at the same time, computed as follows:
C_t = σ(W_c[h_t; s_q] + b_c)
δ_t = W_o[h_t; s_q] + b_o
where Figure FDA0002415992320000061 contains the confidence scores of the P candidate clips at frame t, Figure FDA0002415992320000062 contains the offsets of the P clips, W_c and W_o are parameter matrices, b_c and b_o are biases, and σ(·) is the sigmoid function;
the temporal localizer has two losses: an alignment loss for clip selection and a regression loss for boundary adjustment; the alignment loss formula is as follows: Figure FDA0002415992320000063, where Figure FDA0002415992320000064 denotes the temporal intersection-over-union between the k-th candidate clip and the ground-truth clip at frame t, and Figure FDA0002415992320000065 denotes the k-th element of C_t, i.e., the confidence score of the k-th candidate clip at frame t;
only the boundaries of the best clip, i.e., the clip with the highest Figure FDA0002415992320000066, are adjusted; its boundaries are (s, e) and its predicted offsets are (δ_s, δ_e); first, the ground-truth offsets of the clip, Figure FDA0002415992320000068 and Figure FDA0002415992320000069, are computed according to the ground-truth boundaries Figure FDA0002415992320000067; the regression loss formula is as follows: Figure FDA00024159923200000610, where R denotes the smooth L1 function;
4.2) establishing the spatial localizer to locate the target region in each frame: the relation-sensitive region features Figure FDA00024159923200000611 are fused with the query feature s_q and the final frame feature h_t to estimate the matching score Figure FDA00024159923200000612 of each region, with the formula: Figure FDA00024159923200000613, where Figure FDA00024159923200000614 is the matching score of the i-th region of the t-th frame in the video, σ(·) is the sigmoid function, w_c is a parameter matrix, and b_c is a bias;
the spatial loss formula is as follows: Figure FDA00024159923200000615, where S_t is the set of ground-truth frames, and Figure FDA00024159923200000616 is the spatial intersection-over-union between region Figure FDA00024159923200000617 and the ground-truth region.
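For illustration only: a sketch of the two localizer heads of claim 8: attention aggregation of relation-sensitive region features to a frame-level feature m_t, a BiGRU over the concatenated frame features to obtain h_t, a sigmoid-activated linear head producing the P candidate-clip confidences and their boundary offsets, and a per-region matching-score head. The attention form, the BiGRU configuration, the concatenation choices, and all dimensions are assumptions of this sketch.

import torch
import torch.nn as nn

class SpatioTemporalLocalizer(nn.Module):
    def __init__(self, dim, num_clips):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(3 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.bigru = nn.GRU(2 * dim, dim, bidirectional=True, batch_first=True)
        self.clip_scores = nn.Linear(3 * dim, num_clips)       # C_t (after sigmoid)
        self.clip_offsets = nn.Linear(3 * dim, 2 * num_clips)  # start/end offsets per clip
        self.region_score = nn.Linear(4 * dim, 1)              # spatial matching score

    def forward(self, region_feats, frame_feats, s_q):
        # region_feats: (T, K, dim) relation-sensitive region features
        # frame_feats:  (T, dim) visual features of the video frames
        # s_q:          (dim,) query feature
        T, K, dim = region_feats.shape
        q = s_q.expand(T, K, dim)
        att_in = torch.cat([region_feats, q, frame_feats.unsqueeze(1).expand(T, K, dim)], dim=-1)
        att = torch.softmax(self.att(att_in), dim=1)            # attention over the K regions of a frame
        m = (att * region_feats).sum(dim=1)                     # (T, dim) frame-level features m_t
        h, _ = self.bigru(torch.cat([m, frame_feats], dim=-1).unsqueeze(0))
        h = h.squeeze(0)                                        # (T, 2*dim) final frame features h_t
        hs = torch.cat([h, s_q.expand(T, dim)], dim=-1)         # [h_t ; s_q]
        C = torch.sigmoid(self.clip_scores(hs))                 # (T, P) candidate-clip confidences
        delta = self.clip_offsets(hs).view(T, -1, 2)            # (T, P, 2) boundary offsets
        r = torch.cat([region_feats, h.unsqueeze(1).expand(T, K, 2 * dim), s_q.expand(T, K, dim)], dim=-1)
        match = torch.sigmoid(self.region_score(r)).squeeze(-1)  # (T, K) spatial matching scores
        return C, delta, match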
9. The method for solving the polymorphic sentence video localization task using a spatio-temporal graph inference network according to claim 8, wherein the multi-task loss function is as follows: Figure FDA00024159923200000618, where λ_1, λ_2 and λ_3 are hyper-parameters that control the balance among the three losses.
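For illustration only: a sketch of a λ-weighted multi-task objective in the spirit of claim 9. The exact alignment, regression, and spatial loss formulas in claims 8 and 9 are embedded as images and are not reproduced here; this sketch assumes an IoU-weighted binary cross-entropy for the alignment and spatial terms and a smooth L1 term for the boundary regression, which is only one plausible reading.

import torch.nn.functional as F

def multitask_loss(clip_conf, clip_iou, pred_offsets, gt_offsets,
                   match_scores, region_iou, gt_frame_mask,
                   lambda1=1.0, lambda2=1.0, lambda3=1.0):
    # clip_conf, clip_iou: (T, P) candidate-clip confidences and temporal IoU targets
    # pred_offsets, gt_offsets: offsets of the best clip
    # match_scores, region_iou: (T, K) region matching scores and spatial IoU targets
    # gt_frame_mask: (T,) boolean mask of ground-truth frames
    align = F.binary_cross_entropy(clip_conf, clip_iou)                 # assumed alignment loss
    regress = F.smooth_l1_loss(pred_offsets, gt_offsets)                # assumed regression loss
    spatial = F.binary_cross_entropy(match_scores[gt_frame_mask],
                                     region_iou[gt_frame_mask])         # assumed spatial loss
    return lambda1 * align + lambda2 * regress + lambda3 * spatial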
10. The method for solving the polymorphic sentence video localization task using a spatio-temporal graph inference network as claimed in claim 1, wherein in step S6 the linking score is computed as follows: Figure FDA0002415992320000071, where s(·) denotes the linking score, Figure FDA0002415992320000072 and Figure FDA0002415992320000073 are the matching scores of regions Figure FDA0002415992320000074 and Figure FDA0002415992320000075, θ is a balance scalar, and IoU(·) is the intersection-over-union function;
the energy is computed as follows: Figure FDA0002415992320000076, where E(·) denotes the energy, (T_e, T_s) is the temporal boundary, and Y denotes a tube; the Viterbi algorithm is used directly to obtain the region set Figure FDA0002415992320000077 that maximizes E(Y) as the final tube Y.
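For illustration only: a Viterbi-style dynamic programming sketch of the tube linking in claim 10, choosing one region per frame inside the temporal boundary so that the accumulated matching scores plus θ-weighted IoU between consecutive regions is maximal. The helper _iou, the score accumulation, and the default value of theta follow our reading of the claim and are assumptions, since the claimed formulas are images.

import numpy as np

def _iou(a, b):
    # a, b: (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def link_tube(match_scores, boxes, t_start, t_end, theta=1.0):
    # match_scores: (T, K) region matching scores, boxes: (T, K, 4) region boxes
    # returns the region index chosen in each frame of [t_start, t_end]
    K = match_scores.shape[1]
    dp = match_scores[t_start].copy()   # best accumulated energy ending in each region of the current frame
    back = []
    for t in range(t_start + 1, t_end + 1):
        prev_dp = dp
        dp = np.empty(K)
        ptr = np.empty(K, dtype=int)
        for j in range(K):
            cand = [prev_dp[i] + theta * _iou(boxes[t - 1][i], boxes[t][j]) for i in range(K)]
            ptr[j] = int(np.argmax(cand))
            dp[j] = cand[ptr[j]] + match_scores[t][j]
        back.append(ptr)
    # trace back the region indices of the maximum-energy tube Y
    tube = [int(np.argmax(dp))]
    for ptr in reversed(back):
        tube.append(int(ptr[tube[-1]]))
    return list(reversed(tube))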
CN202010191264.5A 2020-03-18 2020-03-18 Multi-form sentence video positioning method based on space-time diagram inference network Active CN111414845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010191264.5A CN111414845B (en) 2020-03-18 2020-03-18 Multi-form sentence video positioning method based on space-time diagram inference network

Publications (2)

Publication Number Publication Date
CN111414845A true CN111414845A (en) 2020-07-14
CN111414845B CN111414845B (en) 2023-06-16

Family

ID=71491198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010191264.5A Active CN111414845B (en) 2020-03-18 2020-03-18 Multi-form sentence video positioning method based on space-time diagram inference network

Country Status (1)

Country Link
CN (1) CN111414845B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140324864A1 (en) * 2013-04-12 2014-10-30 Objectvideo, Inc. Graph matching by sub-graph grouping and indexing
US20190171954A1 (en) * 2016-05-13 2019-06-06 Numenta, Inc. Inferencing and learning based on sensorimotor input data
CN110377792A (en) * 2019-06-14 2019-10-25 浙江大学 A method of task is extracted using the video clip that the cross-module type Internet solves Problem based learning
CN110704601A (en) * 2019-10-11 2020-01-17 浙江大学 Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
褚一平; 叶修梓; 张引; 张三元: "Anti-jitter video segmentation algorithm based on hierarchical MRF model" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022088238A1 (en) * 2020-10-27 2022-05-05 浙江工商大学 Progressive positioning method for text-to-video clip positioning
US11941872B2 (en) 2020-10-27 2024-03-26 Zhejiang Gongshang University Progressive localization method for text-to-video clip localization
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN113204675A (en) * 2021-07-07 2021-08-03 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN113204675B (en) * 2021-07-07 2021-09-21 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN113688296A (en) * 2021-08-10 2021-11-23 哈尔滨理工大学 Method for solving video question-answering task based on multi-mode progressive attention model

Also Published As

Publication number Publication date
CN111414845B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
Zhao et al. Jsnet: Joint instance and semantic segmentation of 3d point clouds
Yang et al. Pipeline magnetic flux leakage image detection algorithm based on multiscale SSD network
CN111414845A (en) Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network
CN111914778B (en) Video behavior positioning method based on weak supervision learning
CN109684912A (en) A kind of video presentation method and system based on information loss function
CN111666406B (en) Short text classification prediction method based on word and label combination of self-attention
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN115391570A (en) Method and device for constructing emotion knowledge graph based on aspects
CN115687760A (en) User learning interest label prediction method based on graph neural network
CN115311465A (en) Image description method based on double attention models
CN114399661A (en) Instance awareness backbone network training method
Qi et al. Dgrnet: A dual-level graph relation network for video object detection
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
CN114120367B (en) Pedestrian re-recognition method and system based on circle loss measurement under meta-learning framework
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
CN115661542A (en) Small sample target detection method based on feature relation migration
CN111858999B (en) Retrieval method and device based on segmentation difficult sample generation
Zheng Multiple-level alignment for cross-domain scene text detection
CN114004233A (en) Remote supervision named entity recognition method based on semi-training and sentence selection
Qu et al. Illation of video visual relation detection based on graph neural network
CN117237984B (en) MT leg identification method, system, medium and equipment based on label consistency
CN116150038B (en) Neuron sensitivity-based white-box test sample generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant