CN111414845A - Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network - Google Patents
- Publication number
- CN111414845A CN111414845A CN202010191264.5A CN202010191264A CN111414845A CN 111414845 A CN111414845 A CN 111414845A CN 202010191264 A CN202010191264 A CN 202010191264A CN 111414845 A CN111414845 A CN 111414845A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06F17/11 — Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method for solving the polymorphic-sentence video positioning task with a space-time graph reasoning network, belonging to the field of natural language visual positioning. The invention first parses the video into a space-time region graph, which contains not only implicit and explicit spatial subgraphs within each frame but also a cross-frame temporal dynamic subgraph. Next, textual clues are added to the space-time region graph and multi-step cross-modal graph reasoning is established; the multi-step process supports multi-order relation modeling. Thereafter, a temporal locator determines the temporal boundary of the pipeline, and a spatial locator with a dynamic selection method locates an object in each frame, yielding a smooth pipeline. The invention requires no video pruning for natural language positioning, which reduces the cost of video positioning; it can effectively process both question sentences and statement sentences, provides technical support for research combining higher-level natural language processing with computer vision (such as video question answering), and has broad application prospects.
Description
Technical Field
The invention relates to the field of natural language visual positioning, in particular to a method for solving a polymorphic sentence video positioning task by using a space-time graph reasoning network.
Background
Visual localization of natural language is a fundamental and crucial task in the field of visual understanding. Its goal is to locate the object described by a given natural language sentence in visual content, from both a temporal and a spatial perspective. In recent years, researchers have begun to focus on natural language (sentence) localization in video, including temporal localization and spatio-temporal localization. Temporal localization yields the time slice in which the object appears in the video; spatio-temporal localization further obtains, within that slice, the region where the object appears in each frame. Because of its temporal and spatial continuity, the resulting series of object regions is also called a space-time tube (spatio-temporal tube), or pipeline.
At present, implemented methods are few and strongly limited. Existing video localization methods typically extract a set of spatio-temporal pipelines from a pruned video and then identify the target pipeline that matches the sentence. However, this framework cannot accomplish Spatio-Temporal Video Grounding (STVG) for polymorphic sentences. On the one hand, the performance of such a framework depends largely on the quality of the candidate pipelines, but it is difficult to generate high-quality pipelines in advance without textual clues: a sentence may describe only a short-term state of the object within a very small segment, whereas existing pipeline pre-generation frameworks can only produce complete object pipelines in pruned video. On the other hand, these methods consider only single-pipeline modeling and ignore the relationships between objects, so they can process only traditional statement sentences, not question sentences about an unknown object. Yet object relationships are an important clue for the STVG task, especially for question sentences, which may provide only the interactions of the unknown object with other objects. Lacking explicit appearance attributes of the object, localizing a question sentence can rely only on the relationships (e.g., action relationships and spatial relationships) between the unknown object and other objects, which makes relationship modeling and cross-modal relationship reasoning essential. Therefore, existing methods cannot handle the STVG task.
In addition, existing visual graph modeling methods usually construct a spatial graph within a single image and cannot exploit the temporal dynamic information in video to distinguish subtle differences between object actions, such as opening versus closing a door. Therefore, a method is needed that solves the polymorphic-sentence video localization task: given an uncropped video and a statement or question describing an object, locate the spatio-temporal pipeline of the queried object.
Disclosure of Invention
Aiming at the defect that the prior art cannot solve the video positioning task for polymorphic sentences, the invention provides a method for solving the polymorphic-sentence video positioning task with a space-time graph inference network. The method first parses the video into a space-time region graph. Its spatial subgraphs capture region-level relationships through implicit or explicit attention mechanisms, and its temporal dynamic subgraph takes into account object dynamics and cross-frame transformation, further improving the network's understanding of the relationships between objects. Next, textual clues are added to the space-time region graph and multi-step cross-modal graph reasoning is established; the multi-step process supports multi-order relation modeling. Thereafter, the temporal boundary of the pipeline is determined with a temporal locator, and an object is located in each frame with a spatial locator using a dynamic selection method, resulting in a smooth pipeline.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for solving the polymorphic sentence video positioning task by using the space-time graph reasoning network comprises the following steps:
s1: aiming at a section of video, extracting the visual characteristics of each frame in the video by using a Faster-RCNN network to form a visual characteristic set of video frames; extracting K regions from each video frame to obtain region characteristic vectors and region frame vectors to form a region set of a frame level in the video;
s2: aiming at a query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally further obtaining the query feature of the query statement by adopting an attention method;
s3: establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing a video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain a cross-modal regional feature; then, performing T-step convolution operation on the space-time region graph through T space-time convolution layers according to the cross-modal region characteristics to finally obtain relationship sensitive region characteristics;
s4: establishing a space-time locator comprising a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; integrating the query features of the query statement and the final frame features through a space locator to obtain a matching score of each region in each video frame;
s5: the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
s6: selecting the frame t and the region i corresponding to the highest matching score obtained in step S5, calculating the link scores between the regions of frame t and frame t+1 with the dynamic selection method, computing pipeline energies from the link scores, and obtaining the space-time pipeline with the maximum energy with the Viterbi algorithm to complete video positioning.
Further, the step S1 specifically includes:
for a piece of video, K regions are extracted from each video frame with the pre-trained Faster R-CNN, giving the frame-level region set $R = \{r^t_i\}$ of the video. Each region $r^t_i$ has two attributes. One is the region feature vector $v^t_i \in \mathbb{R}^{d_r}$, the visual feature of the i-th region of the t-th frame in the video, where $d_r$ is the dimension of the region feature vectors. The other is the region bounding-box vector $b^t_i = (x^t_i, y^t_i, w^t_i, h^t_i)$, where $x^t_i$ and $y^t_i$ are the abscissa and ordinate of the center of the bounding box of the i-th region of the t-th frame, and $w^t_i$ and $h^t_i$ are the width and height of that bounding box;

in addition, the Faster R-CNN extracts the visual features of each frame in the video, forming the visual feature set $F = \{f_t\}_{t=1}^{N}$ of the video frames, where $f_t$ is the visual feature of the t-th frame and $N$ is the number of frames in the video.
Further, the step S2 specifically includes:
for a query sentence, a GloVe network first yields the word embedding vector of each word; a BiGRU network then produces the word semantic features, forming the set $S = \{s_i\}_{i=1}^{L}$, where $s_i \in \mathbb{R}^{d_s}$ is the semantic feature of the i-th word, obtained by concatenating the forward and backward hidden states of the i-th node of the BiGRU network, $L$ is the number of words in the query sentence, and $d_s$ is the dimension of the word semantic feature vectors;

from the word semantic feature set $S$, the semantic feature $s_e$ of the queried object is selected, and an entity-sensitive feature $s_a$ is obtained by attention; $s_a$ and $s_e$ together compose the query feature $s_q$. The formulas are:

$$\gamma_i = \operatorname{Softmax}_i\big(w^\top \tanh(W^{(1)} s_e + W^{(2)} s_i)\big), \qquad s_a = \sum_{i=1}^{L} \gamma_i\, s_i, \qquad s_q = [s_e; s_a]$$

where $W^{(1)}$ and $W^{(2)}$ are parameter matrices, $w^\top$ is a parameter row vector, $s_a$ is the entity-sensitive feature, $s_q$ is the query feature, and $\gamma_i$ denotes the normalized weights.
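The entity-aware attention pooling above can be sketched in plain Python. This is a toy illustration with hypothetical tiny weight matrices `W1`, `W2` and vector `w` (not the trained parameters):

```python
import math

def matvec(M, v):
    # multiply matrix M (list of rows) by vector v
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def query_feature(s_e, words, w, W1, W2):
    """gamma_i = Softmax_i(w . tanh(W1 s_e + W2 s_i));
    s_a = sum_i gamma_i s_i; s_q = [s_e; s_a] (concatenation)."""
    proj_e = matvec(W1, s_e)
    scores = []
    for s_i in words:
        h = [math.tanh(a + b) for a, b in zip(proj_e, matvec(W2, s_i))]
        scores.append(sum(wi * hi for wi, hi in zip(w, h)))
    gamma = softmax(scores)
    dim = len(words[0])
    s_a = [sum(g * s_i[d] for g, s_i in zip(gamma, words)) for d in range(dim)]
    return s_e + s_a  # concatenation [s_e; s_a]
```

With identity projections, the pooled half of `s_q` is a convex combination of the word features, weighted toward words similar to the entity feature.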
Further, the step S3 specifically includes:
establishing a space-time graph encoder, which comprises a video parsing layer, a cross-modal fusion layer and T space-time convolution layers; the working steps of the space-time graph encoder are as follows:
3.1) the video is parsed into a space-time region graph by the video parsing layer; the graph comprises three subgraphs: an implicit spatial subgraph $G^{imp} = (V, E^{imp})$ within each frame, an explicit spatial subgraph $G^{exp} = (V, E^{exp})$ within each frame, and a cross-frame temporal dynamic subgraph $G^{tem} = (V, E^{tem})$, where $V$ is the vertex set of each subgraph: all three subgraphs regard the regions in the corresponding video frames as their vertices, and $E^{imp}$, $E^{exp}$ and $E^{tem}$ denote the edges of the implicit spatial subgraph, the explicit spatial subgraph and the temporal dynamic subgraph, respectively;
3.2) the cross-modal fusion layer fuses the region feature vectors obtained in step S1 with the word semantic features obtained in step S2 to produce cross-modal region features, as follows:

for the i-th region $r^t_i$ of the t-th frame in the video, the attention weight of each word semantic feature $s_j$ is computed, and a region-sensitive text feature is then obtained:

$$\beta^t_{ij} = w^\top \tanh\big(W^{(3)} v^t_i + W^{(4)} s_j + b_m\big), \qquad \delta^t_{ij} = \operatorname{Softmax}_j\big(\beta^t_{ij}\big), \qquad q^t_i = \sum_{j=1}^{L} \delta^t_{ij}\, s_j$$

where $W^{(3)}$ and $W^{(4)}$ are parameter matrices, $b_m$ is a bias, $w^\top$ is a parameter row vector, $\beta^t_{ij}$ represents the similarity between $v^t_i$ and $s_j$, $\delta^t_{ij}$ represents the attention weight, and $q^t_i$ is the region-sensitive text feature of the i-th region of the t-th frame in the video;

a text gate guided by the language information is then established on the region-sensitive text feature to attenuate text-irrelevant regions:

$$g^t_i = \sigma\big(W^{(g)} q^t_i + b_g\big) \in \mathbb{R}^{d_r}$$

where $\sigma$ is the sigmoid function, $g^t_i$ denotes the text gate of region $r^t_i$, and $d_r$ is the dimension of the region feature vectors;

$$x^t_i = g^t_i \odot v^t_i$$

where $\odot$ is element-wise multiplication and $x^t_i$ denotes the cross-modal region feature of the i-th region of the t-th frame in the video;
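A minimal sketch of the text-gated fusion, assuming (as one plausible reading of the gate description) that the gate is computed from the region-sensitive text feature alone and applied element-wise to the visual feature; `Wg` and `bg` are hypothetical toy parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def text_gate_fuse(v, q, Wg, bg):
    """g = sigma(Wg q + bg), one gate value per visual dimension (d_r of them);
    cross-modal feature x = g (element-wise *) v, damping text-irrelevant regions."""
    gate = [sigmoid(sum(w * qj for w, qj in zip(row, q)) + b)
            for row, b in zip(Wg, bg)]
    return [g * vi for g, vi in zip(gate, v)]
```

With a strongly negative bias the gate closes and the region feature is suppressed; with a strongly positive bias the feature passes through almost unchanged.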
3.3) each space-time convolution layer comprises a spatial graph convolution layer and a temporal graph convolution layer; the spatial graph convolution layer captures the visual relationships between the regions of each frame, as follows:

for the cross-modal region features, implicit graph convolution is first applied on the implicit spatial subgraph $G^{imp}$:

$$\alpha_{ij} = \operatorname{Softmax}_j\big((W^{imp} x^t_i)^\top (U^{imp} x^t_j)\big), \qquad x^{imp}_i = \sum_{j \in \mathcal{N}^{imp}(i)} \alpha_{ij}\, U^{imp} x^t_j$$

where $\mathcal{N}^{imp}(i)$ is the set of regions connected to $r^t_i$ in $E^{imp}$, $\alpha_{ij}$ is a weight coefficient, $W^{imp}$ and $U^{imp}$ are parameter matrices, and $x^{imp}_i$ represents the output of the implicit spatial graph convolution layer;

explicit graph convolution is then applied on the explicit spatial subgraph $G^{exp}$:

$$\alpha^{exp} = \operatorname{Softmax}\big(W_r s_q + b_m\big), \qquad x^{exp}_i = \sum_{j \in \mathcal{N}^{exp}(i)} \alpha^{exp}_{lab(i,j)} \big(W_{dir(i,j)}\, x^t_j + b_{lab(i,j)}\big)$$

where $x^{exp}_i$ represents the output of the explicit spatial graph convolution layer, $dir(i,j)$ is the direction of edge $(i,j)$, $W_{dir(i,j)}$ is an optional parameter matrix, $lab(i,j)$ is the label of edge $(i,j)$, $b_{lab(i,j)}$ is an optional bias, $\mathcal{N}^{exp}(i)$ is the set of regions connected to $r^t_i$ in $E^{exp}$, $W_r$ is a parameter matrix, $b_m$ is a bias, and $\alpha^{exp}$ is the vector of relationship coefficients whose entries are the weights corresponding to the 51 labels; $\alpha^{exp}_{lab(i,j)}$ denotes the relationship weight selected by the label of edge $(i,j)$;

the temporal graph convolution layer captures the dynamics and transformation of objects across frames:

$$\beta_{ij} = \operatorname{Softmax}_j\big((W^{tem} x^t_i)^\top (U^{tem}_{dir(i,j)}\, x_j)\big), \qquad x^{tem}_i = \sum_{j \in \mathcal{N}^{tem}(i)} \beta_{ij}\, U^{tem}_{dir(i,j)}\, x_j$$

where $W^{tem}$ and $U^{tem}_{dir(i,j)}$ are parameter matrices, $dir(i,j)$ indicates the direction of edge $(i,j)$, which selects the corresponding parameter matrix, $\beta_{ij}$ is the semantic coefficient of each neighbor of region $r^t_i$, and $x^{tem}_i$ represents the output of the temporal graph convolution layer;

the outputs of the spatial graph convolution layers and the temporal graph convolution layer are combined to obtain the result $x^{(l)}_i$ of the l-th space-time convolution layer;

the space-time graph encoder with T space-time convolution layers performs multi-step encoding (the output $x^{(l)}_i$ of one layer is taken as the input of the next), yielding the final relationship-sensitive region features $\tilde{v}^t_i$, the relationship-sensitive feature of the i-th region of the t-th frame in the video.
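One implicit graph-convolution step can be illustrated with a small pure-Python sketch (toy matrices stand in for the learned $W^{imp}$, $U^{imp}$; the real layer operates on the cross-modal features of a fully connected frame):

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def implicit_graph_conv(X, W_imp, U_imp):
    """For each region i over the fully connected neighbourhood:
    alpha_ij = Softmax_j((W_imp x_i) . (U_imp x_j)); out_i = sum_j alpha_ij (U_imp x_j)."""
    proj_w = [matvec(W_imp, x) for x in X]
    proj_u = [matvec(U_imp, x) for x in X]
    dim = len(X[0])
    out = []
    for i in range(len(X)):
        scores = [sum(a * b for a, b in zip(proj_w[i], proj_u[j]))
                  for j in range(len(X))]
        mx = max(scores)
        es = [math.exp(s - mx) for s in scores]
        z = sum(es)
        alpha = [e / z for e in es]
        out.append([sum(a * pj[d] for a, pj in zip(alpha, proj_u))
                    for d in range(dim)])
    return out
```

Because the attention weights per region sum to 1, identical inputs pass through unchanged under identity projections, while distinct regions aggregate context from their most similar neighbors.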
The implicit spatial subgraph $G^{imp}$ is constructed by fully connecting the K regions within each video frame, giving $K \times K$ undirected, unlabeled edges.
The explicit spatial subgraph $G^{exp}$ is constructed as follows:

in each video frame, region triplets $\langle r^t_i, p^t_{ij}, r^t_j \rangle$ are extracted and treated as directed edges from $r^t_i$ to $r^t_j$, where $r^t_i$ and $r^t_j$ are the i-th and j-th regions of the t-th frame in the video, and $p^t_{ij}$ is the relationship predicate between them, i.e., the label of the edge;

given the feature $v^t_i$ of region i, the feature $v^t_j$ of region j, and the joint feature $v^t_{ij}$ of the union region of the two (the joint feature of the union region is also obtained by Faster R-CNN), the three features are input into a classifier pre-trained on the Visual Genome dataset, which predicts the relationship predicate between $r^t_i$ and $r^t_j$.
The temporal dynamic subgraph $G^{tem}$ is constructed as follows:

for each video frame, the connection scores between its regions and the regions in the M adjacent forward frames and M adjacent backward frames are computed:

$$c^{t,k}_{ij} = \cos\big(v^t_i, v^k_j\big) + \epsilon \cdot \operatorname{IoU}\big(b^t_i, b^k_j\big)$$

where $\cos(\cdot)$ is the cosine similarity of two features, $\operatorname{IoU}(\cdot)$ is the intersection-over-union of two regions, $\epsilon$ is a balance scalar, and $c^{t,k}_{ij}$ represents the connection score between the i-th region of the t-th frame and the j-th region of the k-th frame in the video;

for $r^t_i$, the region with the highest connection score in each such frame k of the video is selected to construct an edge, so each region obtains 2M+1 edges including a self-loop; the edges of the temporal dynamic subgraph $E^{tem}$ are unlabeled and have three directions: forward, backward and self-loop.
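The connection score above combines feature similarity and box overlap; a self-contained sketch (boxes given as center/width/height, matching the region bounding-box vector of step S1; `eps` plays the role of the balance scalar $\epsilon$):

```python
import math

def iou(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def connection_score(v_i, v_j, box_i, box_j, eps=0.8):
    """c = cos(v_i, v_j) + eps * IoU(box_i, box_j), the temporal-subgraph linking score."""
    dot = sum(x * y for x, y in zip(v_i, v_j))
    ni = math.sqrt(sum(x * x for x in v_i))
    nj = math.sqrt(sum(y * y for y in v_j))
    cos = dot / (ni * nj) if ni > 0 and nj > 0 else 0.0
    return cos + eps * iou(box_i, box_j)
```

An identical region in two adjacent frames scores $1 + \epsilon$, the maximum attainable value.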
Further, the step S4 specifically includes:
4.1) the temporal locator is established; the relationship-sensitive region features are first aggregated to the frame level through an attention mechanism. Given the query feature $s_q$, the frame-level relationship-sensitive feature $m_t$ in the video is expressed as:

$$\eta^t_i = \operatorname{Softmax}_i\big(w^\top \tanh(W^{(5)} \tilde{v}^t_i + W^{(6)} s_q + b_f)\big), \qquad m_t = \sum_{i=1}^{K} \eta^t_i\, \tilde{v}^t_i$$

where $m_t$ represents the relationship-sensitive feature of the t-th frame in the video, $w^\top$ represents a parameter row vector, $W^{(5)}$ and $W^{(6)}$ represent parameter matrices, and $b_f$ represents a bias;

the frame-level relationship-sensitive feature set $\{m_t\}_{t=1}^{N}$ of the video is concatenated with the corresponding visual feature set $\{f_t\}_{t=1}^{N}$ of the video frames, and another BiGRU is used to learn the final frame feature set $H = \{h_t\}_{t=1}^{N}$.
Next, at each frame t, multi-scale candidate clips are defined as $\{(\hat{s}^k_t, \hat{e}^k_t)\}_{k=1}^{P}$, where $(\hat{s}^k_t, \hat{e}^k_t)$ are the start and end boundaries of the k-th clip at the t-th frame in the video, $w_k$ is the width of the k-th clip, and P is the number of clips. All candidate clips are then estimated through a linear layer with a sigmoid, and boundary offsets are generated at the same time, with the calculation formulas:

$$C_t = \sigma\big(W_c [h_t; s_q] + b_c\big), \qquad \delta_t = W_o [h_t; s_q] + b_o$$

where $C_t$ contains the confidence scores corresponding to the P candidate clips at frame t, $\delta_t$ contains the boundary offsets of the P clips, $W_c$ and $W_o$ are parameter matrices, $b_c$ and $b_o$ are biases, and $\sigma(\cdot)$ is the sigmoid function;
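Candidate generation and temporal-IoU scoring can be sketched as follows (centering each clip on frame t and clamping to the video bounds is an implementation assumption here, not stated in the text):

```python
def candidate_clips(t, widths, num_frames):
    """Multi-scale candidate clips centred at frame t: (t - w/2, t + w/2),
    clamped to [0, num_frames - 1]."""
    clips = []
    for w in widths:
        s = max(0, t - w // 2)
        e = min(num_frames - 1, t + w // 2)
        clips.append((s, e))
    return clips

def t_iou(a, b):
    """Temporal intersection-over-union of two clips a = (s, e) and b = (s, e)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0
```

The tIoU of each candidate against the ground-truth clip is what supervises the confidence scores in the alignment loss.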
the temporal locator has two losses: an alignment loss for clip selection and a regression loss for boundary adjustment. The alignment loss is:

$$\mathcal{L}_{align} = -\frac{1}{NP} \sum_{t=1}^{N} \sum_{k=1}^{P} \Big( o^k_t \log c^k_t + (1 - o^k_t) \log(1 - c^k_t) \Big)$$

where $o^k_t$ represents the temporal intersection-over-union of the k-th candidate clip at frame t with the correct clip, and $c^k_t$ represents the k-th element of $C_t$, i.e., the confidence score of the k-th candidate clip at frame t;

next, the boundary of the best clip, the one with the highest $c^k_t$, is fine-tuned: its boundary is $(s, e)$ and its predicted offsets are $(\delta_s, \delta_e)$. First, from the ground-truth boundary $(\hat{s}, \hat{e})$, the target offsets $\hat{\delta}_s = \hat{s} - s$ and $\hat{\delta}_e = \hat{e} - e$ of the clip are computed; the regression loss is:

$$\mathcal{L}_{reg} = R\big(\delta_s - \hat{\delta}_s\big) + R\big(\delta_e - \hat{\delta}_e\big)$$

where R represents the smooth L1 function;
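The two temporal-locator losses can be sketched directly from the formulas above (a toy version over one frame's candidate list; `tious` are the tIoU soft targets):

```python
import math

def smooth_l1(x):
    """R(x): 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def alignment_loss(confidences, tious):
    """Cross-entropy with each candidate clip's tIoU as its soft target."""
    eps = 1e-7
    n = len(confidences)
    return -sum(o * math.log(c + eps) + (1.0 - o) * math.log(1.0 - c + eps)
                for c, o in zip(confidences, tious)) / n

def regression_loss(pred_offsets, true_offsets):
    """Smooth-L1 regression on the (start, end) boundary offsets of the best clip."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, true_offsets))
```

A perfectly confident candidate that exactly matches the ground truth contributes (near-)zero loss in both terms.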
4.2) a spatial locator is established to locate the target region in each frame. The relationship-sensitive region feature $\tilde{v}^t_i$ is fused with the query feature $s_q$ and the final frame feature $h_t$ to estimate the matching score of each region:

$$\hat{m}^t_i = \sigma\big(W_c [\tilde{v}^t_i; h_t; s_q] + b_c\big)$$

where $\hat{m}^t_i$ is the matching score of the i-th region of the t-th frame in the video, $\sigma(\cdot)$ is the sigmoid function, $W_c$ is a parameter matrix, and $b_c$ is a bias;

the spatial loss is:

$$\mathcal{L}_{space} = -\frac{1}{|S_t| K} \sum_{t \in S_t} \sum_{i=1}^{K} \Big( u^t_i \log \hat{m}^t_i + (1 - u^t_i) \log(1 - \hat{m}^t_i) \Big)$$

where $S_t$ is the set of frames within the ground-truth time interval, and $u^t_i$ is the intersection-over-union between region $r^t_i$ and the ground-truth region;
the multitask penalty function is as follows:
wherein λ is1、λ2And λ3Is a hyper-parameter that controls the balance between the three losses.
Further, the step S5 specifically includes:
the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
further, the step S6 specifically includes:
the link score is computed as:

$$s\big(r^t_i, r^{t+1}_j\big) = \hat{m}^t_i + \hat{m}^{t+1}_j + \theta \cdot \operatorname{IoU}\big(b^t_i, b^{t+1}_j\big)$$

where $s(\cdot)$ represents the link score, $\hat{m}^t_i$ and $\hat{m}^{t+1}_j$ are the matching scores of regions $r^t_i$ and $r^{t+1}_j$, $\theta$ is a balance scalar set to 0.2, and $\operatorname{IoU}(\cdot)$ is the intersection-over-union function.

The energy is computed as:

$$E(Y) = \frac{1}{T_e - T_s} \sum_{t=T_s}^{T_e - 1} s\big(r^t_{y_t}, r^{t+1}_{y_{t+1}}\big)$$

where $E(\cdot)$ represents the energy, $(T_s, T_e)$ is the time boundary, and $Y = \{y_t\}$ represents a pipeline; the Viterbi algorithm is used directly to obtain the region set that maximizes $E(Y)$ as the final pipeline Y.
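The Viterbi-style linking is a standard dynamic program; a simplified sketch that maximizes summed link plus matching scores over the localized interval (`link` stands in for the score $s(\cdot)$ above):

```python
def viterbi_tube(scores, link):
    """scores[t][i]: matching score of region i at frame t; link(t, i, j): link score
    between region i of frame t and region j of frame t+1. Returns the region-index
    sequence (one per frame) maximizing the accumulated score."""
    T, K = len(scores), len(scores[0])
    dp = [list(scores[0])]     # best accumulated score ending at each region
    back = []                  # backpointers for path recovery
    for t in range(1, T):
        row, bk = [], []
        for j in range(K):
            best_i = max(range(K), key=lambda i: dp[-1][i] + link(t - 1, i, j))
            row.append(dp[-1][best_i] + link(t - 1, best_i, j) + scores[t][j])
            bk.append(best_i)
        dp.append(row)
        back.append(bk)
    j = max(range(K), key=lambda j: dp[-1][j])
    path = [j]
    for bk in reversed(back):
        j = bk[j]
        path.append(j)
    return list(reversed(path))
```

The dynamic program is linear in the number of frames and quadratic in the regions per frame, so no candidate tubes need to be enumerated in advance.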
The invention has the following beneficial effects:
(1) the invention does not need to prune the video when positioning the natural language, can directly process the long video, and reduces the cost of video positioning;
(2) the space-time region graph obtained in the visual graph modeling process not only has implicit and explicit spatial subgraphs of each frame, but also has cross-frame time dynamic subgraphs; the spatial subgraph can obtain the relation of the region level through an implicit or explicit attention mechanism, and the temporal dynamic subgraph can take the dynamic property and the cross-frame transformation of the object into consideration, effectively utilizes the temporal dynamic information in the video to distinguish the subtle difference of the action of the object, so as to further improve the understanding of the network to the relation between the objects;
(3) the invention introduces a space-time locator to retrieve the spatio-temporal pipeline of the object directly from the regions: a temporal locator determines the temporal boundary of the pipeline, and a spatial locator with a dynamic selection method then locates the object in each frame and generates a smooth pipeline. Both question sentences and statement sentences can be processed effectively, realizing video positioning for polymorphic sentences; extensive experiments confirm the effectiveness of the method, which provides technical support for research combining higher-level natural language processing and computer vision (such as video question answering);
(4) the invention has wide application prospect, such as directly searching video content and classifying video by using characters.
Drawings
FIG. 1 is a schematic diagram of the STGRN structure of the present invention;
fig. 2 shows the experimental results under the criteria m_tIoU and m_vIoU for statement sentences and question sentences.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the present invention aims at a section of video and input sentences, and locates the objects in the sentences in each frame by using a method of solving the polymorphic sentence video location task by using a space-time graph inference network, and generates a smooth pipeline, and the specific steps are as follows:
the method comprises the steps that firstly, visual features of each frame in a video are extracted by utilizing a Faster-RCNN network aiming at a section of video, and a visual feature set of video frames is formed; and extracting K regions from each video frame to obtain region feature vectors and region frame vectors, and forming a region set of a frame level in the video.
And secondly, aiming at the query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then, obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally, further obtaining the query feature of the query statement by adopting an attention method.
Step three, establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing the video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain cross-modal regional features; and performing T-step convolution operation on the space-time region diagram through T space-time convolution layers according to the cross-modal region characteristics to finally obtain the relation sensitive region characteristics.
Step four, establishing a space-time locator which comprises a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; and then integrating the query features of the query statement and the final frame features through a spatial locator to obtain the matching score of each region in each video frame.
Step five, the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; and for the section of video processed in the step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN.
And step six, selecting the frame t and the region i corresponding to the highest matching score obtained in step five, calculating the link scores between the regions of frame t and frame t+1 with the dynamic selection method, computing pipeline energies from the link scores, and obtaining the space-time pipeline with the maximum energy with the Viterbi algorithm to complete video positioning.
Examples
The invention establishes a large-scale spatio-temporal video positioning dataset, VidSTG, by adding sentence annotations to VidOR (Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xin Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In ICMR, pages 279-287. ACM, 2019.), and performs verification on the VidSTG dataset. VidOR is the largest existing video dataset containing object relationships, comprising 10,000 videos with fine-grained annotations of the objects and the relationships among them. VidOR annotates 80 object classes with dense bounding boxes and 50 relationship predicate classes (8 spatial relationships and 42 action relationships) between objects, representing each relationship as a triplet <subject, predicate, object>, each associated with a time boundary and the space-time pipelines to which the subject and object belong. Appropriate triplets are selected from VidOR, and the subject or object is described with sentences of various forms. Using VidOR as the underlying dataset has many advantages. On the one hand, laborious bounding-box annotation is avoided. On the other hand, the relationships in the triplets can simply be incorporated into the annotated sentences. For each video triplet, the subject or object is selected as the queried object, and its appearance, relationships to other objects, and visual environment are then described. For question annotation, the appearance of the queried object is ignored. A video triplet may correspond to multiple sentences.
After annotation, 4,808 video triples were obtained, corresponding to 80 query objects and 99,943 sentence descriptions. The average duration of the videos is 28.01 seconds, and the average length of the object pipes is 9.68 seconds. The average numbers of words in statement sentences and question sentences are 11.12 and 8.98, respectively. Table 1 gives the statistics of these sentences.
TABLE 1 data set statistics on the number of statement sentences and question sentences
In a specific implementation of the invention, for video, 5 frames per second are first sampled, and overly long videos are down-sampled to 200 frames. A Faster R-CNN pre-trained on MSCOCO (Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755. Springer, 2014.) is then used to extract 20 region proposals for each frame (i.e., K = 20). The region feature dimension d_r is 1024, and the features are mapped to 256 dimensions before graph modeling. For the query sentence, 300-dimensional word embedding vectors are extracted using pre-trained GloVe word2vec (Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543, 2014.). As for the hyper-parameters, M is set to 5, ε is set to 0.8, θ is set to 0.2, λ is set to 0.2, and λ1, λ2, λ3 are set to 1.0, 0.001 and 1.0, respectively.
The number of layers T of the space-time graph encoder is set to 2. For the time locator, P is set to 8, and 8 window widths [8, 16, 32, 64, 96, 128, 164, 196] are defined. The dimensions of the parameter matrices and biases are set to 256, including those in the explicit graph convolution layers and w_f and b_f in the time locator. The BiGRU network has 128-dimensional hidden states in each direction. In the training process, the Adam optimizer is applied to minimize the multi-task loss; the initial learning rate of the model is set to 0.001 and the batch size is set to 16.
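For reference, the implementation settings above can be collected into one place. The values are those stated in this section, while the dictionary layout and key names are hypothetical:

```python
# Configuration sketch of the STGRN implementation described above.
# Values are quoted from the text; the structure and key names are hypothetical.
STGRN_CONFIG = {
    "video": {
        "sample_fps": 5,           # 5 frames per second sampled
        "max_frames": 200,         # overly long videos down-sampled to 200 frames
        "regions_per_frame": 20,   # K = 20 Faster R-CNN region proposals
        "region_feat_dim": 1024,   # d_r, mapped to 256 before graph modeling
        "graph_feat_dim": 256,
    },
    "text": {
        "word_embed_dim": 300,     # pre-trained GloVe vectors
        "bigru_hidden_per_dir": 128,
    },
    "graph": {"num_layers_T": 2, "M": 5, "epsilon": 0.8},
    "temporal_locator": {
        "P": 8,
        "window_widths": [8, 16, 32, 64, 96, 128, 164, 196],
    },
    "loss_weights": {"lambda1": 1.0, "lambda2": 0.001, "lambda3": 1.0},
    "tube": {"theta": 0.2},
    "train": {"optimizer": "Adam", "lr": 0.001, "batch_size": 16},
}
```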
The verification results are evaluated using m_tIoU, m_vIoU and vIoU@R as evaluation criteria. m_tIoU is the average temporal intersection-over-union (IoU) between the selected segment and the real segment. S_U is defined as the set of frames contained in either the selected segment or the real segment, and S_I is defined as the set of frames contained in both the selected segment and the real segment. The invention calculates vIoU = (1/|S_U|) Σ_{t∈S_I} IoU(r_t, r̂_t), where r_t and r̂_t are respectively the selected region and the real region in the t-th frame of the video. m_vIoU is the average of vIoU over samples, and vIoU@R is the ratio of samples with vIoU > R.
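A minimal sketch of these evaluation criteria, assuming integer frame indices, boxes given as (x1, y1, x2, y2), and per-frame boxes stored in dicts keyed by frame index:

```python
def box_iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two [start, end] segments (building block of m_tIoU)."""
    inter = min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0])
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return max(0.0, inter) / union if union > 0 else 0.0

def viou(pred_seg, gt_seg, pred_boxes, gt_boxes):
    """vIoU = (1/|S_U|) * sum over S_I of per-frame box IoU."""
    frames_p = set(range(pred_seg[0], pred_seg[1] + 1))
    frames_g = set(range(gt_seg[0], gt_seg[1] + 1))
    s_u, s_i = frames_p | frames_g, frames_p & frames_g
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in s_i)
    return total / len(s_u) if s_u else 0.0

def viou_at_r(vious, r):
    """vIoU@R: fraction of samples whose vIoU exceeds R."""
    return sum(1 for v in vious if v > r) / len(vious)
```

m_vIoU is then simply the mean of `viou` over all test samples.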
To verify the effectiveness of the present invention, comparisons were made with the image visual localization method GroundeR (Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817-834. Springer, 2016.), and the video localization methods STPR (Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Spatio-temporal person retrieval via natural language queries. In ICCV, pages 1453-1462, 2017.) and WSSTG (Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee K. Wong. Weakly-supervised spatio-temporal grounding of natural sentence in video. In ACL, 2019.). Since these original localization methods all require a real time segment or a trimmed video, they were extended with the frame-level temporal localization methods, which first determine the time segment of the object pipeline and then apply the original localization method within that segment to generate a space-time pipeline for comparison.
Table 2 gives the test results for statement sentences and Table 3 the test results for question sentences, where STGRN (greedy) generates the pipeline using greedy region selection instead of the dynamic method, Random randomly selects a time segment and a spatial region, and tem.
TABLE 2
TABLE 3
As can be seen from the test results in tables 2-3:
(1) The Grounder+{·} methods locate the sentence independently in each frame, and their performance is worse than that of the STPR+{·} and WSSTG+{·} methods, which verifies that cross-frame temporal dynamics are crucial to space-time localization; moreover, the STGRN adopting the dynamic selection method is superior to STGRN (greedy) adopting the greedy method, which shows that dynamic smoothing is conducive to generating high-quality pipelines.
(2) The STGRN of the present invention has better temporal localization performance than the frame-level localization methods TALL and L-Net, proving that space-time region modeling is effective for determining the temporal boundaries of object pipelines.
(3) For space-time positioning, the STGRN of the present invention has better performance on both statement sentences and question sentences than all control groups with or without real time segments, which shows that the cross-modal space-time graph inference of the present invention can effectively obtain object relationships with space-time dynamics, and the space-time positioner can accurately retrieve objects.
Next, ablation experiments were conducted on the space-time region graph, which is a key component of STGRN. In particular, the space-time graph includes the implicit spatial subgraph G_imp, the explicit spatial subgraph G_exp, and the temporal dynamic subgraph G_tem; they are selectively discarded in this implementation to generate ablation models, and the ablation results are given in Table 4 (statement sentences and question sentences are not distinguished here). From the results in Table 4, the complete model of the present invention outperforms all ablation models, verifying that each subgraph is very helpful for space-time video localization. If only one subgraph is applied, the model with G_exp achieves the best performance, which indicates that explicit modeling is most important for capturing object relationships. Likewise, if two subgraphs are used, the model with G_exp and G_tem is superior to the other models, which suggests that space-time modeling plays a crucial role in relation understanding and high-quality video localization.
TABLE 4 ablation experimental results on VidSTG dataset
Furthermore, the number of layers T is an important hyper-parameter of the space-time graph. This example investigated the effect of T by varying its value from 1 to 5. Fig. 2 shows the experimental results of m_tIoU and m_vIoU for statement sentences and question sentences. From the results, the performance of STGRN is best when T is set to 2: a single-layer graph cannot adequately capture object relationships and temporal dynamics, while too many layers may cause the regions to be overly smoothed, i.e., every region tends to have the same features. The performance variation across different criteria and sentence types is essentially consistent, which illustrates that the effect of T is stable.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.
Claims (10)
1. The method for solving the polymorphic sentence video positioning task by using the space-time graph reasoning network is characterized by comprising the following steps of:
s1: aiming at a section of video, extracting the visual characteristics of each frame in the video by using a Faster-RCNN network to form a visual characteristic set of video frames; extracting K regions from each video frame to obtain region characteristic vectors and region frame vectors to form a region set of a frame level in the video;
s2: aiming at a query statement, firstly, obtaining a word embedding vector of each word in the query statement by adopting a GloVe network, then obtaining a word semantic feature set of the query statement by adopting a BiGRU network, and finally further obtaining the query feature of the query statement by adopting an attention method;
s3: establishing a space-time graph encoder which comprises a video analysis layer, a cross-modal fusion layer and T space-time convolution layers, firstly analyzing a video into a space-time regional graph through the video analysis layer, and then fusing the regional feature vector obtained in the step S1 and the word semantic feature obtained in the step S2 through the cross-modal fusion layer to obtain a cross-modal regional feature; then, performing T-step convolution operation on the space-time region graph through T space-time convolution layers according to the cross-modal region characteristics to finally obtain relationship sensitive region characteristics;
s4: establishing a space-time locator comprising a time locator and a space locator; for the relation sensitive region characteristics in the video, firstly aggregating the relation sensitive region characteristics to a frame level through a time locator to obtain the relation sensitive characteristics of the frame level in the video, and connecting the relation sensitive characteristics with the visual characteristic set of the video frame to obtain a final frame characteristic set; defining a multi-scale candidate clip set at each frame, and learning to obtain an optimal clip boundary; integrating the query features of the query statement and the final frame features through a space locator to obtain a matching score of each region in each video frame;
s5: the GloVe network, the BiGRU network, the space-time diagram encoder and the space-time locator form an STGRN, a multi-task loss is designed, and the STGRN is trained in an end-to-end mode; for the section of video processed in step S1 and the query sentence to be processed, obtaining a matching score of each region in each video frame through the trained STGRN;
s6: and (5) screening the frame t and the area i corresponding to the highest matching score obtained in the step (S5), calculating the link score between the areas of the frame t and the frame t +1 by adopting a dynamic selection method, calculating the energy of the pipeline according to the link score, and obtaining the space-time pipeline with the maximum energy by utilizing a Vitervi algorithm to complete video positioning.
2. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S1 specifically comprises:
for a piece of video, K regions are extracted from each video frame using the pre-trained Faster-RCNN to obtain the frame-level region set of the video. Each region r_i^t has two attributes: one is the region feature vector v_i^t, the visual feature vector of the i-th region of the t-th frame in the video, where d_r denotes the dimension of the region feature vector; the other is the region bounding-box vector b_i^t = (x_i^t, y_i^t, w_i^t, h_i^t), where x_i^t and y_i^t respectively denote the abscissa and ordinate of the center point of the bounding box of the i-th region of the t-th frame in the video, and w_i^t and h_i^t respectively denote the width and height of that bounding box;
3. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S2 specifically comprises:
aiming at a query sentence, a GloVe network is first adopted to obtain the word embedding vector of each word, and a BiGRU network is then adopted to obtain the word semantic features of the query sentence, forming the word semantic feature set, where s_i is the semantic feature of the i-th word, L denotes the number of words in the query sentence, and d_s denotes the dimension of the word semantic feature vectors;
the semantic feature s_e of the query object is selected from the word semantic feature set, the entity-sensitive feature s_a is obtained by the attention method, and the two form the query feature s_q, with the following formula:
s_q = [s_e; s_a]
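For illustration only (not part of the claim), the attention step can be sketched in plain Python; the scoring vector `score_w` is a hypothetical stand-in for the learned attention parameters, which the claim does not name:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def query_feature(word_feats, entity_index, score_w):
    """Sketch of s_q = [s_e; s_a]: s_e is the feature of the query object's
    word; s_a is an attention-weighted sum of all word features, with
    scores from the (hypothetical) learned vector score_w."""
    s_e = word_feats[entity_index]
    scores = [sum(w * f for w, f in zip(score_w, feat)) for feat in word_feats]
    alphas = softmax(scores)
    dim = len(word_feats[0])
    s_a = [sum(a * feat[d] for a, feat in zip(alphas, word_feats))
           for d in range(dim)]
    return s_e + s_a  # list concatenation stands in for [s_e; s_a]
```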
4. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the step S3 specifically comprises:
establishing a space-time graph encoder, which comprises a video analysis layer, a cross-mode fusion layer and T space-time convolution layers, wherein the working steps of the space-time graph encoder are as follows:
3.1) the video is parsed into a space-time region graph through the video parsing layer, and the space-time region graph comprises three subgraphs: the implicit spatial subgraph G_imp in each frame, the explicit spatial subgraph G_exp in each frame, and the cross-frame temporal dynamic subgraph G_tem, where V is the vertex set of each subgraph and all three subgraphs regard the regions in each corresponding video frame as vertices v; E_imp, E_exp and E_tem respectively denote the edges of the implicit spatial subgraph, the explicit spatial subgraph and the temporal dynamic subgraph;
3.2) fusing the region feature vector obtained in the step S1 and the word semantic features obtained in the step S2 through a cross-modal fusion layer to obtain cross-modal region features, which are specifically as follows:
for each region r_i^t, the region-sensitive textual feature is calculated with the following formula:
where the two parameter matrices and the bias b_m are learned, w is a parameter row vector, β_ij^t represents the similarity between region r_i^t and word s_j, α_ij^t represents the attention weight, and q_i^t is the region-sensitive textual feature of the i-th region of the t-th frame in the video;
a gate guided by the language information is established, with the following formula:
where σ is the sigmoid function, g_i^t denotes the gate of region r_i^t, and d_r denotes the dimension of the region feature vector;
where ⊙ denotes element-wise multiplication, and x_i^t denotes the cross-modal region feature of the i-th region of the t-th frame in the video;
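A sketch of the fusion in step 3.2): a sigmoid gate computed from the region-sensitive textual feature modulates the region feature element-wise. The final concatenation with the textual feature and the parameter names `gate_w`, `gate_b` are assumptions, since the exact formulas are not reproduced in the text:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_modal_fuse(region_feat, text_feat, gate_w, gate_b):
    """Sketch of the language-guided gate: a sigmoid gate computed from
    the region-sensitive textual feature modulates the region feature
    element-wise (⊙); gate_w and gate_b are hypothetical placeholders
    for the learned gate parameters."""
    gate = [sigmoid(sum(w * t for w, t in zip(row, text_feat)) + b)
            for row, b in zip(gate_w, gate_b)]
    gated = [g * r for g, r in zip(gate, region_feat)]   # element-wise ⊙
    return gated + text_feat  # concatenation assumed for the fused feature
```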
3.3) each space-time convolution layer comprises a space map convolution layer and a time map convolution layer;
The working steps of the spatial graph convolution layer are as follows:
For the cross-modal region features, implicit graph convolution is first adopted on the implicit spatial subgraph G_imp, with the following formula:
where N_imp(i) denotes the set of regions connected to region r_i^t in G_imp, a_ij is a weight parameter, W_imp and U_imp are parameter matrices, and the output of the implicit spatial graph convolution layer is thereby obtained;
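One implicit graph convolution step can be sketched as weighted message passing over the fully connected regions of a frame; the residual combination (self feature plus weighted neighbor sum) is an assumption standing in for the W_imp/U_imp transforms, which the claim names but does not spell out:

```python
def implicit_graph_conv(feats, weights):
    """One implicit spatial graph convolution step (sketch): each region
    aggregates all regions in the same frame with pairwise weights
    a_ij, then combines the message with its own feature."""
    K = len(feats)
    dim = len(feats[0])
    out = []
    for i in range(K):
        msg = [0.0] * dim
        for j in range(K):
            for d in range(dim):
                msg[d] += weights[i][j] * feats[j][d]
        # residual combination of self feature and aggregated message
        out.append([feats[i][d] + msg[d] for d in range(dim)])
    return out
```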
explicit graph convolution is then adopted on the explicit spatial subgraph G_exp, with the following formula:
α_exp = Softmax(w_r s_q + b_m)
where the output of the explicit spatial graph convolution layer is obtained, dir(i, j) is the direction of edge (i, j), W_dir(i,j) is an optional (direction-dependent) parameter matrix, lab(i, j) is the label of edge (i, j), b_lab(i,j) is an optional (label-dependent) bias, N_exp(i) is the set of regions connected to region r_i^t in G_exp, w_r is a parameter matrix, b_m is a bias, α_exp is the relation coefficient vector holding the weights corresponding to the 51 labels, and α_lab(i,j) denotes the relation weight selected by the label of edge (i, j);
The working steps of the temporal graph convolution layer are as follows:
where W_dir(i,j) and U_tem are parameter matrices, dir(i, j) indicates the direction used to select the corresponding parameter matrix, s_ij is the semantic coefficient of each neighbor of region r_i^t, and the output of the temporal graph convolution layer is thereby obtained;
combining the outputs of the space-time convolution layer and the time-map convolution layer to obtain a result for the first space-time convolution layer
5. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the construction method of the implicit spatial subgraph G_imp is to fully connect the K regions in each video frame, containing K × K undirected, unlabeled edges.
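The claim above can be illustrated with a minimal sketch: fully connecting the K regions of one frame produces the K × K edge set. Whether self-loops are counted is not specified; they are included here so that the count matches K × K:

```python
def implicit_edges(K):
    """Edge set of the implicit spatial subgraph for one frame:
    every ordered region pair, K * K edges in total (treated as
    undirected and unlabeled by the model)."""
    return [(i, j) for i in range(K) for j in range(K)]
```

With the implementation's K = 20 proposals per frame, this yields 400 edges per frame.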
6. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the construction method of the explicit spatial subgraph G_exp comprises the following steps:
region triples <r_i^t, p_ij^t, r_j^t> are extracted in each video frame as the edges of G_exp, where r_i^t and r_j^t are the i-th region and the j-th region of the t-th frame in the video, and p_ij^t is the relationship predicate between them, i.e., the label of the edge;
given the feature of region i, the feature of region j, and the joint feature of the union region of the two regions (the joint feature is also obtained by Faster-RCNN), the three features are input into a classifier pre-trained on the Visual Genome dataset, and the relationship predicate between regions r_i^t and r_j^t is obtained by prediction.
7. The method for solving the task of polymorphic sentence video localization as recited in claim 4, wherein the construction method of the temporal dynamic subgraph G_tem comprises the following steps:
calculating the connection scores between each video frame and the regions in the M adjacent forward frames and M adjacent backward frames:
where cos(·) is the cosine similarity of two features, IoU(·) is the intersection-over-union of two regions, ε is a balance scalar, and the result is the connection score between the i-th region of frame t and the j-th region of frame k in the video;
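The connection score can be sketched as follows; since the exact combining formula is not reproduced in the text, a weighted sum of feature cosine similarity and box IoU balanced by ε is assumed here:

```python
def box_iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def connection_score(feat_a, feat_b, box_a, box_b, eps=0.8):
    """Cross-frame connection score between two regions: cosine
    similarity of their features balanced with box IoU by eps
    (the weighted-sum combination is an assumption)."""
    dot = sum(x * y for x, y in zip(feat_a, feat_b))
    na = sum(x * x for x in feat_a) ** 0.5
    nb = sum(x * x for x in feat_b) ** 0.5
    cos = dot / (na * nb + 1e-8)
    return eps * cos + (1 - eps) * box_iou(box_a, box_b)
```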
8. The method for solving the task of polymorphic sentence video localization according to claim 1, wherein the spatio-temporal locator of step S4 comprises a temporal locator and a spatial locator, specifically as follows:
4.1) the time locator is established: the relationship-sensitive region features are first aggregated to the frame level through the attention mechanism; for the query feature s_q, the frame-level relationship-sensitive feature m_t in the video is expressed as:
where m_t represents the relationship-sensitive feature of the t-th frame in the video, w_f is a parameter row vector, W_f represents a parameter matrix, and b_f represents a bias;
the frame-level relationship-sensitive features of the video are then concatenated with the corresponding visual features of the video frames, and another BiGRU is used to learn the final frame feature set;
Next, at each frame t, multi-scale candidate clips are defined, where (s_k^t, e_k^t) are the start and end boundaries of the k-th clip at the t-th frame in the video, w_k is the width of the k-th clip, and P is the number of clips; all candidate clips are then evaluated through a linear layer with the sigmoid function, and boundary offsets are generated at the same time, with the following calculation formulas:
C_t = σ(W_c[h_t; s_q] + b_c)
O_t = W_o[h_t; s_q] + b_o
wherein,the confidence scores corresponding to the P candidate clips at frame t,is the offset of P segments, WcAnd WoIs a parameter matrix, bcAnd boIs the offset, σ (·) is the sigmoid function;
the time locator described has two losses: clipping selected alignment loss and boundary adjusted regression loss; the alignment loss formula is as follows:
where the temporal intersection-over-union of the k-th candidate clip with the ground-truth clip at frame t is used, and C_t^k denotes the k-th element of C_t, i.e., the confidence score of the k-th candidate clip at frame t;
the boundary (s, e) of the best clip with the highest confidence score is then adjusted with its offsets (δ_s, δ_e); first, the ground-truth offsets of the clip are calculated according to the real boundary, and the regression loss formula is as follows:
where R represents the smooth L1 function;
4.2) the spatial locator is established to locate the target region in each frame; the relationship-sensitive region features are fused with the query feature s_q and the final frame feature h_t to estimate the matching score of each region, with the following formula:
where the result is the matching score of the i-th region of the t-th frame in the video, σ(·) is the sigmoid function, w_c is a parameter matrix, and b_c is a bias;
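A sketch of this scoring step: the region feature, final frame feature, and query feature are concatenated and scored by a sigmoid linear layer. The parameters `w` and `b` are hypothetical stand-ins for the learned w_c and b_c:

```python
import math

def matching_score(region_feat, frame_feat, query_feat, w, b):
    """Per-region matching score: concatenate the region feature with
    the final frame feature and the query feature, then apply a
    sigmoid linear layer (w, b are hypothetical learned parameters)."""
    fused = region_feat + frame_feat + query_feat  # list concatenation
    z = sum(wi * xi for wi, xi in zip(w, fused)) + b
    return 1.0 / (1.0 + math.exp(-z))
```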
the space loss formula is as follows:
10. The method for solving the task of video localization in polymorphic sentences using spatio-temporal graph inference network as claimed in claim 1, wherein in said step S6, the formula for calculating the link score is as follows:
where s(·) represents the link score, the two matching-score terms are those of region i in frame t and region j in frame t+1, θ is a balance scalar, and IoU(·) is the intersection-over-union function;
the energy is calculated as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010191264.5A CN111414845B (en) | 2020-03-18 | 2020-03-18 | Multi-form sentence video positioning method based on space-time diagram inference network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111414845A true CN111414845A (en) | 2020-07-14 |
CN111414845B CN111414845B (en) | 2023-06-16 |
Family
ID=71491198
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111414845B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140324864A1 (en) * | 2013-04-12 | 2014-10-30 | Objectvideo, Inc. | Graph matching by sub-graph grouping and indexing |
US20190171954A1 (en) * | 2016-05-13 | 2019-06-06 | Numenta, Inc. | Inferencing and learning based on sensorimotor input data |
CN110377792A (en) * | 2019-06-14 | 2019-10-25 | 浙江大学 | A method of task is extracted using the video clip that the cross-module type Internet solves Problem based learning |
CN110704601A (en) * | 2019-10-11 | 2020-01-17 | 浙江大学 | Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network |
Non-Patent Citations (1)
Title |
---|
CHU Yiping; YE Xiuzi; ZHANG Yin; ZHANG Sanyuan: "Anti-jitter video segmentation algorithm based on hierarchical MRF model" *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022088238A1 (en) * | 2020-10-27 | 2022-05-05 | 浙江工商大学 | Progressive positioning method for text-to-video clip positioning |
US11941872B2 (en) | 2020-10-27 | 2024-03-26 | Zhejiang Gongshang University | Progressive localization method for text-to-video clip localization |
CN112597278A (en) * | 2020-12-25 | 2021-04-02 | 北京知因智慧科技有限公司 | Semantic information fusion method and device, electronic equipment and storage medium |
CN113204675A (en) * | 2021-07-07 | 2021-08-03 | 成都考拉悠然科技有限公司 | Cross-modal video time retrieval method based on cross-modal object inference network |
CN113204675B (en) * | 2021-07-07 | 2021-09-21 | 成都考拉悠然科技有限公司 | Cross-modal video time retrieval method based on cross-modal object inference network |
CN113688296A (en) * | 2021-08-10 | 2021-11-23 | 哈尔滨理工大学 | Method for solving video question-answering task based on multi-mode progressive attention model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||