CN112347965A - Video relation detection method and system based on space-time diagram - Google Patents
- Publication number
- CN112347965A (application number CN202011280036.1A)
- Authority
- CN
- China
- Prior art keywords
- entity
- video
- convolution network
- segment
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a video relation detection method and system based on a spatio-temporal graph. First, the entities in a video and their relations are modeled as a fully connected spatio-temporal graph whose nodes are the entities within a neighborhood along the temporal and spatial dimensions. For relation detection, the invention provides a video relation detection graph convolutional network (VRD-GCN) that aggregates contextual information and performs reasoning in the spatio-temporal graph. On one hand, VRD-GCN detects dynamic relations between entities by capturing relative changes in their geometry and appearance along the spatio-temporal dimension. On the other hand, by passing messages from neighboring nodes and context in the spatio-temporal graph to the target entity, VRD-GCN produces more accurate and complete detection results. After relation instances are detected in each video segment, the short-term relation instances are merged across the whole video by an online association method using a siamese (twin) network. The method achieves high accuracy in video relation detection.
Description
Technical Field
The invention relates to video relation detection, spatio-temporal graph convolutional neural networks, and siamese (twin) association networks in machine learning and computer vision, and in particular to a video relation (visual relation) detection method and system based on a spatio-temporal graph.
Background
Understanding visual information is a primary goal of computer vision. Relation detection in visual content requires capturing fine-grained visual cues, including locating entities and the ways they interact, which is a challenging but meaningful task. Although the relations between objects in video are an important component of deep understanding of dynamic visual content, relation detection and reasoning in video has rarely been studied. Successful attempts at video relation detection will not only help build more effective models for high-level visual understanding tasks (e.g., visual question answering and video captioning), but will also promote the development of other areas of computer vision, such as video retrieval, video action detection, and video activity recognition.
A number of recent studies have achieved exciting and important results in static-image relation detection. A natural solution for relation detection in video is to extend these methods directly to video. However, satisfactory results cannot be obtained this way, due to the inherent differences between images and video. Methods designed for static-image relation detection and reasoning tend to ignore the dynamic interactions between entities that always occur in video. Given the nature of video, a relation detection and inference solution for video should be able to capture dynamic, time-varying relations between entities. The paper "Video Visual Relation Detection" by Xindi Shang et al. is, to date, the only attempt focused on detecting relations in video; however, its performance is limited, in part because it lacks the ability to gather cues from the surrounding context.
To address these problems, the invention provides a video relation detection method based on a spatio-temporal graph. Unlike the aforementioned methods, the present method performs video relation prediction through message passing between entities. In addition, scene changes or trajectory drift can make it impossible to decide, from geometric overlap alone, whether two trajectories in consecutive segments belong to the same entity. The invention therefore provides a novel online association method using a twin network, which considers both appearance similarity and the geometric overlap of relation instances, greatly improving accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a video relation detection method and system based on a space-time diagram.
The invention firstly discloses a video relation detection method based on a space-time diagram, which comprises the following steps:
1) acquiring the frame-level entity features of each video segment and the entity trajectory features of the segment;
2) concatenating (splicing) the entity features and entity trajectory features of the previous and current segments, and of the current and next segments, respectively, and feeding the two concatenations as two branches into a fully connected spatio-temporal graph convolutional network module; extracting the feature map of the entities in the current segment from the outputs of the two branches by element-wise addition;
3) obtaining a vector for predicting entity classification and a vector for predicting the predicate distribution;
4) multiplying each entity-classification vector with the predicate-distribution vector and, for each video segment, taking the L highest-scoring relation instances in the products as input to the association module with the twin network; merging short-term relation instances across the whole video with the online association method of the twin network, and obtaining association confidence scores;
5) sorting the detection results in descending order of confidence score to obtain the video relation detection result.
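The scoring in step 4) can be sketched as follows: every candidate <subject, predicate, object> triplet is scored by multiplying the two entity-classification scores with the predicate score, and the L best instances are kept. This is a minimal, illustrative Python sketch; the function name `top_l_relations`, the `Vp_pairs` structure, and the toy class/predicate lists are assumptions, not from the patent:

```python
# Illustrative sketch of step 4's scoring: rank <subject, predicate, object>
# triplets by the product of the two entity-class scores and the predicate score.

def top_l_relations(Vo, Vp_pairs, classes, predicates, L=3):
    """Vo[i]: class distribution of entity i; Vp_pairs[(i, j)]: predicate
    distribution for subject i and object j. Returns the L best triplets."""
    cands = []
    for (i, j), vp in Vp_pairs.items():
        c_s, s_cls = max((p, c) for c, p in enumerate(Vo[i]))  # best subject class
        c_o, o_cls = max((p, c) for c, p in enumerate(Vo[j]))  # best object class
        for k, p_pred in enumerate(vp):
            score = c_s * p_pred * c_o  # entity scores x predicate score
            cands.append((score, (classes[s_cls], predicates[k], classes[o_cls])))
    cands.sort(key=lambda t: t[0], reverse=True)
    return cands[:L]

classes = ["dog", "person"]
predicates = ["chase", "watch"]
Vo = [[0.9, 0.1], [0.2, 0.8]]                       # entity 0: dog, entity 1: person
Vp_pairs = {(0, 1): [0.7, 0.3], (1, 0): [0.4, 0.6]}
best = top_l_relations(Vo, Vp_pairs, classes, predicates, L=2)
```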
Preferably, the processing procedure of the association module with the twin network is as follows:
4.1) feature vectors from any two trajectories in two adjacent segments are input into the twin network, an embedding network composed of three linear transformation layers; the appearance-similarity confidence α of the two entities is then computed with a cosine similarity function:

α = cos(emb(fi), emb(fj)) (8)

where emb() denotes the embedding network, cos() the cosine similarity function, Ti and Tj are any two trajectories in two adjacent segments, and fi and fj are the features of trajectories Ti and Tj, respectively;
4.2) to consider the geometric and appearance information simultaneously, the vIoU value and the confidence α are each multiplied by a corresponding weight and then summed to obtain the final association confidence score s:

s = w1·vIoU + w2·α (9)

where w1 and w2 are the weights of the geometric and appearance terms;
4.3) the set of all short-term relation instances in the segment at the current time T is S_T = {(ct, <subject, predicate, object>, (Ts, To))}, where ct is the confidence score of a short-term instance, obtained as the product of the predicted entity-classification vectors Vo and the predicted predicate-distribution vector Vp; <subject, predicate, object> is the triplet of the short-term instance; and Ts and To are the trajectories of the entities corresponding to its subject and object, respectively. The set of all long-term relation instances already detected in the segments before time T is A = {(c, <s, p, o>, (Ts, To))}, where c is the confidence score of a long-term instance, <s, p, o> its <subject, predicate, object> triplet, and Ts and To the trajectories of its subject and object entities. The sets S_T and A are sorted in descending order of ct and c;
a two-level loop is then performed, the outer loop traversing one set and the inner loop the other. For each short-term instance in S_T and each long-term instance in A, the association confidence scores of their subject trajectories and of their object trajectories are computed according to equations (8)-(9); only when the triplets of the short-term and long-term instance are identical and both association confidence scores exceed a threshold γ are the two instances merged. For a long-term relation instance p spanning the m-th to the n-th segment, its confidence score cp is updated with the highest score among all short-term instances in p, as follows:
cp = max(ct) (t ∈ [m, n]).
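The association confidence of steps 4.1)-4.2) can be sketched as below: the cosine similarity of embedded trajectory features gives the appearance term α, which is combined with the vIoU by a weighted sum. The learned three-layer embedding network is stood in for by an arbitrary callable (here the identity), and the weights w_geo = w_app = 0.5 are illustrative assumptions:

```python
import math

def cosine(u, v):
    # cosine similarity of two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assoc_score(feat_a, feat_b, viou, emb, w_geo=0.5, w_app=0.5):
    alpha = cosine(emb(feat_a), emb(feat_b))  # appearance similarity
    return w_geo * viou + w_app * alpha       # weighted combination with vIoU

emb_identity = lambda f: f  # stand-in for the learned 3-layer embedding network
same = assoc_score([1.0, 0.0], [1.0, 0.0], viou=0.8, emb=emb_identity)
diff = assoc_score([1.0, 0.0], [0.0, 1.0], viou=0.8, emb=emb_identity)
```

Identical features yield α = 1, orthogonal ones α = 0, so the geometric term alone decides in the latter case.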
the invention also discloses a video relation detection system based on the space-time diagram, which comprises the following steps:
the characteristic extraction module is used for acquiring the frame-level entity characteristics of the video clips and connecting the frame-level entity frames in each clip to generate entity track characteristics;
the characteristic splicing module is used for respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and the spliced entity characteristics and the entity track characteristics are used as two branch inputs of the fully-connected space-time graph convolution network module;
the fully connected spatio-temporal graph convolutional network module has two branches and comprises a plurality of spatio-temporal graph convolution modules; each spatio-temporal graph convolution module consists of a geometric graph convolution network and an appearance graph convolution network; the entity features input to the module are added to the outputs of its geometric graph convolution network and its appearance graph convolution network to obtain the module's output, which, after ReLU activation and normalization, serves as the input of the next spatio-temporal graph convolution module;
the characteristic graph extraction module extracts the characteristic graph of the entity in the current segment from the output of the two branches of the fully-connected space-time graph convolution network module in an element addition mode;
a first feature vector generation unit for obtaining a vector of the predicted entity classification;
a second feature vector generation unit configured to obtain a vector of the predicted predicate distribution;
the relation instance module multiplies each vector for predicting entity classification by the vector of prediction predicate distribution, and for each video segment, the L relation instances with the highest scores in the multiplication results are taken as the input of the association module with the twin network;
the association module with the twin network combines the short-term relationship examples in the whole video by using an online association method of the twin network; acquiring an association confidence score;
and the detection result output module is used for arranging the detection results of the association module with the twin network in a descending order according to the confidence score and outputting the video relation detection result.
Because the association method with the twin network is adopted, the invention overcomes the problem that the greedy association algorithm of the prior art uses only geometric information and yields inaccurate results when trajectory generation is inaccurate or trajectory drift occurs; the accuracy of trajectory association is thereby effectively improved and the performance of the association algorithm enhanced. In addition, the spatio-temporal-graph-based video relation detection model VRD-GCN abstracts a video into a fully connected spatio-temporal graph and passes messages and performs reasoning within it; the method is novel and achieves excellent video relation detection results.
Drawings
FIG. 1 shows sample VidVRD video visual relation data;
FIG. 2 is a graph of accuracy over training epoch using VRD-GCN on VidVRD datasets;
FIG. 3 is an algorithm iteration convergence curve;
FIG. 4 is a comparison of VRD-GCN video relationship detection results with reference results.
FIG. 5 is a flow chart of a method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 5, the video relationship detection method based on the space-time diagram of the present invention includes the following steps:
1) acquiring the frame-level entity features of each video segment and the entity trajectory features of the segment;
the video is divided into a plurality of segments, each comprising a plurality of frames; for each segment, entity detection boxes are generated on each frame and entity features are extracted, and the frame-level entity boxes in each segment are linked to generate entity trajectory features; the generated entity trajectories are sorted in descending order by vIoU value, and the first N trajectories are taken as the entity trajectory features of the segment.
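One common definition of the vIoU between two trajectories, which this sketch assumes (the patent does not spell out its exact variant), is the sum of per-frame box intersections divided by the sum of per-frame unions over the frames the two trajectories share:

```python
def box_inter_union(a, b):
    # boxes are (x1, y1, x2, y2)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter, area_a + area_b - inter

def viou(track_a, track_b):
    # tracks map frame index -> box; overlap is accumulated on common frames
    common = set(track_a) & set(track_b)
    inter = union = 0.0
    for f in common:
        i, u = box_inter_union(track_a[f], track_b[f])
        inter += i
        union += u
    return inter / union if union > 0 else 0.0

t1 = {0: (0.0, 0.0, 2.0, 2.0), 1: (0.0, 0.0, 2.0, 2.0)}
t2 = {0: (5.0, 5.0, 6.0, 6.0)}
```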
2) concatenating (splicing) the entity features and entity trajectory features of the previous and current segments, and of the current and next segments, respectively, and feeding the two concatenations as two branches into the fully connected spatio-temporal graph convolutional network module; extracting the feature map of the entities in the current segment from the outputs of the two branches by element-wise addition;
3) obtaining a vector for predicting entity classification and a vector for predicting the predicate distribution;
4) multiplying each entity-classification vector with the predicate-distribution vector and, for each video segment, taking the L highest-scoring relation instances in the products as input to the association module with the twin network; merging short-term relation instances across the whole video with the online association method of the twin network, and obtaining association confidence scores;
specifically, the step 4) is as follows:
4.1) feature vectors from any two trajectories in two adjacent segments are input into the twin network, an embedding network composed of three linear transformation layers; the appearance-similarity confidence α of the two entities is then computed with a cosine similarity function:

α = cos(emb(fi), emb(fj)) (8)

where emb() denotes the embedding network, cos() the cosine similarity function, Ti and Tj are any two trajectories in two adjacent segments, and fi and fj are the features of trajectories Ti and Tj, respectively;
4.2) to consider the geometric and appearance information simultaneously, the vIoU value and the confidence α are each multiplied by a corresponding weight and then summed to obtain the final association confidence score s:

s = w1·vIoU + w2·α (9)

where w1 and w2 are the weights of the geometric and appearance terms;
4.3) the set of all short-term relation instances in the segment at the current time T is S_T = {(ct, <subject, predicate, object>, (Ts, To))}, where ct is the confidence score of a short-term instance, obtained as the product of the predicted entity-classification vectors Vo and the predicted predicate-distribution vector Vp; <subject, predicate, object> is the triplet of the short-term instance; and Ts and To are the trajectories of the entities corresponding to its subject and object, respectively. The set of all long-term relation instances already detected in the segments before time T is A = {(c, <s, p, o>, (Ts, To))}, where c is the confidence score of a long-term instance, <s, p, o> its <subject, predicate, object> triplet, and Ts and To the trajectories of its subject and object entities. The sets S_T and A are sorted in descending order of ct and c;
a two-level loop is then performed, the outer loop traversing one set and the inner loop the other. For each short-term instance in S_T and each long-term instance in A, the association confidence scores of their subject trajectories and of their object trajectories are computed according to equations (8)-(9); only when the triplets of the short-term and long-term instance are identical and both association confidence scores exceed a threshold γ are the two instances merged. For a long-term relation instance p spanning the m-th to the n-th segment, its confidence score cp is updated with the highest score among all short-term instances in p, as follows:
cp = max(ct) (t ∈ [m, n]) (10);
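The two-level association loop above can be sketched as a greedy merge; the instance layout and function names are illustrative assumptions. Matching short-term instances extend the long-term trajectories, and the long-term confidence is updated to the maximum short-term score, as in equation (10):

```python
def associate(short_insts, long_insts, assoc_score, gamma=0.5):
    """Greedy two-level merge (illustrative names; instance layout assumed).

    Each instance: {"score", "triplet", "s_track", "o_track"}, tracks as lists.
    assoc_score(track_a, track_b) -> association confidence.
    """
    short_insts = sorted(short_insts, key=lambda i: i["score"], reverse=True)
    long_insts = sorted(long_insts, key=lambda i: i["score"], reverse=True)
    used = set()
    for long_i in long_insts:
        for idx, short_i in enumerate(short_insts):
            if idx in used or short_i["triplet"] != long_i["triplet"]:
                continue
            if (assoc_score(short_i["s_track"], long_i["s_track"]) > gamma and
                    assoc_score(short_i["o_track"], long_i["o_track"]) > gamma):
                long_i["s_track"] = long_i["s_track"] + short_i["s_track"]
                long_i["o_track"] = long_i["o_track"] + short_i["o_track"]
                # c_p = max(c_t): keep the best short-term score in the instance
                long_i["score"] = max(long_i["score"], short_i["score"])
                used.add(idx)
                break
    # unmatched short-term instances start new long-term instances
    long_insts += [s for i, s in enumerate(short_insts) if i not in used]
    return long_insts

always = lambda a, b: 1.0  # toy association score for demonstration
longs = [{"score": 0.5, "triplet": ("dog", "chase", "cat"),
          "s_track": [1], "o_track": [2]}]
shorts = [{"score": 0.7, "triplet": ("dog", "chase", "cat"),
           "s_track": [3], "o_track": [4]},
          {"score": 0.2, "triplet": ("cat", "watch", "dog"),
           "s_track": [5], "o_track": [6]}]
out = associate(shorts, longs, always)
```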
5) and (4) arranging the detection results in a descending order according to the confidence level scores to obtain a video relation detection result.
In a preferred embodiment of the present invention, after the entity trajectory features are generated in step 1), the method further comprises: setting a vIoU threshold and removing entity trajectories below it, in order to reduce similar trajectories.
In a preferred embodiment of the present invention, the space-time graph convolution network module in step 2) is composed of a geometry graph convolution network and an appearance graph convolution network;
in the geometric graph convolution network, the vIoU values are taken as the entries of an affine matrix, and each row of the affine matrix is then normalized by the Manhattan norm:

Ag_ij = vIoU(Ti, Tj) / Σj vIoU(Ti, Tj)

where Ti denotes the i-th trajectory, Tj the j-th trajectory, vIoU(Ti, Tj) the vIoU value of the i-th and j-th trajectories, N the total number of trajectories (the sum runs over j = 1, ..., N), and Ag_ij the value in row i, column j of the affine matrix.
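The affine matrix construction and Manhattan (L1) row normalization above can be sketched in a few lines of Python; `geometric_affinity` accepts any vIoU callable and is purely illustrative:

```python
def geometric_affinity(tracks, viou):
    # A^g[i][j] = vIoU(T_i, T_j); each row is then L1 (Manhattan) normalized
    n = len(tracks)
    A = [[viou(tracks[i], tracks[j]) for j in range(n)] for i in range(n)]
    for row in A:
        s = sum(row)
        if s > 0:
            for j in range(n):
                row[j] /= s
    return A

# toy vIoU: 1.0 for a track with itself, 0.5 otherwise
stub = lambda a, b: 1.0 if a == b else 0.5
A = geometric_affinity(["t1", "t2"], stub)
```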
The geometric graph convolution network calculation formula is as follows:
Xg=norm(σ(AgXWg))
where Ag is the affine matrix of the geometric graph convolution network, X ∈ R^(N×d) are the entity features input to the geometric graph convolution network, Wg ∈ R^(d×d) is a learnable (adaptive) parameter matrix of the geometric graph convolution network, σ is a nonlinear activation function, norm is a normalization function, and Xg is the output of the geometric graph convolution network.
At the same time, ReLU activation and Layer-Norm are applied to the output Xg of the geometric graph convolution network, so that the input and output dimensions of the geometric graph convolution network remain consistent;
in the appearance graph convolution network, two different linear transformations are first applied to the entity features input to the network; the results are multiplied to obtain appearance correlation values, which form an appearance correlation matrix Aa that is rescaled row-wise with softmax:

Aa_ij = exp(φ(Xi)^T φ'(Xj)) / Σj exp(φ(Xi)^T φ'(Xj))

where Xi denotes the i-th entity feature, Xj the j-th entity feature, φ(Xi)^T the transpose of the i-th entity feature after a linear transformation, φ'(Xj) another linear transformation applied to the j-th entity feature, exp() the exponential function with base e, N the number of entity features (the sums run over j = 1, ..., N), and Aa_ij the value in row i, column j of the appearance correlation matrix.
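The appearance correlation matrix with row-wise softmax rescaling can be sketched as follows; the two learned linear transformations φ and φ' are stood in for by identity callables, an assumption made for illustration only:

```python
import math

def appearance_affinity(X, phi, phi_prime):
    # raw correlation phi(X_i)^T phi'(X_j), then row-wise softmax rescaling
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    raw = [[dot(phi(xi), phi_prime(xj)) for xj in X] for xi in X]
    A = []
    for row in raw:
        m = max(row)                       # subtract max for numerical stability
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        A.append([e / s for e in exps])
    return A

identity = lambda v: v  # stand-ins for the two learned linear transformations
A = appearance_affinity([[1.0, 0.0], [0.0, 1.0]], identity, identity)
```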
The appearance graph convolution network calculation formula is as follows:
Xa=norm(σ(AaXWa))
where Wa is a learnable (adaptive) parameter matrix of the appearance graph convolution network, and Xa is the output of the appearance graph convolution network.
Then, ReLU activation and Layer-Norm are likewise applied to the output Xa of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network remain consistent.
In a preferred embodiment of the invention, the entity features X input to the spatio-temporal graph convolution module, the output Xg of the geometric graph convolution network, and the output Xa of the appearance graph convolution network are added according to the following formula:
X′=norm(σ(Xa+X+Xg))
The computed output X′ undergoes ReLU activation and normalization and is input into the next spatio-temporal graph convolution module.
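The combination X′ = norm(σ(Xa + X + Xg)) can be sketched row-wise with σ = ReLU and norm = Layer-Norm, as the surrounding text describes; pure-Python lists stand in for real tensors:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def layer_norm(v, eps=1e-5):
    # normalize a row to zero mean and (near) unit variance
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def st_gcn_block(X, Xg, Xa):
    # X' = norm(sigma(Xa + X + Xg)), applied row by row with sigma = ReLU
    return [layer_norm(relu([a + x + g for a, x, g in zip(ra, rx, rg)]))
            for ra, rx, rg in zip(Xa, X, Xg)]

out = st_gcn_block(X=[[1.0, -2.0, 3.0]],
                   Xg=[[0.5, 0.5, 0.5]],
                   Xa=[[0.0, 1.0, 0.0]])
```

Because of the Layer-Norm, each output row is centered at zero, which keeps the input and output dimensions and scale consistent between stacked modules.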
In a preferred embodiment of the invention, in step 3), the feature map Z of the entities in the current segment obtained in step 2) is input into a linear transformation layer and a softmax layer to obtain the vector Vo for predicting entity classification:

Vo_i = softmax(φo(Zi)) (i ∈ [1, N])

where Zi is the feature vector in the i-th row of the feature map Z; φo(Zi) denotes a linear transformation of Zi; the dimension of the feature map Z is (N, d); and Vo_i denotes the i-th element of Vo.
In a preferred embodiment of the invention, in step 3), the feature vectors in the feature map Z of the entities in the current segment obtained in step 2) are paired to form a new <subject, object> feature map of dimension (N×(N−1), 2d), and a relative-motion feature map Zrm is obtained; the <subject, object> feature map and the relative-motion feature map Zrm are then spliced to generate a feature map Z′ of dimension (N×(N−1), 2d+d′), and Z′ is passed through a linear transformation layer and a sigmoid layer to generate the vector Vp of the predicted predicate distribution:

Vp_ij = sigmoid(φp([Zi; Zj; Zrm_ij]))

where φp denotes a linear transformation, sigmoid() the sigmoid layer, Zi and Zj the i-th and j-th row vectors of the feature map Z, Zrm_ij the element in row i, column j of the relative-motion feature map Zrm, and [Zi; Zj; Zrm_ij] the end-to-end splicing of the three.
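The pairing that produces the (N×(N−1), 2d+d′) feature map can be sketched as below; for simplicity the relative-motion feature is treated as a single scalar per pair (d′ = 1), an assumption made only for illustration:

```python
def pair_features(Z, Zrm):
    """Form all N*(N-1) ordered <subject, object> pairs; each pair's feature is
    the splicing [Z_i ; Z_j ; Zrm[i][j]].  Zrm is one scalar per pair here
    (d' = 1), purely for illustration."""
    pairs = []
    n = len(Z)
    for i in range(n):
        for j in range(n):
            if i != j:
                pairs.append(Z[i] + Z[j] + [Zrm[i][j]])
    return pairs

Z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]           # N = 3 entities, d = 2
Zrm = [[0.0, 0.1, 0.2], [0.3, 0.0, 0.4], [0.5, 0.6, 0.0]]
pairs = pair_features(Z, Zrm)
```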
Example 1
The video visual relation dataset VidVRD is used to test the video relation detection capability of the method. The VidVRD dataset contains 1000 videos fully labeled with object categories and the corresponding trajectories. Visual relations are labeled over 35 object categories and 132 predicate categories in the form <subject, predicate, object>. FIG. 1 illustrates examples of VidVRD video visual relation data; one visual relation instance is represented by the relation triplet <subject, predicate, object> together with the trajectories of the subject and object.
The steps performed in this example are described below in conjunction with the specific technical solutions described above, as follows:
1) first 80% of the VidVRD data set was used as the pre-acquired training data set and the remaining 20% of the video data was used as the test video data. Dividing each video into a plurality of segments, each segment having 30 frames;
2) the fine-tuned Faster R-CNN is used as the entity detector to generate entity detection boxes on each frame, and the frame-level entity boxes in each segment are linked to generate entity trajectory features. A vIoU threshold is set to reduce similar trajectories, and the top N trajectories are used as the input of the association network, with N set to 5;
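The vIoU-threshold suppression of similar trajectories can be sketched as a greedy, NMS-style pass; `suppress_similar` and the toy overlap function are illustrative assumptions, not from the patent:

```python
def suppress_similar(tracks, scores, viou, thresh=0.7, keep_n=5):
    # greedy pass: keep the highest-scoring tracks whose vIoU with every
    # already-kept track stays below the threshold, up to keep_n tracks
    order = sorted(range(len(tracks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(viou(tracks[i], tracks[k]) < thresh for k in kept):
            kept.append(i)
        if len(kept) == keep_n:
            break
    return kept

# toy overlap: tracks sharing a first letter are treated as heavily overlapping
overlap = lambda a, b: 1.0 if a[0] == b[0] else 0.0
kept = suppress_similar(["a1", "a2", "b1"], [0.9, 0.8, 0.7], overlap)
```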
3) three adjacent segments (the previous, current, and next segment) are taken, and the features of the previous and current segments and of the current and next segments are spliced respectively and input as two branches into the fully connected spatio-temporal graph convolutional network module. Each spatio-temporal graph convolution module consists of a geometric graph convolution network and an appearance graph convolution network;
4) in the geometric convolution network, the value of the vIoU is taken as a value in an affine matrix, and the calculation formula of the geometric convolution network is as follows:
Xg=norm(σ(AgXWg)) (1)
where Ag is the affine matrix of the geometric graph convolution network, X ∈ R^(N×d) are the entity features input to the geometric graph convolution network, Wg ∈ R^(d×d) is a learnable (adaptive) parameter matrix of the geometric graph convolution network, σ is a nonlinear activation function, norm is a normalization function, and Xg is the output of the geometric graph convolution network.
Each row of the affine matrix is then normalized by the Manhattan norm:

Ag_ij = vIoU(Ti, Tj) / Σj vIoU(Ti, Tj) (2)

At the same time, ReLU activation and Layer-Norm are applied to the output Xg of the geometric graph convolution network, so that the input and output dimensions of the geometric graph convolution network remain consistent;
5) in the appearance graph convolution network, two different linear transformations are first applied to the entity features and then multiplied to obtain appearance correlation values, which form an appearance correlation matrix Aa rescaled row-wise with softmax:

Aa_ij = exp(φ(Xi)^T φ'(Xj)) / Σj exp(φ(Xi)^T φ'(Xj)) (3)
the appearance graph convolution network calculation formula is as follows:
Xa=norm(σ(AaXWa)) (4)
where Wa is a learnable (adaptive) parameter matrix of the appearance graph convolution network, and Xa is the output of the appearance graph convolution network.
Then, ReLU activation and Layer-Norm are likewise applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network are kept consistent.
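The appearance branch, formulas (3)–(4), can be sketched in the same way. The two linear maps W1 and W2 stand in for the learned transformations φ and φ′; treating them as plain matrices (no bias) is a simplifying assumption:

```python
import numpy as np

def appearance_graph_conv(X, W1, W2, W_a):
    """Appearance graph convolution sketch: build A_a from two linear maps
    of the entity features X (N, d), then X_a = norm(sigma(A_a X W_a))."""
    S = (X @ W1) @ (X @ W2).T                    # pairwise appearance correlations
    S = S - S.max(axis=1, keepdims=True)         # numerically stable softmax
    A_a = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)  # row-wise rescaling
    H = np.maximum(A_a @ X @ W_a, 0.0)           # ReLU activation
    mu = H.mean(axis=1, keepdims=True)
    sd = H.std(axis=1, keepdims=True) + 1e-6
    return (H - mu) / sd                         # Layer-Norm, shape preserved
```

The softmax makes each row of A_a a probability distribution over the other entities, so entities attend to appearance-similar neighbors.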
6) the original feature X, the output X_g of the geometric graph convolution network, and the output X_a of the appearance graph convolution network are added according to the following formula:
X′ = norm(σ(X_a + X + X_g)) (5)
and performing ReLU activation and normalization on the calculation output result X', and inputting the result into the next space-time graph convolution network module.
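The residual fusion of formula (5) is a small step; a sketch (NumPy, again assuming σ is ReLU and norm is per-row Layer-Norm) makes the stacking explicit:

```python
import numpy as np

def fuse(X, X_g, X_a):
    """Formula (5): X' = norm(sigma(X_a + X + X_g)) — a residual sum of the
    original features and the two branch outputs, then ReLU + Layer-Norm,
    ready to feed the next space-time graph convolution network module."""
    H = np.maximum(X_a + X + X_g, 0.0)
    mu = H.mean(axis=1, keepdims=True)
    sd = H.std(axis=1, keepdims=True) + 1e-6
    return (H - mu) / sd
```

Keeping the raw X in the sum acts like a skip connection, so stacking several modules does not wash out the original entity features.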
7) the feature map Z of the entities in the current segment is extracted from the outputs of the two branches by element-wise addition. On the one hand, Z is input into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting entity classification, the formula being as follows:
V_i^o = softmax(φ^o(Z_i)) (i ∈ [1, N]) (6)
On the other hand, every two feature vectors in Z are paired to form new <subject, object> features of dimension (N×(N-1), 2d). This <subject, object> feature map is then spliced with the relative motion feature map Z_rm to generate a feature map Z′ of dimension (N×(N-1), 2d+d′), and the feature map Z′ passes through a linear transformation layer and a sigmoid layer to generate the vector V^p of the predicted predicate distribution, the formula being as follows:

V^p = sigmoid(φ^p(Z′)) (7)
Finally, for each relationship instance triplet <subject, predicate, object>, the confidence scores V^o and V^p are multiplied, and for each segment the n relation instances with the highest scores are taken as the input of the association network;
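The triplet scoring in step 7) can be sketched as below. The exact score combination (subject class confidence × predicate confidence × object class confidence) is an assumption consistent with "multiplying the V^o and V^p confidence scores"; the names are illustrative only.

```python
import numpy as np

def top_relation_instances(V_o, V_p_pairs, pairs, n=10):
    """Rank <subject, predicate, object> instances for one segment.

    V_o:       (N, C) per-entity class distributions from formula (6);
    pairs:     list of (subject_index, object_index) entity pairs;
    V_p_pairs: per-pair predicate distributions from formula (7).
    Returns the n highest-scoring (score, triplet, subj, obj) tuples.
    """
    instances = []
    for (i, j), V_p in zip(pairs, V_p_pairs):
        s_cls, o_cls = int(V_o[i].argmax()), int(V_o[j].argmax())
        for p, p_conf in enumerate(V_p):
            score = V_o[i][s_cls] * p_conf * V_o[j][o_cls]
            instances.append((score, (s_cls, p, o_cls), i, j))
    instances.sort(key=lambda t: t[0], reverse=True)
    return instances[:n]
```

Because V^p comes from a sigmoid rather than a softmax, several predicates per pair can score highly at once, which suits multi-label relations like "walk" and "toward".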
8) the feature vectors of subjects or objects from two adjacent segments are input into an embedding network, and then the confidence score α of the appearance similarity of the two entities is calculated with a cosine similarity function, the formula being as follows:

α = cos(emb(f_1), emb(f_2)) (8)

wherein emb() denotes the embedding network, and f_1 and f_2 are the features of the two tracks;
9) in order to consider the geometric information and the appearance information simultaneously, the vIoU and the confidence α are each multiplied by a corresponding weight and then added to obtain the final association confidence score ĉ, the formula being as follows:

ĉ = w_1·vIoU + w_2·α (9)
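Formulas (8)–(9) combine into a short function. The equal default weights and the identity embedding are placeholders: the patent learns the embedding with a twin network and does not disclose the weight values.

```python
import numpy as np

def association_score(f1, f2, v_iou, w_geo=0.5, w_app=0.5, emb=lambda x: x):
    """Association confidence of two tracks from adjacent segments:
    cosine similarity of embedded features (formula 8), combined with
    the geometric vIoU by a weighted sum (formula 9)."""
    e1 = emb(np.asarray(f1, dtype=float))
    e2 = emb(np.asarray(f2, dtype=float))
    alpha = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-12)
    return w_geo * v_iou + w_app * alpha
```

In the full system, `emb` would be the three-layer twin network of claim 8, shared between both inputs so that similar entities map to nearby embeddings.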
10) the set of all short-term relationship instances in the segment corresponding to the current time T is denoted S_T, wherein each short-term instance carries a confidence score ĉ (the product of the vector V^o of the predicted entity classification and the vector V^p of the predicted predicate distribution), its corresponding <subject, predicate, object> triplet, and the entity track of the subject and the entity track of the object in the short-term instance; the set of all long-term relationship instances already detected in the segments before time T is denoted L_T, wherein each long-term instance carries a confidence score c, its corresponding <subject, predicate, object> triplet <s, p, o>, and the entity track of the subject and the entity track of the object in the long-term instance. The sets S_T and L_T are sorted in descending order by ĉ and c, respectively.
Then a two-layer loop is performed: the outer loop traverses the long-term instance set and the inner loop traverses the short-term instance set. For each short-term relationship instance and each long-term relationship instance, the association confidence scores of their subject tracks and of their object tracks are calculated according to steps (8) to (9); only when the triplets of the short-term relationship instance and the long-term relationship instance are identical, and both association confidence scores are greater than a threshold y, are the two instances merged. For a long-term relationship instance p spanning the mth segment to the nth segment, its confidence score c_p is updated with the highest score of all short-term relationship instances in p, as follows:
c_p = max(c_t) (t ∈ [m, n]) (10)
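The online association loop of step 10) can be sketched as follows. The dictionary layout of an instance and the greedy one-match-per-long-instance policy are illustrative assumptions; `assoc(t1, t2)` stands for the association confidence of steps 8)–9).

```python
def merge_instances(long_insts, short_insts, assoc, thresh=0.5):
    """Greedily extend long-term relation instances with matching
    short-term instances from the current segment.

    Each instance is a dict with 'triplet', 'score', 'sub_track',
    'obj_track' (hypothetical layout). Merging requires identical
    triplets and both subject- and object-track association scores
    above the threshold; formula (10) keeps the max score."""
    long_insts.sort(key=lambda r: r["score"], reverse=True)   # outer set, desc.
    short_insts.sort(key=lambda r: r["score"], reverse=True)  # inner set, desc.
    for L in long_insts:
        for S in list(short_insts):
            if (L["triplet"] == S["triplet"]
                    and assoc(L["sub_track"], S["sub_track"]) > thresh
                    and assoc(L["obj_track"], S["obj_track"]) > thresh):
                L["sub_track"] += S["sub_track"]   # extend the trajectories
                L["obj_track"] += S["obj_track"]
                L["score"] = max(L["score"], S["score"])  # formula (10)
                short_insts.remove(S)
                break
    return long_insts
```

Unmatched short-term instances remain in `short_insts` and would seed new long-term instances in a full implementation.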
11) after the detection results are sorted in descending order of confidence score, Recall@K (K = 50 or 100) is used as the evaluation index of video visual relationship detection; it represents the proportion of correct video visual relationship instances detected among the top K detection results. In addition, the top 5 results are taken to evaluate the precision of the results.
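The Recall@K metric of step 11) reduces to a few lines. The matching criterion here is simplified to triplet equality; the actual VidVRD benchmark additionally requires the detected subject and object trajectories to overlap the ground truth by a vIoU threshold.

```python
def recall_at_k(detections, ground_truth, k):
    """Fraction of ground-truth relation instances matched by the top-K
    detections (sorted by confidence score). Simplified sketch: a match
    is counted on triplet equality alone."""
    top_k = [d["triplet"] for d in
             sorted(detections, key=lambda d: d["score"], reverse=True)[:k]]
    hit = sum(1 for g in ground_truth if g in top_k)
    return hit / len(ground_truth) if ground_truth else 0.0
```

Running this with K = 50 and K = 100 over all test videos and averaging gives the two reported indices.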
In an implementation example, the test results of the method of the invention are compared with the reference results provided by the VidVRD dataset; the results are shown in Figs. 2 to 4. Fig. 2 is a curve showing how the accuracy of VRD-GCN on the VidVRD dataset varies with training epochs: the accuracy gradually increases with the number of training epochs and stabilizes after 25 epochs. Fig. 3 shows the iteration convergence curve of the algorithm: as the number of training epochs increases, the loss function value gradually decreases; after 25 epochs the decrease slows and the loss value approaches 0. Fig. 4 compares the VRD-GCN video relationship detection results with the reference results: for the same video segment, the detection results of the invention are richer and more accurate than those of the reference method VidVRD.
The above-mentioned embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but this should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (9)
1. A video relation detection method based on a space-time diagram is characterized by comprising the following steps:
1) acquiring the entity characteristics of the video clip at the frame level and the entity track characteristics of the video clip;
2) respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and inputting the spliced entity characteristics and the entity track characteristics as two branches into a fully-connected space-time graph convolution network module; extracting the characteristic graph of an entity in the current segment from the output of two branches of the fully-connected space-time graph convolutional network module in an element addition mode;
3) obtaining a vector for predicting entity classification and a vector for predicting predicate distribution;
4) multiplying each vector for predicting entity classification by the vector of the predicted predicate distribution, and for each video segment taking the L relation instances with the highest scores in the multiplication results as the input of the association module with the twin network; merging short-term relationship instances in the whole video by using an online association method of the twin network; and acquiring association confidence scores;
5) and (4) arranging the detection results in a descending order according to the confidence level scores to obtain a video relation detection result.
2. The method according to claim 1, wherein the step 1) comprises:
dividing the video into a plurality of segments, each segment comprising a plurality of frames; for each segment, generating an entity detection frame on each frame, extracting entity features, and connecting the frame-level entity frames in each segment to generate entity track features; and sorting the generated entity tracks in descending order of vIoU value, and taking the first N tracks as the entity track features of the segment.
3. The method for detecting video relationship based on space-time diagram according to claim 2, wherein after generating the entity track feature in step 1), the method further comprises: and setting a vIoU threshold value, and removing the entity track below the threshold value.
4. The spatio-temporal graph-based video relationship detection method according to claim 1, wherein the spatio-temporal graph convolution network module in the step 2) is composed of a geometry graph convolution network and an appearance graph convolution network;
in the geometric graph convolution network, the vIoU value is used as the corresponding entry of the affinity matrix, and then each row of the affinity matrix is normalized by the Manhattan norm; ReLU activation and Layer-Norm are applied to the output X_g of the geometric graph convolution network, so that the input and output dimensions of the geometric graph convolution network are kept consistent;
in the appearance graph convolution network, two different linear transformations are applied to the entity features input to the appearance graph convolution network, which are then multiplied to obtain appearance correlation values constituting the appearance correlation matrix A_a, each row of which is rescaled with softmax; ReLU activation and Layer-Norm are applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network are kept consistent.
5. The spatio-temporal graph-based video relationship detection method according to claim 4, wherein the entity feature X input to the space-time graph convolution network module, the output X_g of the geometric graph convolution network, and the output X_a of the appearance graph convolution network are added according to the following formula:
X′ = norm(σ(X_a + X + X_g))
and performing ReLU activation and normalization on the calculation output result X', and inputting the result into the next space-time graph convolution network module.
6. The method according to claim 1, wherein in step 3),
inputting the feature map Z of the entities in the current segment obtained in step 2) into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting entity classification, the formula being as follows:
V_i^o = softmax(φ^o(Z_i)) (i ∈ [1, N])
wherein Z_i is the feature vector of the ith row in the feature map Z; φ^o(Z_i) represents performing a linear transformation on Z_i; the dimension of the feature map Z is (N, d), and V_i^o represents the ith element in the vector V^o.
7. The method according to claim 1, wherein in step 3),
pairing every two feature vectors in the feature map Z of the entities in the current segment obtained in step 2) to form new <subject, object> features of dimension (N×(N-1), 2d); obtaining the relative motion feature map Z_rm; then splicing the <subject, object> feature map with the relative motion feature map Z_rm to generate a feature map Z′ of dimension (N×(N-1), 2d+d′); and generating the vector V^p of the predicted predicate distribution by passing the feature map Z′ through a linear transformation layer and a sigmoid layer, the formula being as follows:

V^p = sigmoid(φ^p(Z′))
8. The method for detecting video relationship based on spatio-temporal graph according to claim 1, wherein in the step 4), the association module with twin network processes as follows:
4.1) inputting the feature vectors of any two tracks in two adjacent segments into a twin network, the twin network being an embedding network consisting of three linear transformation layers, and then calculating the confidence score α of the appearance similarity of the two entities with a cosine similarity function, the formula being as follows:

α = cos(emb(f_1), emb(f_2))
where emb () represents the embedded network,a function representing the degree of similarity of the cosine,andis any two tracks in two adjacent segments,andare respectively a trackAndthe features of (1);
4.2) considering the geometric information and the appearance information simultaneously, multiplying the vIoU value and the confidence α each by a corresponding weight and then adding them to obtain the final association confidence score ĉ, the formula being as follows:

ĉ = w_1·vIoU + w_2·α
4.3) the set of all short-term relationship instances in the segment corresponding to the current time T is denoted S_T, wherein each short-term instance carries a confidence score ĉ, being the product of the vector V^o of the predicted entity classification and the vector V^p of the predicted predicate distribution, its corresponding <subject, predicate, object> triplet, and the entity track of the subject and the entity track of the object in the short-term instance; the set of all long-term relationship instances already detected in the segments before time T is denoted L_T, wherein each long-term instance carries a confidence score c, its corresponding <subject, predicate, object> triplet <s, p, o>, and the entity track of the subject and the entity track of the object in the long-term instance; the sets S_T and L_T are sorted in descending order by ĉ and c, respectively;
then a two-layer loop is performed, the outer loop traversing the long-term instance set and the inner loop traversing the short-term instance set; for each short-term relationship instance in the short-term instance set and each long-term relationship instance in the long-term instance set, the association confidence scores of their subject tracks and of their object tracks are calculated according to steps 4.1) and 4.2); only when the triplets of the short-term relationship instance and the long-term relationship instance are identical, and both association confidence scores are greater than a threshold y, are the two instances merged; for a long-term relationship instance p spanning the mth segment to the nth segment, its confidence score c_p is updated with the highest score of all short-term relationship instances in p, as follows:
c_p = max(c_t) (t ∈ [m, n]).
9. a video relationship detection system based on a space-time diagram, comprising:
the characteristic extraction module is used for acquiring the frame-level entity characteristics of the video clips and connecting the frame-level entity frames in each clip to generate entity track characteristics;
the characteristic splicing module is used for respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and the spliced entity characteristics and the entity track characteristics are used as two branch inputs of the fully-connected space-time graph convolution network module;
the fully-connected space-time graph convolution network module is provided with two branches and comprises a plurality of space-time graph convolution network modules; each space-time graph convolution network module consists of a geometric graph convolution network and an appearance graph convolution network; adding the entity characteristics input into the space-time convolution network module with the output of the geometric graph convolution network in the space-time convolution network module and the output of the appearance convolution network to obtain an output result of the space-time convolution network module, and taking the output result after ReLU activation and normalization as the input of the next space-time graph convolution network module;
the characteristic graph extraction module extracts the characteristic graph of the entity in the current segment from the output of the two branches of the fully-connected space-time graph convolution network module in an element addition mode;
a first feature vector generation unit for obtaining a vector of the predicted entity classification;
a second feature vector generation unit configured to obtain a vector of the predicted predicate distribution;
the relation instance module multiplies each vector for predicting entity classification by the vector of prediction predicate distribution, and for each video segment, the L relation instances with the highest scores in the multiplication results are taken as the input of the association module with the twin network;
the association module with the twin network combines the short-term relationship examples in the whole video by using an online association method of the twin network; acquiring an association confidence score;
and the detection result output module is used for arranging the detection results of the association module with the twin network in a descending order according to the confidence score and outputting the video relation detection result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011280036.1A CN112347965A (en) | 2020-11-16 | 2020-11-16 | Video relation detection method and system based on space-time diagram |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112347965A true CN112347965A (en) | 2021-02-09 |
Family
ID=74362926
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125406A (en) * | 2019-12-23 | 2020-05-08 | 天津大学 | Visual relation detection method based on self-adaptive cluster learning |
Non-Patent Citations (1)
Title |
---|
XUFENG QIAN et al.: "Video Relation Detection with Spatio-Temporal Graph", Session 1A: Multimodal Fusion & Visual Relations *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112883868A (en) * | 2021-02-10 | 2021-06-01 | 中国科学技术大学 | Training method of weak surveillance video motion positioning model based on relational modeling |
CN112883868B (en) * | 2021-02-10 | 2022-07-15 | 中国科学技术大学 | Training method of weak supervision video motion positioning model based on relational modeling |
CN113569559A (en) * | 2021-07-23 | 2021-10-29 | 北京智慧星光信息技术有限公司 | Short text entity emotion analysis method and system, electronic equipment and storage medium |
CN113569559B (en) * | 2021-07-23 | 2024-02-02 | 北京智慧星光信息技术有限公司 | Short text entity emotion analysis method, system, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| WD01 | Invention patent application deemed withdrawn after publication | ||

Application publication date: 20210209 |