CN112347965A - Video relation detection method and system based on space-time diagram - Google Patents

Video relation detection method and system based on space-time diagram

Info

Publication number
CN112347965A
CN112347965A (application number CN202011280036.1A)
Authority
CN
China
Prior art keywords
entity
video
convolution network
segment
space
Prior art date
Legal status
Pending
Application number
CN202011280036.1A
Other languages
Chinese (zh)
Inventor
庄越挺
肖俊
汤斯亮
吴飞
杨易
李晓林
谭炽烈
蒋韬
Current Assignee
Zhejiang University ZJU
Tongdun Holdings Co Ltd
Original Assignee
Zhejiang University ZJU
Tongdun Holdings Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU, Tongdun Holdings Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202011280036.1A
Publication of CN112347965A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a video relation detection method and system based on a space-time graph. First, the set of entities in a video and their relationships are modeled as a fully connected spatio-temporal graph that includes the entity nodes in the neighborhood of the temporal and spatial dimensions. For relationship detection, the invention provides a video relation detection graph convolution network model (VRD-GCN) that aggregates information from context and performs reasoning in the space-time graph. On one hand, VRD-GCN detects dynamic relationships between entities by capturing the relative changes in geometry and appearance of the entities along the spatio-temporal dimension. On the other hand, by passing messages from nodes and context in the neighborhood of the space-time graph to the target entity, VRD-GCN can generate more accurate and complete detection results. After relationship instances are detected in each video segment, the short-term relationship instances throughout the video are merged by an online association method using a twin (Siamese) network. The method achieves high accuracy for relation detection in video.

Description

Video relation detection method and system based on space-time diagram
Technical Field
The invention relates to video relation detection, a space-time graph convolutional neural network and a twin association network in machine learning and computer vision research, in particular to a video relation (visual relation) detection method and system based on a space-time graph.
Background
Understanding visual information is a primary goal of computer vision. Relationship detection in visual content requires capturing fine-grained visual cues, including locating entities and the way they interact, which is a challenging but meaningful task. Although the relationships between objects in video are an important component of deep understanding of dynamic visual content, relationship detection and reasoning in video has rarely been studied. Successful attempts at detecting video relationships will not only help build more effective models for certain advanced visual understanding tasks (e.g., visual question answering and visual captioning), but will also promote the development of other areas of computer vision, such as video retrieval, video motion detection and video activity recognition.
A number of recent studies have achieved exciting and important results in static-image relationship detection. A natural solution for relationship detection in video is to extend these methods directly to video. However, satisfactory results cannot be obtained because of the inherent differences between images and videos: methods designed for static-image relationship detection and reasoning tend to ignore the dynamic interactions between entities that constantly occur in video. Given the nature of video, a relationship detection and inference solution for video should be able to capture dynamic, time-varying relationships between entities. The paper "Video Visual Relation Detection" by Xindi Shang et al. is so far the only attempt focused on detecting relationships in video; however, its performance is limited, in part because it lacks the ability to gather cues from the surrounding context.
To address these problems, the invention provides a video relation detection method based on a space-time graph. Unlike the aforementioned method, the present method uses message passing between entities to perform video relationship prediction. In addition, because scene changes or trajectory drift can make it impossible to determine from geometric overlap alone whether two trajectories in consecutive segments belong to the same entity, the invention provides a novel online association method using a twin network; this method considers both appearance similarity and the geometric overlap of relationship instances, which greatly improves accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a video relation detection method and system based on a space-time diagram.
The invention firstly discloses a video relation detection method based on a space-time diagram, which comprises the following steps:
1) acquiring the entity characteristics of the video clip at the frame level and the entity track characteristics of the video clip;
2) respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and inputting the spliced entity characteristics and the entity track characteristics as two branches into a fully-connected space-time graph convolution network module; extracting the characteristic graph of an entity in the current segment from the output of two branches of the fully-connected space-time graph convolutional network module in an element addition mode;
3) obtaining a vector for predicting entity classification and a vector for predicting predicate distribution;
4) multiplying each vector for predicting entity classification with the vector of prediction predicate distribution, and taking the L relation instances with the highest scores in the multiplication results as the input of the association module with the twin network for each video segment; merging short-term relationship instances in the whole video by using an online association method of a twin network; acquiring an association confidence score;
5) arranging the detection results of step 4) in descending order according to the confidence scores to obtain the video relation detection result.
Preferably, the processing procedure of the association module with the twin network is as follows:
4.1) The feature vectors of any two tracks from two adjacent segments are input into a twin network, which is an embedding network consisting of three linear transformation layers; the confidence α of the appearance similarity of the two entities is then computed by a cosine similarity function, according to the following formula:

α = cos(emb(f_i), emb(f_j))

where emb() denotes the embedding network, cos(·, ·) denotes the cosine similarity function, T_i and T_j are any two tracks in two adjacent segments, and f_i and f_j are the features of tracks T_i and T_j, respectively;
4.2) The geometric information and the appearance information are considered simultaneously: the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s, according to the following formula:

s = w_g · vIoU(T_i, T_j) + w_a · α

where w_g and w_a are the weights of the geometric term and the appearance term, respectively;
4.3) The set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance. The set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance. The set Q_T is sorted in descending order of c_t and the set P in descending order of c.
Then a two-layer loop is performed: the outer loop traverses the set Q_T and the inner loop traverses the set P. For a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to steps 4.1) and 4.2); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged. For a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n].
the invention also discloses a video relation detection system based on the space-time diagram, which comprises the following steps:
the characteristic extraction module is used for acquiring the frame-level entity characteristics of the video clips and connecting the frame-level entity frames in each clip to generate entity track characteristics;
the characteristic splicing module is used for respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and the spliced entity characteristics and the entity track characteristics are used as two branch inputs of the fully-connected space-time graph convolution network module;
the fully-connected space-time graph convolution network module is provided with two branches and comprises a plurality of space-time graph convolution network modules; each space-time graph convolution network module consists of a geometry graph convolution network and an appearance graph convolution network; the entity characteristics input into the space-time graph convolution network module are added to the output of the geometry graph convolution network and the output of the appearance graph convolution network in that module to obtain the output result of the module, and the output result, after ReLU activation and normalization, is taken as the input of the next space-time graph convolution network module;
the characteristic graph extraction module extracts the characteristic graph of the entity in the current segment from the output of the two branches of the fully-connected space-time graph convolution network module in an element addition mode;
a first feature vector generation unit for obtaining a vector of the predicted entity classification;
a second feature vector generation unit configured to obtain a vector of the predicted predicate distribution;
the relation instance module multiplies each vector for predicting entity classification by the vector of prediction predicate distribution, and for each video segment, the L relation instances with the highest scores in the multiplication results are taken as the input of the association module with the twin network;
the association module with the twin network combines the short-term relationship examples in the whole video by using an online association method of the twin network; acquiring an association confidence score;
and the detection result output module is used for arranging the detection results of the association module with the twin network in a descending order according to the confidence score and outputting the video relation detection result.
Because an association method with a twin network is adopted, the invention overcomes the problem of the greedy association algorithm used in the prior art, which relies only on geometric information and produces inaccurate results when trajectory generation is inaccurate or trajectory drift occurs; the accuracy of the trajectory association result is thus effectively improved, and the performance of the association algorithm is enhanced. In addition, the space-time-graph-based video relation detection model VRD-GCN abstracts a video into a fully-connected space-time graph, passes messages within the space-time graph and performs reasoning; the method is novel and achieves excellent video relation detection results.
Drawings
FIG. 1 shows a sample of VidVRD video visual relationship data;
FIG. 2 is a curve of accuracy versus training epoch for VRD-GCN on the VidVRD dataset;
FIG. 3 is the algorithm's iteration convergence curve;
FIG. 4 compares VRD-GCN video relationship detection results with the reference results;
FIG. 5 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 5, the video relationship detection method based on the space-time diagram of the present invention includes the following steps:
1) acquiring the entity characteristics of the video clip at the frame level and the entity track characteristics of the video clip;
dividing the video into a plurality of segments, each segment comprising a plurality of frames; for each segment, generating entity detection frames (bounding boxes) on each frame, extracting entity characteristics, and connecting the frame-level entity frames within each segment to generate entity track characteristics; and sorting the generated entity tracks in descending order according to the vIoU value and taking the first N tracks as the entity track characteristics of the segment.
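For illustration, a minimal sketch of how the vIoU (volumetric IoU over two trajectories) used for ranking and de-duplicating tracks might be computed is given below; the per-frame accumulation and the dictionary-based track representation are assumptions, as the patent does not spell out the exact definition.

```python
def viou(track_a, track_b):
    """Volumetric IoU between two tracklets (a sketch; the patent's exact definition may differ).

    Each track is a dict {frame_index: [x1, y1, x2, y2]}. The score is the sum of
    per-frame box intersections divided by the sum of per-frame box unions, taken
    over all frames covered by either track.
    """
    def area(box):
        return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

    inter_total, union_total = 0.0, 0.0
    for f in set(track_a) | set(track_b):
        a, b = track_a.get(f), track_b.get(f)
        if a is None or b is None:          # frame covered by only one of the two tracks
            union_total += area(a if a is not None else b)
            continue
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        inter_total += inter
        union_total += area(a) + area(b) - inter
    return inter_total / union_total if union_total > 0 else 0.0
```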
2) Respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and inputting the spliced entity characteristics and the entity track characteristics as two branches into a fully-connected space-time graph convolution network module; extracting the characteristic graph of an entity in the current segment from the output of two branches of the fully-connected space-time graph convolutional network module in an element addition mode;
3) obtaining a vector for predicting entity classification and a vector for predicting predicate distribution;
4) multiplying each vector for predicting entity classification with the vector of prediction predicate distribution, and taking the L relation instances with the highest scores in the multiplication results as the input of the association module with the twin network for each video segment; merging short-term relationship instances in the whole video by using an online association method of a twin network; acquiring an association confidence score;
specifically, the step 4) is as follows:
4.1) The feature vectors of any two tracks from two adjacent segments are input into a twin network, which is an embedding network consisting of three linear transformation layers; the confidence α of the appearance similarity of the two entities is then computed by a cosine similarity function, according to the following formula:

α = cos(emb(f_i), emb(f_j))

where emb() denotes the embedding network, cos(·, ·) denotes the cosine similarity function, T_i and T_j are any two tracks in two adjacent segments, and f_i and f_j are the features of tracks T_i and T_j, respectively;
4.2) The geometric information and the appearance information are considered simultaneously: the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s, according to the following formula:

s = w_g · vIoU(T_i, T_j) + w_a · α

where w_g and w_a are the weights of the geometric term and the appearance term, respectively;
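For illustration, steps 4.1) and 4.2) might be sketched in PyTorch as follows; the layer widths of the three-layer embedding network and the weights w_a and w_g are placeholders rather than values taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackEmbedding(nn.Module):
    """Twin (Siamese) embedding network of three linear transformation layers.
    The hidden sizes below are assumptions."""
    def __init__(self, dim_in=1024, dim_hidden=512, dim_out=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim_in, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, x):
        return self.layers(x)

def association_confidence(emb, feat_i, feat_j, viou_ij, w_a=0.5, w_g=0.5):
    """Step 4.1: appearance confidence alpha = cos(emb(f_i), emb(f_j)).
    Step 4.2: weighted sum of alpha and the geometric vIoU (weights are placeholders)."""
    alpha = F.cosine_similarity(emb(feat_i), emb(feat_j), dim=-1)
    return w_g * viou_ij + w_a * alpha
```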
4.3) The set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance. The set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance. The set Q_T is sorted in descending order of c_t and the set P in descending order of c.
Then a two-layer loop is performed: the outer loop traverses the set Q_T and the inner loop traverses the set P. For a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to steps 4.1) and 4.2); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged. For a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n].
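A minimal sketch of the two-layer loop of step 4.3) is given below; the field names, and the handling of short-term instances that match no existing long-term instance (starting a new long-term instance), are assumptions made for the sake of a complete example.

```python
def merge_short_term(short_instances, long_instances, assoc_score, threshold):
    """Associate the current segment's short-term relation instances with the
    long-term instances detected so far (step 4.3, sketched).

    Each instance is a dict with keys 'score', 'triplet' (<subject, predicate, object>),
    'subj_track' and 'obj_track'; `assoc_score(track_a, track_b)` is the association
    confidence of steps 4.1)-4.2). Field names are illustrative, not from the patent.
    """
    short_instances = sorted(short_instances, key=lambda r: r['score'], reverse=True)
    long_instances = sorted(long_instances, key=lambda r: r['score'], reverse=True)
    for st in short_instances:                      # outer loop over Q_T
        merged = False
        for lt in long_instances:                   # inner loop over P
            if st['triplet'] != lt['triplet']:
                continue
            if (assoc_score(st['subj_track'], lt['subj_track']) > threshold and
                    assoc_score(st['obj_track'], lt['obj_track']) > threshold):
                lt['subj_track'].extend(st['subj_track'])      # extend the trajectories
                lt['obj_track'].extend(st['obj_track'])
                lt['score'] = max(lt['score'], st['score'])    # c_p = max(c_t)
                merged = True
                break
        if not merged:                              # assumed: unmatched instances start new ones
            long_instances.append(dict(st))
    return long_instances
```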
5) arranging the detection results of step 4) in descending order according to the confidence scores to obtain the video relation detection result.
In a preferred embodiment of the present invention, after the entity track features are generated in step 1), the method further includes: setting a vIoU threshold and removing the entity tracks below the threshold (to reduce similar tracks).
In a preferred embodiment of the present invention, the space-time graph convolution network module in step 2) is composed of a geometry graph convolution network and an appearance graph convolution network;
in the geometry graph convolution network, the vIoU value is taken as the entry of an affine matrix, and each row of the affine matrix is then normalized by the Manhattan norm, according to the following formula:

A_g(i, j) = vIoU(T_i, T_j) / Σ_{k=1}^{N} vIoU(T_i, T_k)

where T_i denotes the i-th track, T_j denotes the j-th track, vIoU(T_i, T_j) denotes the vIoU value of the i-th and j-th tracks, N denotes the total number of tracks, and A_g(i, j) denotes the value in the i-th row and j-th column of the affine matrix.
The geometry graph convolution network is computed as follows:

X_g = norm(σ(A_g X W_g))

where A_g is the affine matrix of the geometry graph convolution network, X ∈ R^{N×d} is the entity feature input to the geometry graph convolution network, W_g ∈ R^{d×d} is an adaptive parameter matrix of the geometry graph convolution network, σ is a nonlinear activation function, norm is a normalization function, and X_g is the output of the geometry graph convolution network.
At the same time, ReLU activation and Layer-Norm are applied to the output X_g of the geometry graph convolution network, so that the input and output dimensions of the geometry graph convolution network remain consistent.
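For illustration, a PyTorch sketch of the geometry graph convolution described above; folding σ, ReLU and Layer-Norm into a single pass is an assumption about the exact ordering of the operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryGraphConv(nn.Module):
    """X_g = norm(sigma(A_g X W_g)), where A_g holds pairwise vIoU values
    and each row is normalized by its Manhattan (L1) norm."""
    def __init__(self, d):
        super().__init__()
        self.w_g = nn.Linear(d, d, bias=False)   # adaptive parameter matrix W_g
        self.norm = nn.LayerNorm(d)

    def forward(self, x, viou_matrix):
        # x: (N, d) entity features; viou_matrix: (N, N) pairwise vIoU values
        a_g = viou_matrix / viou_matrix.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return self.norm(F.relu(a_g @ self.w_g(x)))
```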
In the appearance graph convolution network, two different linear transformations are first applied to the entity features input to the appearance graph convolution network, and the results are multiplied to obtain appearance correlation values, which form an appearance correlation matrix A_a; each row is then rescaled with softmax, according to the following formula:

A_a(i, j) = exp(φ(X_i)^T φ'(X_j)) / Σ_{k=1}^{N} exp(φ(X_i)^T φ'(X_k))

where X_i denotes the i-th entity feature, X_j denotes the j-th entity feature, φ(X_i)^T denotes the transpose of the i-th entity feature after one linear transformation, φ'(X_j) denotes the other linear transformation applied to the j-th entity feature, exp() denotes the exponential function with the natural constant e as the base, N denotes the number of entity features, and A_a(i, j) denotes the value in the i-th row and j-th column of the appearance correlation matrix.
The appearance graph convolution network is computed as follows:

X_a = norm(σ(A_a X W_a))

where W_a is an adaptive parameter matrix of the appearance graph convolution network and X_a is the output of the appearance graph convolution network.
Then, ReLU activation and Layer-Norm are likewise applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network remain consistent.
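A corresponding sketch of the appearance graph convolution, under the same assumptions as the geometry sketch above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppearanceGraphConv(nn.Module):
    """A_a = softmax(phi(X) phi'(X)^T) row-wise, then X_a = norm(sigma(A_a X W_a))."""
    def __init__(self, d):
        super().__init__()
        self.phi = nn.Linear(d, d)        # first linear transformation
        self.phi_prime = nn.Linear(d, d)  # second linear transformation
        self.w_a = nn.Linear(d, d, bias=False)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        # x: (N, d) entity features
        a_a = F.softmax(self.phi(x) @ self.phi_prime(x).transpose(-1, -2), dim=-1)
        return self.norm(F.relu(a_a @ self.w_a(x)))
```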
In a preferred embodiment of the invention, the entity feature X input into the spatio-temporal graph convolution network module, the output X_g of the geometry graph convolution network and the output X_a of the appearance graph convolution network are added according to the following formula:

X′ = norm(σ(X_a + X + X_g))

The result X′ is then subjected to ReLU activation and normalization and input into the next spatio-temporal graph convolution network module.
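Reusing the two sub-network sketches above, one spatio-temporal graph convolution block that fuses X_a, X and X_g might then look as follows (the wiring is an assumption consistent with the formula):

```python
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalGCBlock(nn.Module):
    """X' = norm(sigma(X_a + X + X_g)); assumes GeometryGraphConv and
    AppearanceGraphConv from the sketches above are in scope."""
    def __init__(self, d):
        super().__init__()
        self.geo = GeometryGraphConv(d)
        self.app = AppearanceGraphConv(d)
        self.norm = nn.LayerNorm(d)

    def forward(self, x, viou_matrix):
        x_g = self.geo(x, viou_matrix)
        x_a = self.app(x)
        return self.norm(F.relu(x_a + x + x_g))   # result feeds the next block
```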
In a preferred embodiment of the present invention, in step 3), the feature map Z of the entities in the current segment obtained in step 2) is input into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting the entity classification, according to the following formula:

V_i^o = softmax(φ_o(Z_i)), i ∈ [1, N]

where Z_i is the feature vector of the i-th row of the feature map Z; φ_o(Z_i) denotes a linear transformation applied to Z_i; the dimension of the feature map Z is (N, d); and V_i^o denotes the i-th element of the vector V^o.
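A minimal sketch of the entity classification head; the 35-class output matches the VidVRD object vocabulary used in the embodiment, and the module name is illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class EntityClassifier(nn.Module):
    """V_i^o = softmax(phi_o(Z_i)) for every row Z_i of the feature map Z."""
    def __init__(self, d, num_classes=35):
        super().__init__()
        self.phi_o = nn.Linear(d, num_classes)

    def forward(self, z):                        # z: (N, d)
        return F.softmax(self.phi_o(z), dim=-1)  # (N, num_classes)
```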
In a preferred embodiment of the present invention, in step 3), every two feature vectors in the feature map Z of the entities in the current segment obtained in step 2) are paired to form a new <subject, object> feature map of dimension (N×(N−1), 2d); a relative motion feature map Z_rm is also obtained; the <subject, object> feature map and the relative motion feature map Z_rm are then concatenated to generate a feature map Z′ of dimension (N×(N−1), 2d + d′), and the feature map Z′ is passed through a linear transformation layer and a sigmoid layer to generate the vector V^p of the predicted predicate distribution, according to the following formula:

V^p_{i,j} = sigmoid(φ_p([Z_i; Z_j; Z_rm(i, j)]))

where φ_p denotes a linear transformation, sigmoid() denotes the sigmoid layer, Z_i and Z_j denote the i-th and j-th row vectors of the feature map Z, Z_rm(i, j) denotes the element in the i-th row and j-th column of the relative motion feature map Z_rm, and [Z_i; Z_j; Z_rm(i, j)] denotes the end-to-end concatenation of the three.
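A minimal sketch of the predicate prediction head; the broadcasting used to pair features and the (N, N, d′) shape assumed for the relative motion feature map Z_rm are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PredicatePredictor(nn.Module):
    """V^p = sigmoid(phi_p([Z_i; Z_j; Z_rm(i, j)])) over all ordered pairs i != j
    (132 predicates in the VidVRD setting)."""
    def __init__(self, d, d_rm, num_predicates=132):
        super().__init__()
        self.phi_p = nn.Linear(2 * d + d_rm, num_predicates)

    def forward(self, z, z_rm):
        # z: (N, d) entity features; z_rm: (N, N, d_rm) relative motion features per pair
        n = z.size(0)
        subj = z.unsqueeze(1).expand(n, n, -1)         # subject feature for each pair
        obj = z.unsqueeze(0).expand(n, n, -1)          # object feature for each pair
        pairs = torch.cat([subj, obj, z_rm], dim=-1)   # (N, N, 2d + d_rm)
        mask = ~torch.eye(n, dtype=torch.bool)         # keep the N*(N-1) pairs with i != j
        return torch.sigmoid(self.phi_p(pairs[mask]))  # (N*(N-1), num_predicates)
```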
Example 1
The video visual relationship dataset VidVRD is used to test the video relationship detection capability of the method. The VidVRD dataset contains a total of 1000 videos that are annotated with object categories and the corresponding trajectories. Visual relationships are labeled over 35 object classes and 132 predicate classes in the form <subject, predicate, object>. Fig. 1 illustrates examples of VidVRD video visual relationship data; a visual relationship instance is represented by the relationship triplet <subject, predicate, object> and the trajectories of the subject and object.
The steps performed in this example are described below in conjunction with the specific technical solutions described above, as follows:
1) The first 80% of the VidVRD dataset is used as the pre-acquired training dataset, and the remaining 20% of the videos are used as test video data. Each video is divided into a plurality of segments, each segment containing 30 frames;
2) The fine-tuned Faster R-CNN is used as the entity detector to generate entity detection frames on each frame, and the frame-level entity frames within each segment are connected to generate entity track features. A vIoU threshold is set to reduce similar tracks, and the top N tracks are used as the input of the association network, where N is set to 5;
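For illustration, per-frame entity detection with an off-the-shelf Faster R-CNN might be run as sketched below (torchvision >= 0.13 assumed); the patent uses a detector fine-tuned on VidVRD, which this stand-in does not reproduce.

```python
import torch
import torchvision

def detect_entities(frames, score_threshold=0.5):
    """Run a COCO-pretrained Faster R-CNN on a list of frames.
    `frames` is a list of (3, H, W) float tensors scaled to [0, 1]."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    with torch.no_grad():
        outputs = model(frames)
    detections = []
    for out in outputs:
        keep = out['scores'] > score_threshold
        detections.append({'boxes': out['boxes'][keep],
                           'labels': out['labels'][keep],
                           'scores': out['scores'][keep]})
    return detections
```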
3) Three adjacent segments (the previous segment, the current segment and the next segment) are taken; the features of the previous segment and the current segment, and the features of the current segment and the next segment, are spliced respectively and input as two branches into the fully-connected space-time graph convolution network module. Each space-time graph convolution network module is composed of a geometry graph convolution network and an appearance graph convolution network;
4) In the geometry graph convolution network, the vIoU value is taken as the entry of an affine matrix, and the geometry graph convolution network is computed as:

X_g = norm(σ(A_g X W_g))    (1)

where A_g is the affine matrix of the geometry graph convolution network, X ∈ R^{N×d} is the entity feature input to the geometry graph convolution network, W_g ∈ R^{d×d} is an adaptive parameter matrix of the geometry graph convolution network, σ is a nonlinear activation function, norm is a normalization function, and X_g is the output of the geometry graph convolution network.
Each row of the affine matrix is then normalized by the Manhattan norm as follows:

A_g(i, j) = vIoU(T_i, T_j) / Σ_{k=1}^{N} vIoU(T_i, T_k)    (2)

At the same time, ReLU activation and Layer-Norm are applied to the output X_g of the geometry graph convolution network, so that the input and output dimensions of the geometry graph convolution network remain consistent;
5) In the appearance graph convolution network, two different linear transformations are first applied to the entity features and the results are multiplied to obtain appearance correlation values, which form an appearance correlation matrix A_a; each row is then rescaled with softmax:

A_a(i, j) = exp(φ(X_i)^T φ'(X_j)) / Σ_{k=1}^{N} exp(φ(X_i)^T φ'(X_k))    (3)

The appearance graph convolution network is computed as:

X_a = norm(σ(A_a X W_a))    (4)

where W_a is an adaptive parameter matrix of the appearance graph convolution network and X_a is the output of the appearance graph convolution network.
Then, ReLU activation and Layer-Norm are likewise applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network remain consistent.
6) The original feature X, the output X_g of the geometry graph convolution network and the output X_a of the appearance graph convolution network are added according to the following formula:

X′ = norm(σ(X_a + X + X_g))    (5)

The result X′ is then subjected to ReLU activation and normalization and input into the next space-time graph convolution network module.
7) The feature map Z ∈ R^{N×d} of the entities in the current segment is extracted from the outputs of the two branches by element-wise addition. On the one hand, Z is input into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting the entity classification:

V_i^o = softmax(φ_o(Z_i)), i ∈ [1, N]    (6)

On the other hand, every two feature vectors in Z are paired to form a new <subject, object> feature map of dimension (N×(N−1), 2d). The <subject, object> feature map and the relative motion feature map Z_rm are then concatenated to generate a feature map of dimension (N×(N−1), 2d + d′), and this feature map is passed through a linear transformation layer and a sigmoid layer to generate the vector V^p of the predicted predicate distribution:

V^p_{i,j} = sigmoid(φ_p([Z_i; Z_j; Z_rm(i, j)]))    (7)

Finally, for each relationship instance triplet <subject, predicate, object>, V^o and V^p are multiplied to obtain a confidence score, and the L relation instances with the highest scores in each segment are taken as the input of the association network;
8) The feature vectors of subjects or objects from two adjacent segments are input into the embedding network, and the confidence score α of the appearance similarity of the two entities is then computed by a cosine similarity function:

α = cos(emb(f_i), emb(f_j))    (8)

where T_i and T_j are any two tracks in consecutive segments and f_i and f_j are their features;
9) To take both geometric and appearance information into account, the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s:

s = w_g · vIoU(T_i, T_j) + w_a · α    (9)
10) The set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance. The set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance. The set Q_T is sorted in descending order of c_t and the set P in descending order of c.
Then a two-layer loop is performed: the outer loop traverses the set Q_T and the inner loop traverses the set P. For a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to formulas (8) and (9); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged. For a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n]    (10)
11) After the detection results are sorted in descending order of confidence score, Recall@K (K = 50 or 100) is used as the evaluation metric for video visual relationship detection; it denotes the proportion of correct video visual relationship instances detected among the top K detection results. The 5 highest-scoring results are also taken to evaluate the accuracy of the results.
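A simplified sketch of the Recall@K computation; the actual benchmark additionally requires trajectory overlap between a detected instance and the ground truth, which is omitted here.

```python
def recall_at_k(detections, ground_truth, k=50):
    """Fraction of ground-truth relation instances hit by the top-K detections.
    Matching is by triplet equality only in this sketch."""
    top_k = sorted(detections, key=lambda d: d['score'], reverse=True)[:k]
    hit = set()
    for det in top_k:
        for idx, gt in enumerate(ground_truth):
            if idx not in hit and det['triplet'] == gt['triplet']:
                hit.add(idx)
                break
    return len(hit) / max(len(ground_truth), 1)
```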
In this implementation example, the test results of the method of the invention are compared with the reference results provided with the VidVRD dataset. The results are shown in FIGS. 2 to 4. FIG. 2 plots the accuracy of VRD-GCN on the VidVRD dataset against the number of training epochs: the accuracy gradually increases as training proceeds and stabilizes after about 25 epochs. FIG. 3 shows the algorithm's iteration convergence curve: as the number of training epochs increases, the loss gradually decreases, the rate of decrease slows after about 25 epochs, and the loss approaches 0. FIG. 4 compares the VRD-GCN video relationship detection results with the reference results: for the same video segment, the detection results of the invention are richer and more accurate than those of the reference VidVRD method.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (9)

1. A video relation detection method based on a space-time diagram is characterized by comprising the following steps:
1) acquiring the entity characteristics of the video clip at the frame level and the entity track characteristics of the video clip;
2) respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and inputting the spliced entity characteristics and the entity track characteristics as two branches into a fully-connected space-time graph convolution network module; extracting the characteristic graph of an entity in the current segment from the output of two branches of the fully-connected space-time graph convolutional network module in an element addition mode;
3) obtaining a vector for predicting entity classification and a vector for predicting predicate distribution;
4) multiplying each vector for predicting entity classification with the vector of prediction predicate distribution, and taking the L relation instances with the highest scores in the multiplication results as the input of the association module with the twin network for each video segment; merging short-term relationship instances in the whole video by using an online association method of a twin network; acquiring an association confidence score;
5) arranging the detection results of step 4) in descending order according to the confidence scores to obtain the video relation detection result.
2. The method according to claim 1, wherein the step 1) comprises:
dividing the video into a plurality of segments, each segment comprising a plurality of frames; for each fragment, generating an entity detection frame on each frame, extracting entity characteristics, and connecting the frame-level entity frames in each fragment to generate entity track characteristics; and sorting the generated entity tracks in a descending order according to the vIoU value, and taking the first N tracks as the entity track characteristics of the segment.
3. The method for detecting video relationship based on space-time diagram according to claim 2, wherein after generating the entity track features in step 1), the method further comprises: setting a vIoU threshold value and removing the entity tracks below the threshold value.
4. The spatio-temporal graph-based video relationship detection method according to claim 1, wherein the spatio-temporal graph convolution network module in the step 2) is composed of a geometry graph convolution network and an appearance graph convolution network;
in the geometry graph convolution network, the vIoU value is used as the entry of an affine matrix, and each row of the affine matrix is then normalized by the Manhattan norm; ReLU activation and Layer-Norm are applied to the output X_g of the geometry graph convolution network, so that the input and output dimensions of the geometry graph convolution network remain consistent;
in the appearance graph convolution network, two different linear transformations are applied to the entity features input to the appearance graph convolution network, and the results are then multiplied to obtain appearance correlation values, which form an appearance correlation matrix A_a; each row is rescaled with softmax; ReLU activation and Layer-Norm are applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network remain consistent.
5. The spatio-temporal graph-based video relationship detection method according to claim 4, wherein the entity feature X input into the spatio-temporal graph convolution network module, the output X_g of the geometry graph convolution network and the output X_a of the appearance graph convolution network are added according to the following formula:

X′ = norm(σ(X_a + X + X_g))

and the result X′ is then subjected to ReLU activation and normalization and input into the next spatio-temporal graph convolution network module.
6. The method according to claim 1, wherein in step 3),
the feature map Z of the entities in the current segment obtained in step 2) is input into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting the entity classification, according to the following formula:

V_i^o = softmax(φ_o(Z_i)), i ∈ [1, N]

where Z_i is the feature vector of the i-th row of the feature map Z; φ_o(Z_i) denotes a linear transformation applied to Z_i; the dimension of the feature map Z is (N, d); and V_i^o denotes the i-th element of the vector V^o.
7. The method according to claim 1, wherein in step 3),
every two feature vectors in the feature map Z of the entities in the current segment obtained in step 2) are paired to form a new <subject, object> feature map of dimension (N×(N−1), 2d); a relative motion feature map Z_rm is obtained; the <subject, object> feature map and the relative motion feature map Z_rm are then concatenated to generate a feature map Z′ of dimension (N×(N−1), 2d + d′), and the feature map Z′ is passed through a linear transformation layer and a sigmoid layer to generate the vector V^p of the predicted predicate distribution, according to the following formula:

V^p_{i,j} = sigmoid(φ_p([Z_i; Z_j; Z_rm(i, j)]))

where φ_p denotes a linear transformation, sigmoid() denotes the sigmoid layer, Z_i and Z_j denote the i-th and j-th row vectors of the feature map Z, Z_rm(i, j) denotes the element in the i-th row and j-th column of the relative motion feature map Z_rm, and [Z_i; Z_j; Z_rm(i, j)] denotes the end-to-end concatenation of the three.
8. The method for detecting video relationships based on a spatio-temporal graph according to claim 1, wherein in step 4), the processing procedure of the association module with the twin network is as follows:
4.1) the feature vectors of any two tracks from two adjacent segments are input into a twin network, which is an embedding network consisting of three linear transformation layers; the confidence α of the appearance similarity of the two entities is then computed by a cosine similarity function, according to the following formula:

α = cos(emb(f_i), emb(f_j))

where emb() denotes the embedding network, cos(·, ·) denotes the cosine similarity function, T_i and T_j are any two tracks in two adjacent segments, and f_i and f_j are the features of tracks T_i and T_j, respectively;
4.2) the geometric information and the appearance information are considered simultaneously: the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s, according to the following formula:

s = w_g · vIoU(T_i, T_j) + w_a · α

where w_g and w_a are the weights of the geometric term and the appearance term, respectively;
4.3) the set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance; the set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance; the set Q_T is sorted in descending order of c_t and the set P in descending order of c;
then a two-layer loop is performed, in which the outer loop traverses the set Q_T and the inner loop traverses the set P; for a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to steps 4.1) and 4.2); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged; for a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n].
9. a video relationship detection system based on a space-time diagram, comprising:
the characteristic extraction module is used for acquiring the frame-level entity characteristics of the video clips and connecting the frame-level entity frames in each clip to generate entity track characteristics;
the characteristic splicing module is used for respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and the spliced entity characteristics and the entity track characteristics are used as two branch inputs of the fully-connected space-time graph convolution network module;
the fully-connected space-time graph convolution network module is provided with two branches and comprises a plurality of space-time graph convolution network modules; each space-time graph convolution network module consists of a geometry graph convolution network and an appearance graph convolution network; the entity characteristics input into the space-time graph convolution network module are added to the output of the geometry graph convolution network and the output of the appearance graph convolution network in that module to obtain the output result of the module, and the output result, after ReLU activation and normalization, is taken as the input of the next space-time graph convolution network module;
the characteristic graph extraction module extracts the characteristic graph of the entity in the current segment from the output of the two branches of the fully-connected space-time graph convolution network module in an element addition mode;
a first feature vector generation unit for obtaining a vector of the predicted entity classification;
a second feature vector generation unit configured to obtain a vector of the predicted predicate distribution;
the relation instance module multiplies each vector for predicting entity classification by the vector of prediction predicate distribution, and for each video segment, the L relation instances with the highest scores in the multiplication results are taken as the input of the association module with the twin network;
the association module with the twin network combines the short-term relationship examples in the whole video by using an online association method of the twin network; acquiring an association confidence score;
and the detection result output module is used for arranging the detection results of the association module with the twin network in a descending order according to the confidence score and outputting the video relation detection result.
CN202011280036.1A 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram Pending CN112347965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011280036.1A CN112347965A (en) 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011280036.1A CN112347965A (en) 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram

Publications (1)

Publication Number Publication Date
CN112347965A true CN112347965A (en) 2021-02-09

Family

ID=74362926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011280036.1A Pending CN112347965A (en) 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram

Country Status (1)

Country Link
CN (1) CN112347965A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125406A (en) * 2019-12-23 2020-05-08 天津大学 Visual relation detection method based on self-adaptive cluster learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUFENG QIAN et al.: "Video Relation Detection with Spatio-Temporal Graph", Session 1A: Multimodal Fusion & Visual Relations *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883868A (en) * 2021-02-10 2021-06-01 中国科学技术大学 Training method of weak surveillance video motion positioning model based on relational modeling
CN112883868B (en) * 2021-02-10 2022-07-15 中国科学技术大学 Training method of weak supervision video motion positioning model based on relational modeling
CN113569559A (en) * 2021-07-23 2021-10-29 北京智慧星光信息技术有限公司 Short text entity emotion analysis method and system, electronic equipment and storage medium
CN113569559B (en) * 2021-07-23 2024-02-02 北京智慧星光信息技术有限公司 Short text entity emotion analysis method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Alani et al. Hand gesture recognition using an adapted convolutional neural network with data augmentation
Heo et al. Deepfake detection algorithm based on improved vision transformer
CN106897738A (en) A kind of pedestrian detection method based on semi-supervised learning
Fang et al. Orthogonal self-guided similarity preserving projection for classification and clustering
Reddy et al. AdaCrowd: Unlabeled scene adaptation for crowd counting
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network
CN116363738A (en) Face recognition method, system and storage medium based on multiple moving targets
CN111144220B (en) Personnel detection method, device, equipment and medium suitable for big data
CN112347965A (en) Video relation detection method and system based on space-time diagram
CN111462184B (en) Online sparse prototype tracking method based on twin neural network linear representation model
Samadiani et al. A multiple feature fusion framework for video emotion recognition in the wild
CN115223239B (en) Gesture recognition method, gesture recognition system, computer equipment and readable storage medium
Simao et al. Improving novelty detection with generative adversarial networks on hand gesture data
Yao et al. Recurrent graph convolutional autoencoder for unsupervised skeleton-based action recognition
CN112927266A (en) Weak supervision time domain action positioning method and system based on uncertainty guide training
CN112668438A (en) Infrared video time sequence behavior positioning method, device, equipment and storage medium
Qiao et al. HyperSOR: Context-aware graph hypernetwork for salient object ranking
Yang et al. A feature learning approach for face recognition with robustness to noisy label based on top-N prediction
Sun et al. Dual GroupGAN: An unsupervised four-competitor (2V2) approach for video anomaly detection
CN111582057B (en) Face verification method based on local receptive field
Negi et al. End-to-end residual learning-based deep neural network model deployment for human activity recognition
Pryor et al. Deepfake detection analyzing hybrid dataset utilizing CNN and SVM
CN116311345A (en) Transformer-based pedestrian shielding re-recognition method
CN113869193B (en) Training method of pedestrian re-recognition model, pedestrian re-recognition method and system
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210209