CN112347965A - Video relation detection method and system based on space-time diagram - Google Patents

Video relation detection method and system based on space-time diagram

Info

Publication number
CN112347965A
CN112347965A (application number CN202011280036.1A)
Authority
CN
China
Prior art keywords
entity
video
convolution network
segment
space
Prior art date
Legal status
Pending
Application number
CN202011280036.1A
Other languages
Chinese (zh)
Inventor
庄越挺
肖俊
汤斯亮
吴飞
杨易
李晓林
谭炽烈
蒋韬
Current Assignee
Zhejiang University ZJU
Tongdun Holdings Co Ltd
Original Assignee
Zhejiang University ZJU
Tongdun Holdings Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU, Tongdun Holdings Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202011280036.1A
Publication of CN112347965A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a video relation detection method and system based on a space-time graph. First, the set of entities in a video and their relationships are modeled as a fully connected spatio-temporal graph that includes the entity nodes in the neighborhood of the temporal and spatial dimensions. For relationship detection, the invention provides a video relation detection graph convolution network model (VRD-GCN) that aggregates information from context and performs reasoning in the space-time graph. On one hand, VRD-GCN detects dynamic relationships between entities by capturing the relative changes in geometry and appearance of the entities along the spatio-temporal dimension. On the other hand, by passing messages from nodes and context in the neighborhood of the space-time graph to the target entity, VRD-GCN can generate more accurate and complete detection results. After relationship instances are detected in each video segment, the short-term relationship instances throughout the video are merged by an online association method using a twin (Siamese) network. The method achieves high accuracy for relation detection in video.

Description

Video relation detection method and system based on space-time diagram
Technical Field
The invention relates to video relation detection, a space-time graph convolutional neural network and a twin association network in machine learning and computer vision research, in particular to a video relation (visual relation) detection method and system based on a space-time graph.
Background
Understanding visual information is a primary goal of computer vision. Relationship detection in visual content requires capturing fine-grained visual cues, including locating entities and the way they interact, which is a challenging but meaningful task. Although the relationships between objects in video are an important component of deep understanding of dynamic visual content, relationship detection and reasoning in video has rarely been studied. Successful attempts at detecting video relationships will not only help build more effective models for certain advanced visual understanding tasks (e.g., visual question answering and visual captioning), but will also promote the development of other areas of computer vision, such as video retrieval, video motion detection and video activity recognition.
A number of recent studies have achieved exciting and important results in static-image relationship detection. A natural solution for relationship detection in video is to extend these methods directly to video. However, satisfactory results cannot be obtained because of the inherent differences between images and videos: methods designed for static-image relationship detection and reasoning tend to ignore the dynamic interactions between entities that constantly occur in video. Given the nature of video, a relationship detection and inference solution for video should be able to capture dynamic, time-varying relationships between entities. The paper "Video Visual Relation Detection" by Xindi Shang et al. is so far the only attempt focused on detecting relationships in video; however, its performance is limited, in part because it lacks the ability to gather cues from the surrounding context.
To address these problems, the invention provides a video relation detection method based on a space-time graph. Unlike the aforementioned method, the present method uses message passing between entities to perform video relationship prediction. In addition, because scene changes or trajectory drift can make it impossible to determine from geometric overlap alone whether two trajectories in consecutive segments belong to the same entity, the invention provides a novel online association method using a twin network; this method considers both appearance similarity and the geometric overlap of relationship instances, which greatly improves accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a video relation detection method and system based on a space-time diagram.
The invention firstly discloses a video relation detection method based on a space-time diagram, which comprises the following steps:
1) acquiring the entity characteristics of the video clip at the frame level and the entity track characteristics of the video clip;
2) respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and inputting the spliced entity characteristics and the entity track characteristics as two branches into a fully-connected space-time graph convolution network module; extracting the characteristic graph of an entity in the current segment from the output of two branches of the fully-connected space-time graph convolutional network module in an element addition mode;
3) obtaining a vector for predicting entity classification and a vector for predicting predicate distribution;
4) multiplying each vector for predicting entity classification with the vector of prediction predicate distribution, and taking the L relation instances with the highest scores in the multiplication results as the input of the association module with the twin network for each video segment; merging short-term relationship instances in the whole video by using an online association method of a twin network; acquiring an association confidence score;
5) arranging the detection results of step 4) in descending order according to the confidence scores to obtain the video relation detection result.
Preferably, the processing procedure of the association module with the twin network is as follows:
4.1) The feature vectors of any two tracks from two adjacent segments are input into a twin network, which is an embedding network consisting of three linear transformation layers; the confidence α of the appearance similarity of the two entities is then computed by a cosine similarity function, according to the following formula:

α = cos(emb(f_i), emb(f_j))

where emb() denotes the embedding network, cos(·, ·) denotes the cosine similarity function, T_i and T_j are any two tracks in two adjacent segments, and f_i and f_j are the features of tracks T_i and T_j, respectively;
4.2) The geometric information and the appearance information are considered simultaneously: the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s, according to the following formula:

s = w_g · vIoU(T_i, T_j) + w_a · α

where w_g and w_a are the weights of the geometric term and the appearance term, respectively;
4.3) The set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance. The set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance. The set Q_T is sorted in descending order of c_t and the set P in descending order of c.
Then a two-layer loop is performed: the outer loop traverses the set Q_T and the inner loop traverses the set P. For a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to steps 4.1) and 4.2); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged. For a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n].
the invention also discloses a video relation detection system based on the space-time diagram, which comprises the following steps:
the characteristic extraction module is used for acquiring the frame-level entity characteristics of the video clips and connecting the frame-level entity frames in each clip to generate entity track characteristics;
the characteristic splicing module is used for respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and the spliced entity characteristics and the entity track characteristics are used as two branch inputs of the fully-connected space-time graph convolution network module;
the fully-connected space-time graph convolution network module is provided with two branches and comprises a plurality of space-time graph convolution network modules; each space-time graph convolution network module consists of a geometry graph convolution network and an appearance graph convolution network; the entity characteristics input into the space-time graph convolution network module are added to the output of the geometry graph convolution network and the output of the appearance graph convolution network in that module to obtain the output result of the module, and the output result, after ReLU activation and normalization, is taken as the input of the next space-time graph convolution network module;
the characteristic graph extraction module extracts the characteristic graph of the entity in the current segment from the output of the two branches of the fully-connected space-time graph convolution network module in an element addition mode;
a first feature vector generation unit for obtaining a vector of the predicted entity classification;
a second feature vector generation unit configured to obtain a vector of the predicted predicate distribution;
the relation instance module multiplies each vector for predicting entity classification by the vector of prediction predicate distribution, and for each video segment, the L relation instances with the highest scores in the multiplication results are taken as the input of the association module with the twin network;
the association module with the twin network combines the short-term relationship examples in the whole video by using an online association method of the twin network; acquiring an association confidence score;
and the detection result output module is used for arranging the detection results of the association module with the twin network in a descending order according to the confidence score and outputting the video relation detection result.
Because an association method with a twin network is adopted, the invention overcomes the problem of the greedy association algorithm used in the prior art, which relies only on geometric information and produces inaccurate results when trajectory generation is inaccurate or trajectory drift occurs; the accuracy of the trajectory association result is thus effectively improved, and the performance of the association algorithm is enhanced. In addition, the space-time-graph-based video relation detection model VRD-GCN abstracts a video into a fully-connected space-time graph, passes messages within the space-time graph and performs reasoning; the method is novel and achieves excellent video relation detection results.
Drawings
FIG. 1 shows a sample of VidVRD video visual relationship data;
FIG. 2 is a curve of accuracy versus training epoch for VRD-GCN on the VidVRD dataset;
FIG. 3 is the algorithm's iteration convergence curve;
FIG. 4 compares VRD-GCN video relationship detection results with the reference results;
FIG. 5 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 5, the video relationship detection method based on the space-time diagram of the present invention includes the following steps:
1) acquiring the entity characteristics of the video clip at the frame level and the entity track characteristics of the video clip;
dividing the video into a plurality of segments, each segment comprising a plurality of frames; for each segment, generating entity detection frames (bounding boxes) on each frame, extracting entity characteristics, and connecting the frame-level entity frames within each segment to generate entity track characteristics; and sorting the generated entity tracks in descending order according to the vIoU value and taking the first N tracks as the entity track characteristics of the segment.
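For illustration, a minimal sketch of how the vIoU (volumetric IoU over two trajectories) used for ranking and de-duplicating tracks might be computed is given below; the per-frame accumulation and the dictionary-based track representation are assumptions, as the patent does not spell out the exact definition.

```python
def viou(track_a, track_b):
    """Volumetric IoU between two tracklets (a sketch; the patent's exact definition may differ).

    Each track is a dict {frame_index: [x1, y1, x2, y2]}. The score is the sum of
    per-frame box intersections divided by the sum of per-frame box unions, taken
    over all frames covered by either track.
    """
    def area(box):
        return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

    inter_total, union_total = 0.0, 0.0
    for f in set(track_a) | set(track_b):
        a, b = track_a.get(f), track_b.get(f)
        if a is None or b is None:          # frame covered by only one of the two tracks
            union_total += area(a if a is not None else b)
            continue
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        inter_total += inter
        union_total += area(a) + area(b) - inter
    return inter_total / union_total if union_total > 0 else 0.0
```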
2) Respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and inputting the spliced entity characteristics and the entity track characteristics as two branches into a fully-connected space-time graph convolution network module; extracting the characteristic graph of an entity in the current segment from the output of two branches of the fully-connected space-time graph convolutional network module in an element addition mode;
3) obtaining a vector for predicting entity classification and a vector for predicting predicate distribution;
4) multiplying each vector for predicting entity classification with the vector of prediction predicate distribution, and taking the L relation instances with the highest scores in the multiplication results as the input of the association module with the twin network for each video segment; merging short-term relationship instances in the whole video by using an online association method of a twin network; acquiring an association confidence score;
specifically, the step 4) is as follows:
4.1) The feature vectors of any two tracks from two adjacent segments are input into a twin network, which is an embedding network consisting of three linear transformation layers; the confidence α of the appearance similarity of the two entities is then computed by a cosine similarity function, according to the following formula:

α = cos(emb(f_i), emb(f_j))

where emb() denotes the embedding network, cos(·, ·) denotes the cosine similarity function, T_i and T_j are any two tracks in two adjacent segments, and f_i and f_j are the features of tracks T_i and T_j, respectively;
4.2) The geometric information and the appearance information are considered simultaneously: the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s, according to the following formula:

s = w_g · vIoU(T_i, T_j) + w_a · α

where w_g and w_a are the weights of the geometric term and the appearance term, respectively;
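For illustration, steps 4.1) and 4.2) might be sketched in PyTorch as follows; the layer widths of the three-layer embedding network and the weights w_a and w_g are placeholders rather than values taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackEmbedding(nn.Module):
    """Twin (Siamese) embedding network of three linear transformation layers.
    The hidden sizes below are assumptions."""
    def __init__(self, dim_in=1024, dim_hidden=512, dim_out=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim_in, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, x):
        return self.layers(x)

def association_confidence(emb, feat_i, feat_j, viou_ij, w_a=0.5, w_g=0.5):
    """Step 4.1: appearance confidence alpha = cos(emb(f_i), emb(f_j)).
    Step 4.2: weighted sum of alpha and the geometric vIoU (weights are placeholders)."""
    alpha = F.cosine_similarity(emb(feat_i), emb(feat_j), dim=-1)
    return w_g * viou_ij + w_a * alpha
```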
4.3) The set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance. The set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance. The set Q_T is sorted in descending order of c_t and the set P in descending order of c.
Then a two-layer loop is performed: the outer loop traverses the set Q_T and the inner loop traverses the set P. For a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to steps 4.1) and 4.2); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged. For a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n].
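A minimal sketch of the two-layer loop of step 4.3) is given below; the field names, and the handling of short-term instances that match no existing long-term instance (starting a new long-term instance), are assumptions made for the sake of a complete example.

```python
def merge_short_term(short_instances, long_instances, assoc_score, threshold):
    """Associate the current segment's short-term relation instances with the
    long-term instances detected so far (step 4.3, sketched).

    Each instance is a dict with keys 'score', 'triplet' (<subject, predicate, object>),
    'subj_track' and 'obj_track'; `assoc_score(track_a, track_b)` is the association
    confidence of steps 4.1)-4.2). Field names are illustrative, not from the patent.
    """
    short_instances = sorted(short_instances, key=lambda r: r['score'], reverse=True)
    long_instances = sorted(long_instances, key=lambda r: r['score'], reverse=True)
    for st in short_instances:                      # outer loop over Q_T
        merged = False
        for lt in long_instances:                   # inner loop over P
            if st['triplet'] != lt['triplet']:
                continue
            if (assoc_score(st['subj_track'], lt['subj_track']) > threshold and
                    assoc_score(st['obj_track'], lt['obj_track']) > threshold):
                lt['subj_track'].extend(st['subj_track'])      # extend the trajectories
                lt['obj_track'].extend(st['obj_track'])
                lt['score'] = max(lt['score'], st['score'])    # c_p = max(c_t)
                merged = True
                break
        if not merged:                              # assumed: unmatched instances start new ones
            long_instances.append(dict(st))
    return long_instances
```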
5) arranging the detection results of step 4) in descending order according to the confidence scores to obtain the video relation detection result.
In a preferred embodiment of the present invention, after the entity track features are generated in step 1), the method further includes: setting a vIoU threshold and removing the entity tracks below the threshold (to reduce similar tracks).
In a preferred embodiment of the present invention, the space-time graph convolution network module in step 2) is composed of a geometry graph convolution network and an appearance graph convolution network;
in the geometry graph convolution network, the vIoU value is taken as the entry of an affine matrix, and each row of the affine matrix is then normalized by the Manhattan norm, according to the following formula:

A_g(i, j) = vIoU(T_i, T_j) / Σ_{k=1}^{N} vIoU(T_i, T_k)

where T_i denotes the i-th track, T_j denotes the j-th track, vIoU(T_i, T_j) denotes the vIoU value of the i-th and j-th tracks, N denotes the total number of tracks, and A_g(i, j) denotes the value in the i-th row and j-th column of the affine matrix.
The geometry graph convolution network is computed as follows:

X_g = norm(σ(A_g X W_g))

where A_g is the affine matrix of the geometry graph convolution network, X ∈ R^{N×d} is the entity feature input to the geometry graph convolution network, W_g ∈ R^{d×d} is an adaptive parameter matrix of the geometry graph convolution network, σ is a nonlinear activation function, norm is a normalization function, and X_g is the output of the geometry graph convolution network.
At the same time, ReLU activation and Layer-Norm are applied to the output X_g of the geometry graph convolution network, so that the input and output dimensions of the geometry graph convolution network remain consistent.
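For illustration, a PyTorch sketch of the geometry graph convolution described above; folding σ, ReLU and Layer-Norm into a single pass is an assumption about the exact ordering of the operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryGraphConv(nn.Module):
    """X_g = norm(sigma(A_g X W_g)), where A_g holds pairwise vIoU values
    and each row is normalized by its Manhattan (L1) norm."""
    def __init__(self, d):
        super().__init__()
        self.w_g = nn.Linear(d, d, bias=False)   # adaptive parameter matrix W_g
        self.norm = nn.LayerNorm(d)

    def forward(self, x, viou_matrix):
        # x: (N, d) entity features; viou_matrix: (N, N) pairwise vIoU values
        a_g = viou_matrix / viou_matrix.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return self.norm(F.relu(a_g @ self.w_g(x)))
```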
In the appearance graph convolution network, two different linear transformations are first applied to the entity features input to the appearance graph convolution network, and the results are multiplied to obtain appearance correlation values, which form an appearance correlation matrix A_a; each row is then rescaled with softmax, according to the following formula:

A_a(i, j) = exp(φ(X_i)^T φ'(X_j)) / Σ_{k=1}^{N} exp(φ(X_i)^T φ'(X_k))

where X_i denotes the i-th entity feature, X_j denotes the j-th entity feature, φ(X_i)^T denotes the transpose of the i-th entity feature after one linear transformation, φ'(X_j) denotes the other linear transformation applied to the j-th entity feature, exp() denotes the exponential function with the natural constant e as the base, N denotes the number of entity features, and A_a(i, j) denotes the value in the i-th row and j-th column of the appearance correlation matrix.
The appearance graph convolution network is computed as follows:

X_a = norm(σ(A_a X W_a))

where W_a is an adaptive parameter matrix of the appearance graph convolution network and X_a is the output of the appearance graph convolution network.
Then, ReLU activation and Layer-Norm are likewise applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network remain consistent.
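A corresponding sketch of the appearance graph convolution, under the same assumptions as the geometry sketch above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppearanceGraphConv(nn.Module):
    """A_a = softmax(phi(X) phi'(X)^T) row-wise, then X_a = norm(sigma(A_a X W_a))."""
    def __init__(self, d):
        super().__init__()
        self.phi = nn.Linear(d, d)        # first linear transformation
        self.phi_prime = nn.Linear(d, d)  # second linear transformation
        self.w_a = nn.Linear(d, d, bias=False)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        # x: (N, d) entity features
        a_a = F.softmax(self.phi(x) @ self.phi_prime(x).transpose(-1, -2), dim=-1)
        return self.norm(F.relu(a_a @ self.w_a(x)))
```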
In a preferred embodiment of the invention, the entity feature X input into the spatio-temporal graph convolution network module, the output X_g of the geometry graph convolution network and the output X_a of the appearance graph convolution network are added according to the following formula:

X′ = norm(σ(X_a + X + X_g))

The result X′ is then subjected to ReLU activation and normalization and input into the next spatio-temporal graph convolution network module.
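Reusing the two sub-network sketches above, one spatio-temporal graph convolution block that fuses X_a, X and X_g might then look as follows (the wiring is an assumption consistent with the formula):

```python
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalGCBlock(nn.Module):
    """X' = norm(sigma(X_a + X + X_g)); assumes GeometryGraphConv and
    AppearanceGraphConv from the sketches above are in scope."""
    def __init__(self, d):
        super().__init__()
        self.geo = GeometryGraphConv(d)
        self.app = AppearanceGraphConv(d)
        self.norm = nn.LayerNorm(d)

    def forward(self, x, viou_matrix):
        x_g = self.geo(x, viou_matrix)
        x_a = self.app(x)
        return self.norm(F.relu(x_a + x + x_g))   # result feeds the next block
```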
In a preferred embodiment of the present invention, in step 3), the feature map Z of the entities in the current segment obtained in step 2) is input into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting the entity classification, according to the following formula:

V_i^o = softmax(φ_o(Z_i)), i ∈ [1, N]

where Z_i is the feature vector of the i-th row of the feature map Z; φ_o(Z_i) denotes a linear transformation applied to Z_i; the dimension of the feature map Z is (N, d); and V_i^o denotes the i-th element of the vector V^o.
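A minimal sketch of the entity classification head; the 35-class output matches the VidVRD object vocabulary used in the embodiment, and the module name is illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class EntityClassifier(nn.Module):
    """V_i^o = softmax(phi_o(Z_i)) for every row Z_i of the feature map Z."""
    def __init__(self, d, num_classes=35):
        super().__init__()
        self.phi_o = nn.Linear(d, num_classes)

    def forward(self, z):                        # z: (N, d)
        return F.softmax(self.phi_o(z), dim=-1)  # (N, num_classes)
```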
In a preferred embodiment of the present invention, in step 3), every two feature vectors in the feature map Z of the entities in the current segment obtained in step 2) are paired to form a new <subject, object> feature map of dimension (N×(N−1), 2d); a relative motion feature map Z_rm is also obtained; the <subject, object> feature map and the relative motion feature map Z_rm are then concatenated to generate a feature map Z′ of dimension (N×(N−1), 2d + d′), and the feature map Z′ is passed through a linear transformation layer and a sigmoid layer to generate the vector V^p of the predicted predicate distribution, according to the following formula:

V^p_{i,j} = sigmoid(φ_p([Z_i; Z_j; Z_rm(i, j)]))

where φ_p denotes a linear transformation, sigmoid() denotes the sigmoid layer, Z_i and Z_j denote the i-th and j-th row vectors of the feature map Z, Z_rm(i, j) denotes the element in the i-th row and j-th column of the relative motion feature map Z_rm, and [Z_i; Z_j; Z_rm(i, j)] denotes the end-to-end concatenation of the three.
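A minimal sketch of the predicate prediction head; the broadcasting used to pair features and the (N, N, d′) shape assumed for the relative motion feature map Z_rm are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PredicatePredictor(nn.Module):
    """V^p = sigmoid(phi_p([Z_i; Z_j; Z_rm(i, j)])) over all ordered pairs i != j
    (132 predicates in the VidVRD setting)."""
    def __init__(self, d, d_rm, num_predicates=132):
        super().__init__()
        self.phi_p = nn.Linear(2 * d + d_rm, num_predicates)

    def forward(self, z, z_rm):
        # z: (N, d) entity features; z_rm: (N, N, d_rm) relative motion features per pair
        n = z.size(0)
        subj = z.unsqueeze(1).expand(n, n, -1)         # subject feature for each pair
        obj = z.unsqueeze(0).expand(n, n, -1)          # object feature for each pair
        pairs = torch.cat([subj, obj, z_rm], dim=-1)   # (N, N, 2d + d_rm)
        mask = ~torch.eye(n, dtype=torch.bool)         # keep the N*(N-1) pairs with i != j
        return torch.sigmoid(self.phi_p(pairs[mask]))  # (N*(N-1), num_predicates)
```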
Example 1
The video visual relationship dataset VidVRD is used to test the video relationship detection capability of the method. The VidVRD dataset contains a total of 1000 videos that are annotated with object categories and the corresponding trajectories. Visual relationships are labeled over 35 object classes and 132 predicate classes in the form <subject, predicate, object>. Fig. 1 illustrates examples of VidVRD video visual relationship data; a visual relationship instance is represented by the relationship triplet <subject, predicate, object> and the trajectories of the subject and object.
The steps performed in this example are described below in conjunction with the specific technical solutions described above, as follows:
1) The first 80% of the VidVRD dataset is used as the pre-acquired training dataset, and the remaining 20% of the videos are used as test video data. Each video is divided into a plurality of segments, each segment containing 30 frames;
2) The fine-tuned Faster R-CNN is used as the entity detector to generate entity detection frames on each frame, and the frame-level entity frames within each segment are connected to generate entity track features. A vIoU threshold is set to reduce similar tracks, and the top N tracks are used as the input of the association network, where N is set to 5;
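For illustration, per-frame entity detection with an off-the-shelf Faster R-CNN might be run as sketched below (torchvision >= 0.13 assumed); the patent uses a detector fine-tuned on VidVRD, which this stand-in does not reproduce.

```python
import torch
import torchvision

def detect_entities(frames, score_threshold=0.5):
    """Run a COCO-pretrained Faster R-CNN on a list of frames.
    `frames` is a list of (3, H, W) float tensors scaled to [0, 1]."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    with torch.no_grad():
        outputs = model(frames)
    detections = []
    for out in outputs:
        keep = out['scores'] > score_threshold
        detections.append({'boxes': out['boxes'][keep],
                           'labels': out['labels'][keep],
                           'scores': out['scores'][keep]})
    return detections
```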
3) Three adjacent segments (the previous segment, the current segment and the next segment) are taken; the features of the previous segment and the current segment, and the features of the current segment and the next segment, are spliced respectively and input as two branches into the fully-connected space-time graph convolution network module. Each space-time graph convolution network module is composed of a geometry graph convolution network and an appearance graph convolution network;
4) In the geometry graph convolution network, the vIoU value is taken as the entry of an affine matrix, and the geometry graph convolution network is computed as:

X_g = norm(σ(A_g X W_g))    (1)

where A_g is the affine matrix of the geometry graph convolution network, X ∈ R^{N×d} is the entity feature input to the geometry graph convolution network, W_g ∈ R^{d×d} is an adaptive parameter matrix of the geometry graph convolution network, σ is a nonlinear activation function, norm is a normalization function, and X_g is the output of the geometry graph convolution network.
Each row of the affine matrix is then normalized by the Manhattan norm as follows:

A_g(i, j) = vIoU(T_i, T_j) / Σ_{k=1}^{N} vIoU(T_i, T_k)    (2)

At the same time, ReLU activation and Layer-Norm are applied to the output X_g of the geometry graph convolution network, so that the input and output dimensions of the geometry graph convolution network remain consistent;
5) In the appearance graph convolution network, two different linear transformations are first applied to the entity features and the results are multiplied to obtain appearance correlation values, which form an appearance correlation matrix A_a; each row is then rescaled with softmax:

A_a(i, j) = exp(φ(X_i)^T φ'(X_j)) / Σ_{k=1}^{N} exp(φ(X_i)^T φ'(X_k))    (3)

The appearance graph convolution network is computed as:

X_a = norm(σ(A_a X W_a))    (4)

where W_a is an adaptive parameter matrix of the appearance graph convolution network and X_a is the output of the appearance graph convolution network.
Then, ReLU activation and Layer-Norm are likewise applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network remain consistent.
6) The original feature X, the output X_g of the geometry graph convolution network and the output X_a of the appearance graph convolution network are added according to the following formula:

X′ = norm(σ(X_a + X + X_g))    (5)

The result X′ is then subjected to ReLU activation and normalization and input into the next space-time graph convolution network module.
7) The feature map Z ∈ R^{N×d} of the entities in the current segment is extracted from the outputs of the two branches by element-wise addition. On the one hand, Z is input into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting the entity classification:

V_i^o = softmax(φ_o(Z_i)), i ∈ [1, N]    (6)

On the other hand, every two feature vectors in Z are paired to form a new <subject, object> feature map of dimension (N×(N−1), 2d). The <subject, object> feature map and the relative motion feature map Z_rm are then concatenated to generate a feature map of dimension (N×(N−1), 2d + d′), and this feature map is passed through a linear transformation layer and a sigmoid layer to generate the vector V^p of the predicted predicate distribution:

V^p_{i,j} = sigmoid(φ_p([Z_i; Z_j; Z_rm(i, j)]))    (7)

Finally, for each relationship instance triplet <subject, predicate, object>, V^o and V^p are multiplied to obtain a confidence score, and the L relation instances with the highest scores in each segment are taken as the input of the association network;
8) The feature vectors of subjects or objects from two adjacent segments are input into the embedding network, and the confidence score α of the appearance similarity of the two entities is then computed by a cosine similarity function:

α = cos(emb(f_i), emb(f_j))    (8)

where T_i and T_j are any two tracks in consecutive segments and f_i and f_j are their features;
9) To take both geometric and appearance information into account, the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s:

s = w_g · vIoU(T_i, T_j) + w_a · α    (9)
10) The set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance. The set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance. The set Q_T is sorted in descending order of c_t and the set P in descending order of c.
Then a two-layer loop is performed: the outer loop traverses the set Q_T and the inner loop traverses the set P. For a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to formulas (8) and (9); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged. For a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n]    (10)
11) After the detection results are sorted in descending order of confidence score, Recall@K (K = 50 or 100) is used as the evaluation metric for video visual relationship detection; it denotes the proportion of correct video visual relationship instances detected among the top K detection results. The 5 highest-scoring results are also taken to evaluate the accuracy of the results.
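A simplified sketch of the Recall@K computation; the actual benchmark additionally requires trajectory overlap between a detected instance and the ground truth, which is omitted here.

```python
def recall_at_k(detections, ground_truth, k=50):
    """Fraction of ground-truth relation instances hit by the top-K detections.
    Matching is by triplet equality only in this sketch."""
    top_k = sorted(detections, key=lambda d: d['score'], reverse=True)[:k]
    hit = set()
    for det in top_k:
        for idx, gt in enumerate(ground_truth):
            if idx not in hit and det['triplet'] == gt['triplet']:
                hit.add(idx)
                break
    return len(hit) / max(len(ground_truth), 1)
```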
In this implementation example, the test results of the method of the invention are compared with the reference results provided with the VidVRD dataset. The results are shown in FIGS. 2 to 4. FIG. 2 plots the accuracy of VRD-GCN on the VidVRD dataset against the number of training epochs: the accuracy gradually increases as training proceeds and stabilizes after about 25 epochs. FIG. 3 shows the algorithm's iteration convergence curve: as the number of training epochs increases, the loss gradually decreases, the rate of decrease slows after about 25 epochs, and the loss approaches 0. FIG. 4 compares the VRD-GCN video relationship detection results with the reference results: for the same video segment, the detection results of the invention are richer and more accurate than those of the reference VidVRD method.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (9)

1. A video relation detection method based on a space-time diagram is characterized by comprising the following steps:
1) acquiring the entity characteristics of the video clip at the frame level and the entity track characteristics of the video clip;
2) respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and inputting the spliced entity characteristics and the entity track characteristics as two branches into a fully-connected space-time graph convolution network module; extracting the characteristic graph of an entity in the current segment from the output of two branches of the fully-connected space-time graph convolutional network module in an element addition mode;
3) obtaining a vector for predicting entity classification and a vector for predicting predicate distribution;
4) multiplying each vector for predicting entity classification with the vector of prediction predicate distribution, and taking the L relation instances with the highest scores in the multiplication results as the input of the association module with the twin network for each video segment; merging short-term relationship instances in the whole video by using an online association method of a twin network; acquiring an association confidence score;
5) arranging the detection results of step 4) in descending order according to the confidence scores to obtain the video relation detection result.
2. The method according to claim 1, wherein the step 1) comprises:
dividing the video into a plurality of segments, each segment comprising a plurality of frames; for each fragment, generating an entity detection frame on each frame, extracting entity characteristics, and connecting the frame-level entity frames in each fragment to generate entity track characteristics; and sorting the generated entity tracks in a descending order according to the vIoU value, and taking the first N tracks as the entity track characteristics of the segment.
3. The method for detecting video relationship based on space-time diagram according to claim 2, wherein after generating the entity track features in step 1), the method further comprises: setting a vIoU threshold value and removing the entity tracks below the threshold value.
4. The spatio-temporal graph-based video relationship detection method according to claim 1, wherein the spatio-temporal graph convolution network module in the step 2) is composed of a geometry graph convolution network and an appearance graph convolution network;
in the geometry graph convolution network, the vIoU value is used as the entry of an affine matrix, and each row of the affine matrix is then normalized by the Manhattan norm; ReLU activation and Layer-Norm are applied to the output X_g of the geometry graph convolution network, so that the input and output dimensions of the geometry graph convolution network remain consistent;
in the appearance graph convolution network, two different linear transformations are applied to the entity features input to the appearance graph convolution network, and the results are then multiplied to obtain appearance correlation values, which form an appearance correlation matrix A_a; each row is rescaled with softmax; ReLU activation and Layer-Norm are applied to the output X_a of the appearance graph convolution network, so that the input and output dimensions of the appearance graph convolution network remain consistent.
5. The spatio-temporal graph-based video relationship detection method according to claim 4, wherein the entity feature X input into the spatio-temporal graph convolution network module, the output X_g of the geometry graph convolution network and the output X_a of the appearance graph convolution network are added according to the following formula:

X′ = norm(σ(X_a + X + X_g))

and the result X′ is then subjected to ReLU activation and normalization and input into the next spatio-temporal graph convolution network module.
6. The method according to claim 1, wherein in step 3),
the feature map Z of the entities in the current segment obtained in step 2) is input into a linear transformation layer and a softmax layer to obtain the vector V^o for predicting the entity classification, according to the following formula:

V_i^o = softmax(φ_o(Z_i)), i ∈ [1, N]

where Z_i is the feature vector of the i-th row of the feature map Z; φ_o(Z_i) denotes a linear transformation applied to Z_i; the dimension of the feature map Z is (N, d); and V_i^o denotes the i-th element of the vector V^o.
7. The method according to claim 1, wherein in step 3),
every two feature vectors in the feature map Z of the entities in the current segment obtained in step 2) are paired to form a new <subject, object> feature map of dimension (N×(N−1), 2d); a relative motion feature map Z_rm is obtained; the <subject, object> feature map and the relative motion feature map Z_rm are then concatenated to generate a feature map Z′ of dimension (N×(N−1), 2d + d′), and the feature map Z′ is passed through a linear transformation layer and a sigmoid layer to generate the vector V^p of the predicted predicate distribution, according to the following formula:

V^p_{i,j} = sigmoid(φ_p([Z_i; Z_j; Z_rm(i, j)]))

where φ_p denotes a linear transformation, sigmoid() denotes the sigmoid layer, Z_i and Z_j denote the i-th and j-th row vectors of the feature map Z, Z_rm(i, j) denotes the element in the i-th row and j-th column of the relative motion feature map Z_rm, and [Z_i; Z_j; Z_rm(i, j)] denotes the end-to-end concatenation of the three.
8. The method for detecting video relationships based on a spatio-temporal graph according to claim 1, wherein in step 4), the processing procedure of the association module with the twin network is as follows:
4.1) the feature vectors of any two tracks from two adjacent segments are input into a twin network, which is an embedding network consisting of three linear transformation layers; the confidence α of the appearance similarity of the two entities is then computed by a cosine similarity function, according to the following formula:

α = cos(emb(f_i), emb(f_j))

where emb() denotes the embedding network, cos(·, ·) denotes the cosine similarity function, T_i and T_j are any two tracks in two adjacent segments, and f_i and f_j are the features of tracks T_i and T_j, respectively;
4.2) the geometric information and the appearance information are considered simultaneously: the vIoU value and the confidence α are multiplied by their corresponding weights and then added to obtain the final association confidence score s, according to the following formula:

s = w_g · vIoU(T_i, T_j) + w_a · α

where w_g and w_a are the weights of the geometric term and the appearance term, respectively;
4.3) the set of all short-term relationship instances in the segment corresponding to the current time T is

Q_T = {(c_t, <s, p, o>_t, T_s^t, T_o^t)}

where c_t is the confidence score of a short-term instance, obtained as the product of the predicted entity classification vector V^o and the predicted predicate distribution vector V^p; <s, p, o>_t is the <subject, predicate, object> triple corresponding to the short-term instance; and T_s^t and T_o^t are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the short-term instance; the set of all long-term relationship instances already detected in the segments before time T is

P = {(c, <s, p, o>, T_s, T_o)}

where c is the confidence score of a long-term instance, <s, p, o> is the <subject, predicate, object> triple of the long-term instance, and T_s and T_o are, respectively, the track of the entity corresponding to the subject and the track of the entity corresponding to the object in the long-term instance; the set Q_T is sorted in descending order of c_t and the set P in descending order of c;
then a two-layer loop is performed, in which the outer loop traverses the set Q_T and the inner loop traverses the set P; for a short-term relationship instance (c_t, <s, p, o>_t, T_s^t, T_o^t) ∈ Q_T and a long-term relationship instance (c, <s, p, o>, T_s, T_o) ∈ P, the association confidence scores of the track pairs (T_s^t, T_s) and (T_o^t, T_o) are calculated according to steps 4.1) and 4.2); only when the triples of the short-term instance and the long-term instance are identical and both association confidence scores are greater than the threshold y are the two instances merged; for a long-term relationship instance p spanning the m-th to the n-th segment, its confidence score c_p is updated with the highest score among all short-term relationship instances in p, as follows:

c_p = max(c_t), t ∈ [m, n].
9. a video relationship detection system based on a space-time diagram, comprising:
the characteristic extraction module is used for acquiring the frame-level entity characteristics of the video clips and connecting the frame-level entity frames in each clip to generate entity track characteristics;
the characteristic splicing module is used for respectively splicing the entity characteristics and the entity track characteristics of the previous segment and the current segment as well as the entity characteristics and the entity track characteristics of the current segment and the next segment, and the spliced entity characteristics and the entity track characteristics are used as two branch inputs of the fully-connected space-time graph convolution network module;
the fully-connected space-time graph convolution network module is provided with two branches and comprises a plurality of space-time graph convolution network modules; each space-time graph convolution network module consists of a geometry graph convolution network and an appearance graph convolution network; the entity characteristics input into the space-time graph convolution network module are added to the output of the geometry graph convolution network and the output of the appearance graph convolution network in that module to obtain the output result of the module, and the output result, after ReLU activation and normalization, is taken as the input of the next space-time graph convolution network module;
the characteristic graph extraction module extracts the characteristic graph of the entity in the current segment from the output of the two branches of the fully-connected space-time graph convolution network module in an element addition mode;
a first feature vector generation unit for obtaining a vector of the predicted entity classification;
a second feature vector generation unit configured to obtain a vector of the predicted predicate distribution;
the relation instance module multiplies each vector for predicting entity classification by the vector of prediction predicate distribution, and for each video segment, the L relation instances with the highest scores in the multiplication results are taken as the input of the association module with the twin network;
the association module with the twin network combines the short-term relationship examples in the whole video by using an online association method of the twin network; acquiring an association confidence score;
and the detection result output module is used for arranging the detection results of the association module with the twin network in a descending order according to the confidence score and outputting the video relation detection result.
CN202011280036.1A 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram Pending CN112347965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011280036.1A CN112347965A (en) 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011280036.1A CN112347965A (en) 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram

Publications (1)

Publication Number Publication Date
CN112347965A true CN112347965A (en) 2021-02-09

Family

ID=74362926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011280036.1A Pending CN112347965A (en) 2020-11-16 2020-11-16 Video relation detection method and system based on space-time diagram

Country Status (1)

Country Link
CN (1) CN112347965A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125406A (en) * 2019-12-23 2020-05-08 天津大学 Visual relation detection method based on self-adaptive cluster learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUFENG QIAN et al.: "Video Relation Detection with Spatio-Temporal Graph", Session 1A: Multimodal Fusion & Visual Relations *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883868A (en) * 2021-02-10 2021-06-01 中国科学技术大学 Training method of weak surveillance video motion positioning model based on relational modeling
CN112883868B (en) * 2021-02-10 2022-07-15 中国科学技术大学 Training method of weak supervision video motion positioning model based on relational modeling
CN113569559A (en) * 2021-07-23 2021-10-29 北京智慧星光信息技术有限公司 Short text entity emotion analysis method and system, electronic equipment and storage medium
CN113569559B (en) * 2021-07-23 2024-02-02 北京智慧星光信息技术有限公司 Short text entity emotion analysis method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Alani et al. Hand gesture recognition using an adapted convolutional neural network with data augmentation
Heo et al. Deepfake detection algorithm based on improved vision transformer
CN106897738A (en) A kind of pedestrian detection method based on semi-supervised learning
Fang et al. Orthogonal self-guided similarity preserving projection for classification and clustering
Reddy et al. AdaCrowd: Unlabeled scene adaptation for crowd counting
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network
CN116363738A (en) Face recognition method, system and storage medium based on multiple moving targets
CN111144220B (en) Personnel detection method, device, equipment and medium suitable for big data
CN112347965A (en) Video relation detection method and system based on space-time diagram
CN111462184B (en) Online sparse prototype tracking method based on twin neural network linear representation model
Samadiani et al. A multiple feature fusion framework for video emotion recognition in the wild
CN115223239B (en) Gesture recognition method, gesture recognition system, computer equipment and readable storage medium
Simao et al. Improving novelty detection with generative adversarial networks on hand gesture data
Yao et al. Recurrent graph convolutional autoencoder for unsupervised skeleton-based action recognition
CN112927266A (en) Weak supervision time domain action positioning method and system based on uncertainty guide training
CN112668438A (en) Infrared video time sequence behavior positioning method, device, equipment and storage medium
Qiao et al. HyperSOR: Context-aware graph hypernetwork for salient object ranking
Yang et al. A feature learning approach for face recognition with robustness to noisy label based on top-N prediction
Sun et al. Dual GroupGAN: An unsupervised four-competitor (2V2) approach for video anomaly detection
CN111582057B (en) Face verification method based on local receptive field
Negi et al. End-to-end residual learning-based deep neural network model deployment for human activity recognition
Pryor et al. Deepfake detection analyzing hybrid dataset utilizing CNN and SVM
CN116311345A (en) Transformer-based pedestrian shielding re-recognition method
CN113869193B (en) Training method of pedestrian re-recognition model, pedestrian re-recognition method and system
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210209