CN115497164A - Multi-view framework sequence fusion method based on graph convolution

Info

Publication number: CN115497164A
Application number: CN202211157830.6A
Authority: CN (China)
Prior art keywords: graph, view, branch, fusion, space
Legal status: Pending (current)
Other languages: Chinese (zh)
Inventors: 冯伟, 孟繁博
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University
Priority to CN202211157830.6A
Publication of CN115497164A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements using pattern recognition or machine learning
    • G06V 10/764 Arrangements using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/803 Fusion of input or preprocessed data
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Arrangements using neural networks
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition


Abstract

The invention relates to a multi-view skeleton sequence fusion method based on graph convolution, which comprises the following steps: performing data enhancement and adjusting the data into multi-view skeleton sequences as the input of a multi-branch network; extracting the time-domain graph integration features of each view at each branch using a spatio-temporal graph convolutional network, extracting partial end points and connection points from the human skeleton as joint points, and splitting along the spatial dimension to obtain the time-domain graph integration representation of each joint point; constructing a multi-view fusion graph by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space; performing graph convolution according to the multi-view fusion graph and fusing the features to obtain the multi-view fused graph integration feature; and performing multi-branch joint prediction.

Description

Multi-view framework sequence fusion method based on graph convolution
Technical Field
The invention belongs to the field of artificial intelligence and computer vision, relates to a feature fusion technology, and particularly relates to a multi-view skeleton sequence fusion method based on graph convolution.
Background
Multiple cameras can capture the same action performer simultaneously from different viewpoints, thereby providing complementary information for many important visual tasks (e.g., human-computer interaction). In this setting, an important problem is multi-view complementation, which aims to use a multi-camera system to compensate for the occlusions and missing data that may occur at a single view.
The background art related to the invention is as follows:
(1) Adaptive graph convolutional networks (reference [1]): most existing work uses a predefined graph for graph convolution. However, the natural connection relations of the human body are not necessarily the most suitable edges for the behavior recognition task; moreover, since a neural network is layered, the optimal graph for each graph convolution layer is not necessarily the same. The invention therefore performs adaptive graph convolution by combining the natural topology of the human body, the correspondence of joint points between views, and the similarity of the joint points' features in the integration space.
(2) Feature fusion (reference [4]): most existing feature fusion methods follow the same procedure: the features to be fused are first preprocessed to unify their feature spaces, and a generic fusion strategy such as max fusion, average fusion, or bilinear pooling is then applied. Although this flow is versatile, it ignores the spatial information of the skeleton sequence. Therefore, while the decision layer continues to use the traditional flow, the invention introduces a graph convolutional network in the intermediate layers that fuses with spatial, i.e. joint-level, information.
(3) Multi-objective optimization (reference [2]): most multi-view action recognition methods adopt a multi-branch architecture, where each branch takes the skeleton sequence under a given view as input and the loss functions of the different branches are jointly optimized. Grid search aims to find a fixed optimal weight for each loss, but it transfers poorly. Another widely used strategy is to weight the losses automatically, with the weight of each loss learned jointly; however, this is task-agnostic and better suited to losses of different scales. Furthermore, data loss is common for certain views, which can lead to an imbalance between branches. The invention therefore proposes a multi-view data enhancement method and a deviation-based weighted multi-loss function to solve these problems.
(4) Similarity computation: in machine learning, the similarity between two targets is often evaluated by measuring the distance between samples. Common similarity measures include the Euclidean distance, cosine similarity, Hamming distance, and Manhattan distance. The invention uses the Euclidean distance as the similarity measure between integration-space features and thereby constructs a similarity adjacency matrix, as in the sketch below.
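As a non-limiting illustration of this computation, the following sketch (PyTorch is assumed, and the function and variable names are illustrative rather than taken from the patent) builds a similarity adjacency matrix from pairwise Euclidean distances between the integration-space features of the joint points; the Gaussian kernel used to turn distances into connection strengths is one possible choice among several.

    import torch

    def similarity_adjacency(features: torch.Tensor) -> torch.Tensor:
        """Build a similarity adjacency matrix from per-joint features.

        features: tensor of shape (c, v), i.e. c feature channels for each of
        the v joint nodes. Returns a (v, v) matrix whose entries grow as the
        Euclidean distance between two joints' integration-space features shrinks.
        """
        # Pairwise Euclidean distances between the v column vectors.
        dist = torch.cdist(features.t(), features.t(), p=2)      # (v, v)
        # Distance -> similarity via a Gaussian kernel (an assumed choice).
        sigma = dist.mean() + 1e-6
        sim = torch.exp(-dist ** 2 / (2 * sigma ** 2))
        # Row-normalise so each node's outgoing weights sum to 1.
        return sim / sim.sum(dim=1, keepdim=True)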
References
[1] Shi L, Zhang Y, Cheng J, et al. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 2020, 29: 9532-9545.
[2] Kendall A, Gal Y, Cipolla R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7482-7491.
[3] Zhang L, Shi Z, Cheng M M, Liu Y, Bian J W, Zhou J T, Zheng G, Zeng Z. Nonlinear regression via deep negative correlation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1933-1941.
[5] Azis N A, Jeong Y S, Choi H J, et al. Weighted averaging fusion for multi-view skeletal data and its application in action recognition. IET Computer Vision, 2016, 10(2): 134-142.
Disclosure of Invention
The invention aims to provide a multi-view skeleton sequence fusion method capable of improving recognition accuracy. The technical scheme is as follows:
A multi-view skeleton sequence fusion method based on graph convolution comprises the following steps:
Step one, performing data enhancement and adjusting the data into multi-view skeleton sequences as the input of a multi-branch network:
(1) Combining the skeleton sequences of a single action sequence under all views into a multi-view skeleton sequence;
(2) Rearranging the obtained multi-view skeleton sequence at the view level;
(3) Using the view-level rearranged multi-view skeleton sequences as the input of the multi-branch network, so that each branch receives all views of a sample at different times.
Step two, extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network:
(1) Selecting a spatio-temporal graph convolutional network with the same structure for each branch;
(2) Extracting the time-domain graph integration feature X of the input view at each branch using the selected spatio-temporal graph convolutional network, namely: inputting at each branch the view-level rearranged multi-view skeleton sequence, a tensor in R^(c_1 × t × v), wherein c_1 is the number of channels of the skeleton sequence, t is the number of frames of the skeleton sequence, i.e. the time dimension, and v is the number of joint points of the skeleton sequence, i.e. the spatial dimension; averaging over the time dimension while keeping the spatial dimension, and extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view, which form the time-domain graph integration features of the whole sequence, with X ∈ R^(c_2 × v), wherein c_2 is the number of channels of the time-domain graph integration feature X;
(3) Extracting partial end points and connection points from the human skeleton as joint points, and splitting along the spatial dimension to obtain the time-domain graph integration representation X_{:,j} of each joint point.
Step three, constructing a multi-view fusion graph M by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space:
(1) The connections between the joint points form the natural topological relation of the human body; an adjacency matrix A is constructed by combining the natural topology of the human body and the correspondence of joint points between views;
(2) A Laplace transform is applied to the adjacency matrix A to obtain the natural connection graph N;
(3) A corresponding similarity graph R is constructed from the extracted graph integration representation of each joint point;
(4) The natural connection graph N and the similarity graph R are stored in the form of adjacency matrices, whose element values represent the strength of the connection between nodes; matrix-level normalization is applied to N and R respectively, the elements at corresponding positions are then summed with weights, and the weighted fusion of N and R yields the multi-view fusion graph M.
Step four, performing graph convolution according to the multi-view fusion graph M and fusing the features X_1, X_2, ..., X_n to obtain the multi-view fused graph integration feature, namely:
(1) Splicing the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view along the spatial dimension to construct a feature vector X_{n+1} containing the joint points under all views;
(2) Inputting the feature vector X_{n+1} containing the joint points under all views, together with the multi-view fusion graph M, into a graph convolutional network to perform space-view graph convolution, obtaining the multi-view fused graph integration feature Y_{n+1}.
Step five, performing multi-branch joint prediction, namely:
(1) Applying global average pooling over the spatial dimension to the obtained multi-view fused graph integration feature Y_{n+1} and to the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view respectively, and feeding them into n+1 mutually independent fully-connected layers to obtain the preliminary classification predictions ŷ_1, ŷ_2, ..., ŷ_{n+1} ∈ R^class, wherein class is the number of classes of the sample;
(2) Mapping the data elements of the obtained preliminary classification predictions into (0, 1) using a sigmoid function;
(3) Performing decision-layer fusion on the mapped preliminary classification predictions by taking the maximum value as the final prediction result.
Step six, if the loss function corresponding to the end-to-end network framework has not converged, repeating steps two to five.
Further, the loss function corresponding to the end-to-end network framework is specifically:
(1) Using a cross entropy loss function to constrain the difference between the prediction result and the true value of each branch, taken as the loss function of that branch;
(2) Since the training samples of each branch are the same, the commonly used automatic loss weighting is not applicable; the reciprocal of each branch's prediction variance is taken as the weight of that branch's loss function;
(3) The loss function of the whole network is the weighted sum of the loss functions of all branches.
The technical scheme provided by the invention has the following beneficial effects:
1. When fusing multi-view inputs for skeleton-based behavior recognition, the method starts from spatio-temporal graph convolution and proposes a space-view graph convolution that exploits the correspondence of joint points under different views in the graph integration space. Compared with traditional fusion modes this is more intuitive, and the multi-view fusion features obtained in this way give better prediction results. In addition, the spatio-temporal graph convolutional network used to extract the graph integration can be chosen freely, which greatly enhances the flexibility of the model.
2. The invention provides a multi-view data enhancement strategy and a matching loss function, which improve the overall accuracy of the model while solving the branch-input imbalance problem of traditional multi-branch models.
3. The invention provides an end-to-end trainable multi-branch graph neural network model, which fuses the features of all views with a graph model and solves the problem of skeleton-based multi-view behavior recognition.
Drawings
FIG. 1: the natural topological relation of the human body (left) and the correspondence of joint points between views (right);
FIG. 2: example of the skeleton-based multi-view behavior recognition problem;
FIG. 3: overall diagram of the skeleton-based multi-view behavior recognition method;
FIG. 4: flow chart of the skeleton-based multi-view behavior recognition method;
FIG. 5: details of the multi-view graph convolution.
Detailed Description
Skeleton-based behavior recognition plays an important role in many applications of computer vision. The invention studies behavior recognition by fusing the skeleton sequences captured simultaneously from multiple views by different cameras, i.e. a multi-view skeleton sequence fusion method, and proposes a novel fusion method based on a graph convolutional network to solve this problem. First, data enhancement is performed, and the view-level full permutation of the multi-view skeleton sequence is used as the input of the model. A spatio-temporal graph convolutional network extracts, under each single view, the time-domain graph integration feature of each joint point, compressing the time dimension while preserving the spatial dimension, i.e. the joint points. A natural connection graph is then built from the natural topology of the human body, a correspondence graph is computed from the integration features according to the correspondence of joint points between views, and multi-view graph convolution is performed by combining the natural connection graph and the integration-based relation graph to obtain the multi-view fused graph integration feature, which is jointly optimized with the graph integration features of the single views for prediction. In view of the characteristics of this task, the invention also proposes a data-driven, deviation-based loss function to improve the training and prediction performance of the model. The fusion method improves the prediction accuracy of existing skeleton-based action recognition methods and achieves good cross-domain performance.
According to the multi-view skeleton sequence fusion method based on graph convolution, recognition accuracy is improved by fusing, at different stages, the representations of the same action sequence under multiple views; as shown in FIG. 2, skeleton sequences from different views are fused and used jointly for behavior recognition.
The invention models the skeleton-based multi-view action recognition problem with an arbitrary number of views as a multi-objective optimization problem and, combined with a multi-branch network, proposes a novel fusion method based on graph convolution to solve it; the flow is shown in FIG. 4. The overall structure of the model is shown in FIG. 3; for simplicity only the two-view case is drawn, and cases with more views can be extended on this basis. The first half of the network extracts the time-domain graph integration features of each view through a spatio-temporal graph convolutional network; the second half constructs an adaptive graph convolution by combining the natural topology of the human body, the correspondence of joint points between views, and the similarity of the joint points in the integration space, obtains the multi-view fused graph integration feature, and optimizes it jointly with the single-view graph integration features, thereby better solving the behavior recognition problem based on multi-view skeleton sequences. The invention can be applied to action sequences captured by multiple cameras in the same scene and can fuse the skeleton sequences under different views at multiple stages without restricting the number of views, i.e. multi-view skeleton sequence fusion. An overall sketch of this pipeline follows.
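To make the overall wiring easier to follow, a highly simplified sketch of the multi-branch pipeline is given below. It is an illustration under stated assumptions, not the patented implementation: the backbone constructor, the space-view graph convolution module, and all names (MultiViewFusionNet, backbone_fn, fusion_gcn) are hypothetical placeholders for the components described in the following sections.

    import torch
    import torch.nn as nn

    class MultiViewFusionNet(nn.Module):
        """Sketch: n single-view branches plus one fusion branch that applies a
        space-view graph convolution to the spliced per-joint features."""

        def __init__(self, n_views, backbone_fn, fusion_gcn, feat_dim, n_classes):
            super().__init__()
            # One spatio-temporal GCN backbone per view (identical structure).
            self.backbones = nn.ModuleList([backbone_fn() for _ in range(n_views)])
            self.fusion_gcn = fusion_gcn    # space-view graph convolution module
            self.heads = nn.ModuleList(
                [nn.Linear(feat_dim, n_classes) for _ in range(n_views + 1)])

        def forward(self, views, fusion_graph):
            # views: list of n tensors, each (batch, c1, t, v); one skeleton
            # sequence per view. fusion_graph: (n*v, n*v) multi-view fusion graph M.
            feats = []
            for x, branch in zip(views, self.backbones):
                f = branch(x)                               # assumed (batch, c2, t', v)
                feats.append(f.mean(dim=2))                 # average over time -> (batch, c2, v)
            x_spliced = torch.cat(feats, dim=-1)            # joints of all views side by side
            y_fused = self.fusion_gcn(x_spliced, fusion_graph)   # (batch, c2, n*v)
            logits = [head(f.mean(dim=-1))                  # global average pool over joints
                      for head, f in zip(self.heads, feats + [y_fused])]
            probs = [torch.sigmoid(z) for z in logits]      # map into (0, 1)
            return torch.stack(probs).max(dim=0).values     # decision-level max fusion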
The technical scheme of the invention is given below; the flow chart is shown in FIG. 4.
Step s1, performing data enhancement on the single-view skeleton sequences to form the input of the multi-branch network.
Step s2, extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network.
Step s3, constructing a multi-view fusion graph M by combining the topology of the human body, the correspondence of joint points between views, and the similarity of the joint points in the integration space.
Step s4, performing multi-view graph convolution according to the multi-view fusion graph M to obtain the multi-view fused graph integration feature Y_{n+1}.
Step s5, performing joint prediction by combining the single-view features X_1, X_2, ..., X_n with the multi-view fused feature Y_{n+1}.
Step s6, if the loss function corresponding to the end-to-end network framework has not converged, repeating steps s2 to s5.
The specific implementation steps of the multi-view skeleton sequence fusion method based on graph convolution are as follows:
(I) Data enhancement:
Before training, the single-view skeleton sequences are first adjusted into multi-view skeleton sequences. The specific steps are as follows:
(1) Combining the skeleton sequences of a single action sequence under all views into a multi-view skeleton sequence.
(2) Rearranging the obtained multi-view skeleton sequences at the view level.
Description 1: View-level rearrangement
The data used by the invention comprise skeleton sequences of the same action sequence captured simultaneously from multiple views; on the premise of keeping the original temporal order and spatial order of the skeleton sequences unchanged, the branch into which each single-view skeleton sequence is fed is changed.
(3) The full permutation of the multi-view skeleton sequence in the view dimension is used as the input of the network, that is, each branch receives all views of a sample at different times as input, so as to balance the inputs of the branches; a sketch of this rearrangement follows.
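A minimal sketch of this view-level rearrangement follows, assuming each sample is stored as a list of per-view skeleton sequences; the function name and data layout are assumptions made for illustration.

    from itertools import permutations

    def view_level_rearrangements(view_sequences):
        """view_sequences: list of n skeleton sequences, one per camera view,
        each of shape (c1, t, v). Returns every view-level ordering of the
        multi-view sample; feeding the branches with all orderings over the
        course of training lets every branch see every view."""
        return [list(order) for order in permutations(view_sequences)]

    # Usage sketch: for a two-view sample [seq_a, seq_b] this yields
    # [[seq_a, seq_b], [seq_b, seq_a]]; branch 1 receives view A in the first
    # arrangement and view B in the second, which balances the branch inputs.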
(II) Extracting the time-domain graph integration features of a single view: extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network.
The specific method for extracting the time-domain graph integration features is as follows:
(1) Selecting a spatio-temporal graph convolutional network with the same structure for each branch.
Description 2: Selection of the spatio-temporal graph convolutional network
The spatio-temporal graph convolutional network can be regarded as a feature extraction network; the current mainstream approach in skeleton-based behavior recognition is the spatio-temporal graph convolutional neural network, which regards the skeleton sequence as a spatio-temporal graph containing temporal and spatial connections and alternately performs convolutions along the time dimension and the spatial dimension. Any graph convolutional neural network of this kind can be used as the spatio-temporal graph convolutional network, and the retained dimensions and their sizes can be chosen freely.
(2) Inputting the single-view skeleton sequence, a tensor in R^(c_1 × t × v), at each branch using the selected spatio-temporal graph convolutional network, wherein c_1 is the number of channels of the skeleton sequence (in practice the two- or three-dimensional coordinates of the joint points), t is the number of frames of the skeleton sequence, i.e. the time dimension, and v is the number of joint points of the skeleton sequence, i.e. the spatial dimension. Averaging over the time dimension while keeping the spatial dimension, the time-domain graph integration feature of the whole sequence X ∈ R^(c_2 × v) is extracted, wherein c_2 is the number of channels of the time-domain graph integration feature X.
(3) Splitting along the spatial dimension, the time-domain graph integration representation X_{:,j} of each joint point is obtained.
Description 3: Graph integration representation of a joint point
The spatial dimension of the features extracted by the spatio-temporal graph convolutional network is consistent with the specification of the original skeleton sequence, while the time dimension is compressed; the feature vector extracted at a given position j can therefore be regarded as the temporal graph integration of that joint point, as in the sketch below.
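The temporal compression and the per-joint slicing can be sketched as follows, assuming the backbone output is a tensor of shape (c2, t', v); the names are illustrative only.

    import torch

    def graph_integration(features: torch.Tensor) -> torch.Tensor:
        """features: backbone output of shape (c2, t', v). Averages over the
        time dimension while keeping the spatial dimension, giving the
        time-domain graph integration feature X of shape (c2, v)."""
        return features.mean(dim=1)

    def joint_representation(X: torch.Tensor, j: int) -> torch.Tensor:
        """Slice along the spatial dimension: X[:, j] is the time-domain graph
        integration representation of joint j."""
        return X[:, j]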
(III) Acquiring the multi-view fusion graph M: constructing the multi-view fusion graph M by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space.
The specific method for computing the multi-view fusion graph M from the natural connection relations and the joint-point similarities is shown in FIG. 5:
(1) Partial end points and connection points are extracted from the human skeleton as joint points; the connections between the joint points form the natural topological relation of the human body, and the same joint points under different views also have a correspondence relation. An adjacency matrix A is constructed by combining the natural topology of the human body and the correspondence of joint points between views.
(2) A Laplace transform is applied to the adjacency matrix A to obtain the natural connection graph N.
(3) A similarity graph R between the joint points is constructed from the extracted graph integration representation of each joint point.
(4) The natural connection graph N and the similarity graph R are normalized and then fused with weights to obtain the multi-view fusion graph M.
Description 4: Weighted fusion of the graphs
As in reference [1], the natural connection graph N and the similarity graph R are stored in the form of adjacency matrices, whose element values represent the strength of the connection between nodes; matrix-level normalization is applied to N and R respectively, and the elements at corresponding positions are then summed with weights. The weights are trainable parameters shared over the whole matrix: N and R each share one weight over the entire graph, and the two weights sum to 1, which gives the graph greater flexibility. A sketch of this construction follows.
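Under the same assumptions as the earlier sketches (hypothetical names; the normalization and the distance-to-similarity mapping are assumed choices, not prescribed by the patent), the fusion graph M could be assembled roughly as follows: a Laplacian-style normalization of the adjacency matrix A gives the natural connection graph N, a similarity graph R is computed from the integration-space features, and a single trainable weight mixes the two.

    import torch
    import torch.nn as nn

    def laplacian_normalize(A: torch.Tensor) -> torch.Tensor:
        """Symmetric Laplacian-style normalization D^(-1/2) (A + I) D^(-1/2)
        of the (nv x nv) adjacency matrix A."""
        A = A + torch.eye(A.size(0), device=A.device)     # keep self-connections
        d_inv_sqrt = torch.diag(A.sum(dim=1).clamp(min=1e-6).pow(-0.5))
        return d_inv_sqrt @ A @ d_inv_sqrt

    class FusionGraph(nn.Module):
        """Weighted fusion of the natural connection graph N and the similarity
        graph R, with one trainable mixing weight shared over the whole matrix."""

        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(0.0))  # mixing weight (pre-sigmoid)

        def forward(self, A: torch.Tensor, X_all: torch.Tensor) -> torch.Tensor:
            # A: (nv, nv) adjacency from body topology + cross-view correspondence.
            # X_all: (c2, nv) spliced per-joint integration features of all views.
            N = laplacian_normalize(A)
            dist = torch.cdist(X_all.t(), X_all.t())      # Euclidean distances
            R = torch.softmax(-dist, dim=1)               # distance -> row-normalized similarity
            a = torch.sigmoid(self.alpha)                 # both weights in (0, 1), summing to 1
            return a * N + (1 - a) * R                    # multi-view fusion graph M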
(IV) Fusing the single-view features: performing graph convolution according to the multi-view fusion graph M and fusing the features X_1, X_2, ..., X_n to obtain the multi-view fused graph integration feature Y_{n+1}.
The method for fusing the single-view graph integration features is as follows:
(1) The extracted integration features X_1, X_2, ..., X_n are spliced along the spatial dimension to construct a feature vector X_{n+1} containing the joint points under all views.
(2) To suppress the over-smoothing tendency of the graph, the output of (1), i.e. the feature vector X_{n+1} containing the joint points under all views, together with the multi-view fusion graph M, is input into a graph convolution block, and space-view graph convolution is performed using the same spatio-temporal graph convolutional network structure used for feature extraction, obtaining the multi-view fused graph integration feature Y_{n+1}.
Description 5: Space-view graph convolution
After the time dimension is compressed, its position is taken by the view dimension. The original skeleton sequence can be regarded as a complete spatio-temporal graph; here, the data can instead be regarded as a space-view graph, the difference being that the spatial coordinates corresponding to the joint points of the original graph are replaced by the graph integration representation. The space-view graph convolution can be performed in the same way as a spatio-temporal graph convolution; since the number of nodes is small at this point, the whole graph can be convolved directly, as in the sketch below.
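A minimal space-view graph convolution layer consistent with this description might look like the sketch below (an assumption, not the patented layer): the fused graph M aggregates information across the joint nodes of all views, after which a 1x1 convolution mixes the channels.

    import torch
    import torch.nn as nn

    class SpaceViewGraphConv(nn.Module):
        """One graph convolution over the multi-view fusion graph M.
        Input X: (batch, c_in, n*v) spliced joint features of all views.
        M:       (n*v, n*v) fusion graph. Output: (batch, c_out, n*v)."""

        def __init__(self, c_in: int, c_out: int):
            super().__init__()
            self.proj = nn.Conv1d(c_in, c_out, kernel_size=1)  # per-node channel mixing
            self.relu = nn.ReLU(inplace=True)

        def forward(self, X: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
            X = torch.einsum('bcu,uw->bcw', X, M)   # aggregate neighbours through M
            return self.relu(self.proj(X))

    # Usage sketch: Y_fused = SpaceViewGraphConv(c2, c2)(X_spliced, M)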
(V) Multi-branch joint prediction and optimization
The joint prediction and optimization method for the single-view branches and the fused multi-view branch is as follows:
(1) The obtained multi-view fused graph integration feature Y_{n+1} and the extracted integration features X_1, X_2, ..., X_n are each subjected to global average pooling over the spatial dimension and fed into n+1 mutually independent fully-connected layers to obtain the preliminary classification predictions ŷ_1, ŷ_2, ..., ŷ_{n+1} ∈ R^class, where class is the number of classes of the sample.
(2) The data elements of the obtained preliminary classification predictions are mapped into (0, 1) using the sigmoid function.
(3) Decision-layer fusion is performed on the mapped preliminary classification predictions by taking the maximum value as the final prediction result.
Description 6: Decision-level fusion
According to the processing stage, the fusion of multi-branch models can be divided into data fusion, feature fusion, model fusion, and decision-layer fusion (reference [5]); the invention adopts decision-layer fusion.
The prediction results of the branches belong to the same feature space, so several traditional fusion modes could be adopted; reference [4] compares traditional fusion modes at different levels of the model.
The invention adopts maximum-value fusion and takes, for each class, the maximum of the branches' predicted values for that class as the output; a sketch of this fusion follows.
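A sketch of this decision-level fusion under the assumptions above: each branch's preliminary prediction is mapped into (0, 1) with a sigmoid and, per class, the maximum over the branches is taken as the final score.

    import torch

    def decision_level_max_fusion(branch_logits):
        """branch_logits: list of n+1 tensors, each of shape (batch, n_classes).
        Returns the fused (batch, n_classes) scores and the predicted class index."""
        probs = torch.stack([torch.sigmoid(z) for z in branch_logits])  # (n+1, batch, n_classes)
        fused = probs.max(dim=0).values        # per-class maximum over the branches
        return fused, fused.argmax(dim=1)      # final prediction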
Description 7: Loss function construction
Step 1: single-branch loss function. Considering that step (I) rearranges the input data, the branches are trained on the same data in different orders. A conventional cross entropy loss function is therefore employed:
L = -(y·log(ŷ) + (1-y)·log(1-ŷ))
where ŷ represents the output of the network, with values in (0, 1), and y represents the true value, which can only take 0 or 1.
Step 2: deviation-based loss function. Reference [3] shows that the generalization error of an ensemble model depends on the covariance between its sub-models, provided that its deviation is smaller than that of the sub-models. Since each branch can predict independently, the multi-branch model can be viewed as an ensemble model that integrates the individual branches. Although training on the same data reduces the covariance between branches, it may also trap the ensemble model in a local optimum, so the invention constructs a deviation-based loss function:
L_total = Σ_{i=1}^{m+1} w_i · L_i
Assume an input of m views, where w_i is the weight of branch i and the branch weights sum to 1. When i ∈ [1, m], ŷ_i is the prediction result of a single-view branch; when i = m+1, ŷ_i is the prediction result of the multi-view fusion branch; y represents the true value. The weight w_i is inversely related to the deviation of each branch, and a variance loss is used to measure that deviation, so that
w_i = (1/Var_i) / Σ_{j=1}^{m+1} (1/Var_j)
In this way, the branches with a higher degree of fitting obtain a larger effective learning rate while the changes in the other branches are effectively restrained. On the premise of ensuring basic accuracy, the deviation-based loss function makes it easier for each branch to converge to the global optimum, thereby preventing overfitting of the model and finally improving its overall accuracy. A sketch of this weighting follows.
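A sketch of this deviation-based weighting follows. It assumes the weight of each branch is the normalized reciprocal of the variance of its prediction error; the exact deviation measure is not spelled out in the text above, so that choice is an assumption, as are all names.

    import torch
    import torch.nn.functional as F

    def deviation_weighted_loss(branch_probs, target):
        """branch_probs: list of m+1 tensors, each (batch, n_classes), already in (0, 1).
        target: (batch,) integer class labels.
        Each branch's weight is proportional to the reciprocal of the variance of its
        prediction error, and the total loss is the weighted sum of the per-branch
        cross entropy losses (weights sum to 1)."""
        one_hot = F.one_hot(target, branch_probs[0].size(1)).float()
        losses, inv_vars = [], []
        for p in branch_probs:
            losses.append(F.binary_cross_entropy(p, one_hot))              # per-branch loss
            inv_vars.append(1.0 / ((p - one_hot).var().detach() + 1e-6))   # deviation -> weight
        w = torch.stack(inv_vars)
        w = w / w.sum()                        # normalize so the weights sum to 1
        return (w * torch.stack(losses)).sum()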
(VI) Judging whether the model has finished training
The specific method for judging whether the model has finished training is as follows:
During the training of the neural network, whether training can be stopped is judged from the value of the loss function; training may be stopped when the loss function has dropped and remains essentially unchanged, as in the sketch below.
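One simple, assumed way to implement this stopping criterion is to compare the average loss over two consecutive windows of epochs and stop when the improvement falls below a small tolerance:

    def should_stop(loss_history, window=10, tol=1e-4):
        """Stop when the average improvement over the last `window` epochs is
        smaller than `tol`, i.e. the loss curve has become essentially flat."""
        if len(loss_history) < 2 * window:
            return False
        prev = sum(loss_history[-2 * window:-window]) / window
        curr = sum(loss_history[-window:]) / window
        return (prev - curr) < tol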

Claims (2)

1. A multi-view skeleton sequence fusion method based on graph convolution, comprising the following steps:
Step one, performing data enhancement and adjusting the data into multi-view skeleton sequences as the input of a multi-branch network:
(1) Combining the skeleton sequences of a single action sequence under all views into a multi-view skeleton sequence;
(2) Rearranging the obtained multi-view skeleton sequence at the view level;
(3) Using the view-level rearranged multi-view skeleton sequences as the input of the multi-branch network, so that each branch receives all views of a sample at different times;
Step two, extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network:
(1) Selecting a spatio-temporal graph convolutional network with the same structure for each branch;
(2) Extracting the time-domain graph integration feature X of the input view at each branch using the selected spatio-temporal graph convolutional network, namely: inputting at each branch the view-level rearranged multi-view skeleton sequence, a tensor in R^(c_1 × t × v), wherein c_1 is the number of channels of the skeleton sequence, t is the number of frames of the skeleton sequence, i.e. the time dimension, and v is the number of joint points of the skeleton sequence, i.e. the spatial dimension; averaging over the time dimension while keeping the spatial dimension, and extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view, which form the time-domain graph integration features of the whole sequence, with X ∈ R^(c_2 × v), wherein c_2 is the number of channels of the time-domain graph integration feature X;
(3) Extracting partial end points and connection points from the human skeleton as joint points, and splitting along the spatial dimension to obtain the time-domain graph integration representation of each joint point;
Step three, constructing a multi-view fusion graph M by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space:
(1) The connections between the joint points form the natural topological relation of the human body; an adjacency matrix A is constructed by combining the natural topology of the human body and the correspondence of joint points between views;
(2) A Laplace transform is applied to the adjacency matrix A to obtain the natural connection graph N;
(3) A corresponding similarity graph R is constructed from the extracted graph integration representation of each joint point;
(4) The natural connection graph N and the similarity graph R are stored in the form of adjacency matrices, whose element values represent the strength of the connection between nodes; matrix-level normalization is applied to N and R respectively, the elements at corresponding positions are then summed with weights, and the weighted fusion of N and R yields the multi-view fusion graph M;
Step four, performing graph convolution according to the multi-view fusion graph M and fusing the features X_1, X_2, ..., X_n to obtain the multi-view fused graph integration feature, namely:
(1) Splicing the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view along the spatial dimension to construct a feature vector X_{n+1} containing the joint points under all views;
(2) Inputting the feature vector X_{n+1} containing the joint points under all views, together with the multi-view fusion graph M, into a graph convolutional network to perform space-view graph convolution, obtaining the multi-view fused graph integration feature Y_{n+1};
Step five, performing multi-branch joint prediction, namely:
(1) Applying global average pooling over the spatial dimension to the obtained multi-view fused graph integration feature Y_{n+1} and to the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view respectively, and feeding them into n+1 mutually independent fully-connected layers to obtain the preliminary classification predictions ŷ_1, ŷ_2, ..., ŷ_{n+1} ∈ R^class, wherein class is the number of classes of the sample;
(2) Mapping the data elements of the obtained preliminary classification predictions into (0, 1) using a sigmoid function;
(3) Performing decision-layer fusion on the mapped preliminary classification predictions by taking the maximum value as the final prediction result;
Step six, if the loss function corresponding to the end-to-end network framework has not converged, repeating steps two to five.
2. The multi-view skeleton sequence fusion method of claim 1, wherein the loss function corresponding to the end-to-end network framework is specifically:
(1) Using a cross entropy loss function to constrain the difference between the prediction result and the true value of each branch, taken as the loss function of that branch;
(2) Since the training samples of each branch are the same, the commonly used automatic loss weighting is not applicable; the reciprocal of each branch's prediction variance is taken as the weight of that branch's loss function;
(3) The loss function of the whole network is the weighted sum of the loss functions of all branches.
CN202211157830.6A (priority and filing date 2022-09-22) Multi-view framework sequence fusion method based on graph convolution; status: Pending; publication: CN115497164A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211157830.6A CN115497164A (en) 2022-09-22 2022-09-22 Multi-view framework sequence fusion method based on graph convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211157830.6A CN115497164A (en) 2022-09-22 2022-09-22 Multi-view framework sequence fusion method based on graph convolution

Publications (1)

Publication Number: CN115497164A (en)
Publication Date: 2022-12-20

Family

ID=84470862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211157830.6A Pending CN115497164A (en) 2022-09-22 2022-09-22 Multi-view framework sequence fusion method based on graph convolution

Country Status (1)

Country Link
CN (1) CN115497164A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524601A (en) * 2023-06-21 2023-08-01 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN116524601B (en) * 2023-06-21 2023-09-12 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot

Similar Documents

Publication Publication Date Title
WO2022036777A1 (en) Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111291809B (en) Processing device, method and storage medium
CN109753903A (en) A kind of unmanned plane detection method based on deep learning
Huang et al. Immature apple detection method based on improved Yolov3
CN112990211A (en) Neural network training method, image processing method and device
CN112560918B (en) Dish identification method based on improved YOLO v3
CN108171249B (en) RGBD data-based local descriptor learning method
CN111881840B (en) Multi-target tracking method based on graph network
CN110443849B (en) Target positioning method for double-current convolution neural network regression learning based on depth image
CN110222718A (en) The method and device of image procossing
CN108764244A (en) Potential target method for detecting area based on convolutional neural networks and condition random field
CN114998525A (en) Action identification method based on dynamic local-global graph convolutional neural network
CN115937774A (en) Security inspection contraband detection method based on feature fusion and semantic interaction
CN115359366A (en) Remote sensing image target detection method based on parameter optimization
CN115497164A (en) Multi-view framework sequence fusion method based on graph convolution
CN115561834A (en) Meteorological short-term and temporary forecasting all-in-one machine based on artificial intelligence
CN114724021B (en) Data identification method and device, storage medium and electronic device
CN115187786A (en) Rotation-based CenterNet2 target detection method
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN111914938A (en) Image attribute classification and identification method based on full convolution two-branch network
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN112508863B (en) Target detection method based on RGB image and MSR image double channels
CN113609904A (en) Single-target tracking algorithm based on dynamic global information modeling and twin network

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination