CN115497164A - Multi-view framework sequence fusion method based on graph convolution

Info

Publication number: CN115497164A
Application number: CN202211157830.6A
Authority: CN (China)
Prior art keywords: graph, view, branch, fusion, space
Legal status: Pending (current)
Other languages: Chinese (zh)
Inventors: 冯伟, 孟繁博
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University
Priority to CN202211157830.6A
Publication of CN115497164A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements using pattern recognition or machine learning
    • G06V 10/764 Arrangements using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/803 Fusion of input or preprocessed data
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Arrangements using neural networks
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition


Abstract

The invention relates to a multi-view skeleton sequence fusion method based on graph convolution, which comprises the following steps: performing data enhancement and adjusting the data into multi-view skeleton sequences as the input of a multi-branch network; extracting the time-domain graph integration features of each view at each branch using a spatio-temporal graph convolutional network, extracting partial end points and connection points from the human skeleton as joint points, and splitting along the spatial dimension to obtain the time-domain graph integration representation of each joint point; constructing a multi-view fusion graph by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space; performing graph convolution according to the multi-view fusion graph and fusing the features to obtain the multi-view fused graph integration feature; and performing multi-branch joint prediction.

Description

Multi-view framework sequence fusion method based on graph convolution
Technical Field
The invention belongs to the field of artificial intelligence and computer vision, relates to a feature fusion technology, and particularly relates to a multi-view skeleton sequence fusion method based on graph convolution.
Background
Multiple cameras can capture the same action performer simultaneously from different viewpoints, thereby providing complementary information for many important visual tasks (e.g., human-computer interaction). In this setting, an important problem is multi-view complementation, which aims to use a multi-camera system to compensate for the occlusions and missing data that may occur at a single view.
The background art related to the invention is as follows:
(1) Adaptive graph convolutional networks (reference [1]): most existing work uses a predefined graph for graph convolution. However, the natural connection relations of the human body are not necessarily the most suitable edges for the behavior recognition task; moreover, since a neural network is layered, the optimal graph for each graph convolution layer is not necessarily the same. The invention therefore performs adaptive graph convolution by combining the natural topology of the human body, the correspondence of joint points between views, and the similarity of the joint points' features in the integration space.
(2) Feature fusion (reference [4]): most existing feature fusion methods follow the same procedure: the features to be fused are first preprocessed to unify their feature spaces, and a generic fusion strategy such as max fusion, average fusion, or bilinear pooling is then applied. Although this flow is versatile, it ignores the spatial information of the skeleton sequence. Therefore, while the decision layer continues to use the traditional flow, the invention introduces a graph convolutional network in the intermediate layers that fuses with spatial, i.e. joint-level, information.
(3) Multi-objective optimization (reference [2]): most multi-view action recognition methods adopt a multi-branch architecture, where each branch takes the skeleton sequence under a given view as input and the loss functions of the different branches are jointly optimized. Grid search aims to find a fixed optimal weight for each loss, but it transfers poorly. Another widely used strategy is to weight the losses automatically, with the weight of each loss learned jointly; however, this is task-agnostic and better suited to losses of different scales. Furthermore, data loss is common for certain views, which can lead to an imbalance between branches. The invention therefore proposes a multi-view data enhancement method and a deviation-based weighted multi-loss function to solve these problems.
(4) Similarity computation: in machine learning, the similarity between two targets is often evaluated by measuring the distance between samples. Common similarity measures include the Euclidean distance, cosine similarity, Hamming distance, and Manhattan distance. The invention uses the Euclidean distance as the similarity measure between integration-space features and thereby constructs a similarity adjacency matrix, as in the sketch below.
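As a non-limiting illustration of this computation, the following sketch (PyTorch is assumed, and the function and variable names are illustrative rather than taken from the patent) builds a similarity adjacency matrix from pairwise Euclidean distances between the integration-space features of the joint points; the Gaussian kernel used to turn distances into connection strengths is one possible choice among several.

    import torch

    def similarity_adjacency(features: torch.Tensor) -> torch.Tensor:
        """Build a similarity adjacency matrix from per-joint features.

        features: tensor of shape (c, v), i.e. c feature channels for each of
        the v joint nodes. Returns a (v, v) matrix whose entries grow as the
        Euclidean distance between two joints' integration-space features shrinks.
        """
        # Pairwise Euclidean distances between the v column vectors.
        dist = torch.cdist(features.t(), features.t(), p=2)      # (v, v)
        # Distance -> similarity via a Gaussian kernel (an assumed choice).
        sigma = dist.mean() + 1e-6
        sim = torch.exp(-dist ** 2 / (2 * sigma ** 2))
        # Row-normalise so each node's outgoing weights sum to 1.
        return sim / sim.sum(dim=1, keepdim=True)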
References
[1] Shi L, Zhang Y, Cheng J, et al. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 2020, 29: 9532-9545.
[2] Kendall A, Gal Y, Cipolla R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7482-7491.
[3] Zhang L, Shi Z, Cheng M M, Liu Y, Bian J W, Zhou J T, Zheng G, Zeng Z. Nonlinear regression via deep negative correlation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1933-1941.
[5] Azis N A, Jeong Y S, Choi H J, et al. Weighted averaging fusion for multi-view skeletal data and its application in action recognition. IET Computer Vision, 2016, 10(2): 134-142.
Disclosure of Invention
The invention aims to provide a multi-view skeleton sequence fusion method capable of improving recognition accuracy. The technical scheme is as follows:
A multi-view skeleton sequence fusion method based on graph convolution comprises the following steps:
Step one, performing data enhancement and adjusting the data into multi-view skeleton sequences as the input of a multi-branch network:
(1) Combining the skeleton sequences of a single action sequence under all views into a multi-view skeleton sequence;
(2) Rearranging the obtained multi-view skeleton sequence at the view level;
(3) Using the view-level rearranged multi-view skeleton sequences as the input of the multi-branch network, so that each branch receives all views of a sample at different times.
Step two, extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network:
(1) Selecting a spatio-temporal graph convolutional network with the same structure for each branch;
(2) Extracting the time-domain graph integration feature X of the input view at each branch using the selected spatio-temporal graph convolutional network, namely: inputting at each branch the view-level rearranged multi-view skeleton sequence, a tensor in R^(c_1 × t × v), wherein c_1 is the number of channels of the skeleton sequence, t is the number of frames of the skeleton sequence, i.e. the time dimension, and v is the number of joint points of the skeleton sequence, i.e. the spatial dimension; averaging over the time dimension while keeping the spatial dimension, and extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view, which form the time-domain graph integration features of the whole sequence, with X ∈ R^(c_2 × v), wherein c_2 is the number of channels of the time-domain graph integration feature X;
(3) Extracting partial end points and connection points from the human skeleton as joint points, and splitting along the spatial dimension to obtain the time-domain graph integration representation X_{:,j} of each joint point.
Step three, constructing a multi-view fusion graph M by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space:
(1) The connections between the joint points form the natural topological relation of the human body; an adjacency matrix A is constructed by combining the natural topology of the human body and the correspondence of joint points between views;
(2) A Laplace transform is applied to the adjacency matrix A to obtain the natural connection graph N;
(3) A corresponding similarity graph R is constructed from the extracted graph integration representation of each joint point;
(4) The natural connection graph N and the similarity graph R are stored in the form of adjacency matrices, whose element values represent the strength of the connection between nodes; matrix-level normalization is applied to N and R respectively, the elements at corresponding positions are then summed with weights, and the weighted fusion of N and R yields the multi-view fusion graph M.
Step four, performing graph convolution according to the multi-view fusion graph M and fusing the features X_1, X_2, ..., X_n to obtain the multi-view fused graph integration feature, namely:
(1) Splicing the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view along the spatial dimension to construct a feature vector X_{n+1} containing the joint points under all views;
(2) Inputting the feature vector X_{n+1} containing the joint points under all views, together with the multi-view fusion graph M, into a graph convolutional network to perform space-view graph convolution, obtaining the multi-view fused graph integration feature Y_{n+1}.
Step five, performing multi-branch joint prediction, namely:
(1) Applying global average pooling over the spatial dimension to the obtained multi-view fused graph integration feature Y_{n+1} and to the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view respectively, and feeding them into n+1 mutually independent fully-connected layers to obtain the preliminary classification predictions ŷ_1, ŷ_2, ..., ŷ_{n+1} ∈ R^class, wherein class is the number of classes of the sample;
(2) Mapping the data elements of the obtained preliminary classification predictions into (0, 1) using a sigmoid function;
(3) Performing decision-layer fusion on the mapped preliminary classification predictions by taking the maximum value as the final prediction result.
Step six, if the loss function corresponding to the end-to-end network framework has not converged, repeating steps two to five.
Further, the loss function corresponding to the end-to-end network framework is specifically:
(1) Using a cross entropy loss function to constrain the difference between the prediction result and the true value of each branch, taken as the loss function of that branch;
(2) Since the training samples of each branch are the same, the commonly used automatic loss weighting is not applicable; the reciprocal of each branch's prediction variance is taken as the weight of that branch's loss function;
(3) The loss function of the whole network is the weighted sum of the loss functions of all branches.
The technical scheme provided by the invention has the following beneficial effects:
1. When fusing multi-view inputs for skeleton-based behavior recognition, the method starts from spatio-temporal graph convolution and proposes a space-view graph convolution that exploits the correspondence of joint points under different views in the graph integration space. Compared with traditional fusion modes this is more intuitive, and the multi-view fusion features obtained in this way give better prediction results. In addition, the spatio-temporal graph convolutional network used to extract the graph integration can be chosen freely, which greatly enhances the flexibility of the model.
2. The invention provides a multi-view data enhancement strategy and a matching loss function, which improve the overall accuracy of the model while solving the branch-input imbalance problem of traditional multi-branch models.
3. The invention provides an end-to-end trainable multi-branch graph neural network model, which fuses the features of all views with a graph model and solves the problem of skeleton-based multi-view behavior recognition.
Drawings
FIG. 1: the natural topological relation of the human body (left) and the correspondence of joint points between views (right);
FIG. 2: example of the skeleton-based multi-view behavior recognition problem;
FIG. 3: overall diagram of the skeleton-based multi-view behavior recognition method;
FIG. 4: flow chart of the skeleton-based multi-view behavior recognition method;
FIG. 5: details of the multi-view graph convolution.
Detailed Description
Skeleton-based behavior recognition plays an important role in many applications of computer vision. The invention studies behavior recognition by fusing the skeleton sequences captured simultaneously from multiple views by different cameras, i.e. a multi-view skeleton sequence fusion method, and proposes a novel fusion method based on a graph convolutional network to solve this problem. First, data enhancement is performed, and the view-level full permutation of the multi-view skeleton sequence is used as the input of the model. A spatio-temporal graph convolutional network extracts, under each single view, the time-domain graph integration feature of each joint point, compressing the time dimension while preserving the spatial dimension, i.e. the joint points. A natural connection graph is then built from the natural topology of the human body, a correspondence graph is computed from the integration features according to the correspondence of joint points between views, and multi-view graph convolution is performed by combining the natural connection graph and the integration-based relation graph to obtain the multi-view fused graph integration feature, which is jointly optimized with the graph integration features of the single views for prediction. In view of the characteristics of this task, the invention also proposes a data-driven, deviation-based loss function to improve the training and prediction performance of the model. The fusion method improves the prediction accuracy of existing skeleton-based action recognition methods and achieves good cross-domain performance.
According to the multi-view skeleton sequence fusion method based on graph convolution, recognition accuracy is improved by fusing, at different stages, the representations of the same action sequence under multiple views; as shown in FIG. 2, skeleton sequences from different views are fused and used jointly for behavior recognition.
The invention models the skeleton-based multi-view action recognition problem with an arbitrary number of views as a multi-objective optimization problem and, combined with a multi-branch network, proposes a novel fusion method based on graph convolution to solve it; the flow is shown in FIG. 4. The overall structure of the model is shown in FIG. 3; for simplicity only the two-view case is drawn, and cases with more views can be extended on this basis. The first half of the network extracts the time-domain graph integration features of each view through a spatio-temporal graph convolutional network; the second half constructs an adaptive graph convolution by combining the natural topology of the human body, the correspondence of joint points between views, and the similarity of the joint points in the integration space, obtains the multi-view fused graph integration feature, and optimizes it jointly with the single-view graph integration features, thereby better solving the behavior recognition problem based on multi-view skeleton sequences. The invention can be applied to action sequences captured by multiple cameras in the same scene and can fuse the skeleton sequences under different views at multiple stages without restricting the number of views, i.e. multi-view skeleton sequence fusion. An overall sketch of this pipeline follows.
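To make the overall wiring easier to follow, a highly simplified sketch of the multi-branch pipeline is given below. It is an illustration under stated assumptions, not the patented implementation: the backbone constructor, the space-view graph convolution module, and all names (MultiViewFusionNet, backbone_fn, fusion_gcn) are hypothetical placeholders for the components described in the following sections.

    import torch
    import torch.nn as nn

    class MultiViewFusionNet(nn.Module):
        """Sketch: n single-view branches plus one fusion branch that applies a
        space-view graph convolution to the spliced per-joint features."""

        def __init__(self, n_views, backbone_fn, fusion_gcn, feat_dim, n_classes):
            super().__init__()
            # One spatio-temporal GCN backbone per view (identical structure).
            self.backbones = nn.ModuleList([backbone_fn() for _ in range(n_views)])
            self.fusion_gcn = fusion_gcn    # space-view graph convolution module
            self.heads = nn.ModuleList(
                [nn.Linear(feat_dim, n_classes) for _ in range(n_views + 1)])

        def forward(self, views, fusion_graph):
            # views: list of n tensors, each (batch, c1, t, v); one skeleton
            # sequence per view. fusion_graph: (n*v, n*v) multi-view fusion graph M.
            feats = []
            for x, branch in zip(views, self.backbones):
                f = branch(x)                               # assumed (batch, c2, t', v)
                feats.append(f.mean(dim=2))                 # average over time -> (batch, c2, v)
            x_spliced = torch.cat(feats, dim=-1)            # joints of all views side by side
            y_fused = self.fusion_gcn(x_spliced, fusion_graph)   # (batch, c2, n*v)
            logits = [head(f.mean(dim=-1))                  # global average pool over joints
                      for head, f in zip(self.heads, feats + [y_fused])]
            probs = [torch.sigmoid(z) for z in logits]      # map into (0, 1)
            return torch.stack(probs).max(dim=0).values     # decision-level max fusion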
The technical scheme of the invention is given below; the flow chart is shown in FIG. 4.
Step s1, performing data enhancement on the single-view skeleton sequences to form the input of the multi-branch network.
Step s2, extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network.
Step s3, constructing a multi-view fusion graph M by combining the topology of the human body, the correspondence of joint points between views, and the similarity of the joint points in the integration space.
Step s4, performing multi-view graph convolution according to the multi-view fusion graph M to obtain the multi-view fused graph integration feature Y_{n+1}.
Step s5, performing joint prediction by combining the single-view features X_1, X_2, ..., X_n with the multi-view fused feature Y_{n+1}.
Step s6, if the loss function corresponding to the end-to-end network framework has not converged, repeating steps s2 to s5.
The specific implementation steps of the multi-view skeleton sequence fusion method based on graph convolution are as follows:
(I) Data enhancement:
Before training, the single-view skeleton sequences are first adjusted into multi-view skeleton sequences. The specific steps are as follows:
(1) Combining the skeleton sequences of a single action sequence under all views into a multi-view skeleton sequence.
(2) Rearranging the obtained multi-view skeleton sequences at the view level.
Description 1: View-level rearrangement
The data used by the invention comprise skeleton sequences of the same action sequence captured simultaneously from multiple views; on the premise of keeping the original temporal order and spatial order of the skeleton sequences unchanged, the branch into which each single-view skeleton sequence is fed is changed.
(3) The full permutation of the multi-view skeleton sequence in the view dimension is used as the input of the network, that is, each branch receives all views of a sample at different times as input, so as to balance the inputs of the branches; a sketch of this rearrangement follows.
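A minimal sketch of this view-level rearrangement follows, assuming each sample is stored as a list of per-view skeleton sequences; the function name and data layout are assumptions made for illustration.

    from itertools import permutations

    def view_level_rearrangements(view_sequences):
        """view_sequences: list of n skeleton sequences, one per camera view,
        each of shape (c1, t, v). Returns every view-level ordering of the
        multi-view sample; feeding the branches with all orderings over the
        course of training lets every branch see every view."""
        return [list(order) for order in permutations(view_sequences)]

    # Usage sketch: for a two-view sample [seq_a, seq_b] this yields
    # [[seq_a, seq_b], [seq_b, seq_a]]; branch 1 receives view A in the first
    # arrangement and view B in the second, which balances the branch inputs.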
(II) Extracting the time-domain graph integration features of a single view: extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network.
The specific method for extracting the time-domain graph integration features is as follows:
(1) Selecting a spatio-temporal graph convolutional network with the same structure for each branch.
Description 2: Selection of the spatio-temporal graph convolutional network
The spatio-temporal graph convolutional network can be regarded as a feature extraction network; the current mainstream approach in skeleton-based behavior recognition is the spatio-temporal graph convolutional neural network, which regards the skeleton sequence as a spatio-temporal graph containing temporal and spatial connections and alternately performs convolutions along the time dimension and the spatial dimension. Any graph convolutional neural network of this kind can be used as the spatio-temporal graph convolutional network, and the retained dimensions and their sizes can be chosen freely.
(2) Inputting the single-view skeleton sequence, a tensor in R^(c_1 × t × v), at each branch using the selected spatio-temporal graph convolutional network, wherein c_1 is the number of channels of the skeleton sequence (in practice the two- or three-dimensional coordinates of the joint points), t is the number of frames of the skeleton sequence, i.e. the time dimension, and v is the number of joint points of the skeleton sequence, i.e. the spatial dimension. Averaging over the time dimension while keeping the spatial dimension, the time-domain graph integration feature of the whole sequence X ∈ R^(c_2 × v) is extracted, wherein c_2 is the number of channels of the time-domain graph integration feature X.
(3) Splitting along the spatial dimension, the time-domain graph integration representation X_{:,j} of each joint point is obtained.
Description 3: Graph integration representation of a joint point
The spatial dimension of the features extracted by the spatio-temporal graph convolutional network is consistent with the specification of the original skeleton sequence, while the time dimension is compressed; the feature vector extracted at a given position j can therefore be regarded as the temporal graph integration of that joint point, as in the sketch below.
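The temporal compression and the per-joint slicing can be sketched as follows, assuming the backbone output is a tensor of shape (c2, t', v); the names are illustrative only.

    import torch

    def graph_integration(features: torch.Tensor) -> torch.Tensor:
        """features: backbone output of shape (c2, t', v). Averages over the
        time dimension while keeping the spatial dimension, giving the
        time-domain graph integration feature X of shape (c2, v)."""
        return features.mean(dim=1)

    def joint_representation(X: torch.Tensor, j: int) -> torch.Tensor:
        """Slice along the spatial dimension: X[:, j] is the time-domain graph
        integration representation of joint j."""
        return X[:, j]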
(III) Acquiring the multi-view fusion graph M: constructing the multi-view fusion graph M by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space.
The specific method for computing the multi-view fusion graph M from the natural connection relations and the joint-point similarities is shown in FIG. 5:
(1) Partial end points and connection points are extracted from the human skeleton as joint points; the connections between the joint points form the natural topological relation of the human body, and the same joint points under different views also have a correspondence relation. An adjacency matrix A is constructed by combining the natural topology of the human body and the correspondence of joint points between views.
(2) A Laplace transform is applied to the adjacency matrix A to obtain the natural connection graph N.
(3) A similarity graph R between the joint points is constructed from the extracted graph integration representation of each joint point.
(4) The natural connection graph N and the similarity graph R are normalized and then fused with weights to obtain the multi-view fusion graph M.
Description 4: Weighted fusion of the graphs
As in reference [1], the natural connection graph N and the similarity graph R are stored in the form of adjacency matrices, whose element values represent the strength of the connection between nodes; matrix-level normalization is applied to N and R respectively, and the elements at corresponding positions are then summed with weights. The weights are trainable parameters shared over the whole matrix: N and R each share one weight over the entire graph, and the two weights sum to 1, which gives the graph greater flexibility. A sketch of this construction follows.
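Under the same assumptions as the earlier sketches (hypothetical names; the normalization and the distance-to-similarity mapping are assumed choices, not prescribed by the patent), the fusion graph M could be assembled roughly as follows: a Laplacian-style normalization of the adjacency matrix A gives the natural connection graph N, a similarity graph R is computed from the integration-space features, and a single trainable weight mixes the two.

    import torch
    import torch.nn as nn

    def laplacian_normalize(A: torch.Tensor) -> torch.Tensor:
        """Symmetric Laplacian-style normalization D^(-1/2) (A + I) D^(-1/2)
        of the (nv x nv) adjacency matrix A."""
        A = A + torch.eye(A.size(0), device=A.device)     # keep self-connections
        d_inv_sqrt = torch.diag(A.sum(dim=1).clamp(min=1e-6).pow(-0.5))
        return d_inv_sqrt @ A @ d_inv_sqrt

    class FusionGraph(nn.Module):
        """Weighted fusion of the natural connection graph N and the similarity
        graph R, with one trainable mixing weight shared over the whole matrix."""

        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(0.0))  # mixing weight (pre-sigmoid)

        def forward(self, A: torch.Tensor, X_all: torch.Tensor) -> torch.Tensor:
            # A: (nv, nv) adjacency from body topology + cross-view correspondence.
            # X_all: (c2, nv) spliced per-joint integration features of all views.
            N = laplacian_normalize(A)
            dist = torch.cdist(X_all.t(), X_all.t())      # Euclidean distances
            R = torch.softmax(-dist, dim=1)               # distance -> row-normalized similarity
            a = torch.sigmoid(self.alpha)                 # both weights in (0, 1), summing to 1
            return a * N + (1 - a) * R                    # multi-view fusion graph M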
(IV) Fusing the single-view features: performing graph convolution according to the multi-view fusion graph M and fusing the features X_1, X_2, ..., X_n to obtain the multi-view fused graph integration feature Y_{n+1}.
The method for fusing the single-view graph integration features is as follows:
(1) The extracted integration features X_1, X_2, ..., X_n are spliced along the spatial dimension to construct a feature vector X_{n+1} containing the joint points under all views.
(2) To suppress the over-smoothing tendency of the graph, the output of (1), i.e. the feature vector X_{n+1} containing the joint points under all views, together with the multi-view fusion graph M, is input into a graph convolution block, and space-view graph convolution is performed using the same spatio-temporal graph convolutional network structure used for feature extraction, obtaining the multi-view fused graph integration feature Y_{n+1}.
Description 5: Space-view graph convolution
After the time dimension is compressed, its position is taken by the view dimension. The original skeleton sequence can be regarded as a complete spatio-temporal graph; here, the data can instead be regarded as a space-view graph, the difference being that the spatial coordinates corresponding to the joint points of the original graph are replaced by the graph integration representation. The space-view graph convolution can be performed in the same way as a spatio-temporal graph convolution; since the number of nodes is small at this point, the whole graph can be convolved directly, as in the sketch below.
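A minimal space-view graph convolution layer consistent with this description might look like the sketch below (an assumption, not the patented layer): the fused graph M aggregates information across the joint nodes of all views, after which a 1x1 convolution mixes the channels.

    import torch
    import torch.nn as nn

    class SpaceViewGraphConv(nn.Module):
        """One graph convolution over the multi-view fusion graph M.
        Input X: (batch, c_in, n*v) spliced joint features of all views.
        M:       (n*v, n*v) fusion graph. Output: (batch, c_out, n*v)."""

        def __init__(self, c_in: int, c_out: int):
            super().__init__()
            self.proj = nn.Conv1d(c_in, c_out, kernel_size=1)  # per-node channel mixing
            self.relu = nn.ReLU(inplace=True)

        def forward(self, X: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
            X = torch.einsum('bcu,uw->bcw', X, M)   # aggregate neighbours through M
            return self.relu(self.proj(X))

    # Usage sketch: Y_fused = SpaceViewGraphConv(c2, c2)(X_spliced, M)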
(V) Multi-branch joint prediction and optimization
The joint prediction and optimization method for the single-view branches and the fused multi-view branch is as follows:
(1) The obtained multi-view fused graph integration feature Y_{n+1} and the extracted integration features X_1, X_2, ..., X_n are each subjected to global average pooling over the spatial dimension and fed into n+1 mutually independent fully-connected layers to obtain the preliminary classification predictions ŷ_1, ŷ_2, ..., ŷ_{n+1} ∈ R^class, where class is the number of classes of the sample.
(2) The data elements of the obtained preliminary classification predictions are mapped into (0, 1) using the sigmoid function.
(3) Decision-layer fusion is performed on the mapped preliminary classification predictions by taking the maximum value as the final prediction result.
Description 6: Decision-level fusion
According to the processing stage, the fusion of multi-branch models can be divided into data fusion, feature fusion, model fusion, and decision-layer fusion (reference [5]); the invention adopts decision-layer fusion.
The prediction results of the branches belong to the same feature space, so several traditional fusion modes could be adopted; reference [4] compares traditional fusion modes at different levels of the model.
The invention adopts maximum-value fusion and takes, for each class, the maximum of the branches' predicted values for that class as the output; a sketch of this fusion follows.
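A sketch of this decision-level fusion under the assumptions above: each branch's preliminary prediction is mapped into (0, 1) with a sigmoid and, per class, the maximum over the branches is taken as the final score.

    import torch

    def decision_level_max_fusion(branch_logits):
        """branch_logits: list of n+1 tensors, each of shape (batch, n_classes).
        Returns the fused (batch, n_classes) scores and the predicted class index."""
        probs = torch.stack([torch.sigmoid(z) for z in branch_logits])  # (n+1, batch, n_classes)
        fused = probs.max(dim=0).values        # per-class maximum over the branches
        return fused, fused.argmax(dim=1)      # final prediction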
Description 7: Loss function construction
Step 1: single-branch loss function. Considering that step (I) rearranges the input data, the branches are trained on the same data in different orders. A conventional cross entropy loss function is therefore employed:
L = -(y·log(ŷ) + (1-y)·log(1-ŷ))
where ŷ represents the output of the network, with values in (0, 1), and y represents the true value, which can only take 0 or 1.
Step 2: deviation-based loss function. Reference [3] shows that the generalization error of an ensemble model depends on the covariance between its sub-models, provided that its deviation is smaller than that of the sub-models. Since each branch can predict independently, the multi-branch model can be viewed as an ensemble model that integrates the individual branches. Although training on the same data reduces the covariance between branches, it may also trap the ensemble model in a local optimum, so the invention constructs a deviation-based loss function:
L_total = Σ_{i=1}^{m+1} w_i · L_i
Assume an input of m views, where w_i is the weight of branch i and the branch weights sum to 1. When i ∈ [1, m], ŷ_i is the prediction result of a single-view branch; when i = m+1, ŷ_i is the prediction result of the multi-view fusion branch; y represents the true value. The weight w_i is inversely related to the deviation of each branch, and a variance loss is used to measure that deviation, so that
w_i = (1/Var_i) / Σ_{j=1}^{m+1} (1/Var_j)
In this way, the branches with a higher degree of fitting obtain a larger effective learning rate while the changes in the other branches are effectively restrained. On the premise of ensuring basic accuracy, the deviation-based loss function makes it easier for each branch to converge to the global optimum, thereby preventing overfitting of the model and finally improving its overall accuracy. A sketch of this weighting follows.
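A sketch of this deviation-based weighting follows. It assumes the weight of each branch is the normalized reciprocal of the variance of its prediction error; the exact deviation measure is not spelled out in the text above, so that choice is an assumption, as are all names.

    import torch
    import torch.nn.functional as F

    def deviation_weighted_loss(branch_probs, target):
        """branch_probs: list of m+1 tensors, each (batch, n_classes), already in (0, 1).
        target: (batch,) integer class labels.
        Each branch's weight is proportional to the reciprocal of the variance of its
        prediction error, and the total loss is the weighted sum of the per-branch
        cross entropy losses (weights sum to 1)."""
        one_hot = F.one_hot(target, branch_probs[0].size(1)).float()
        losses, inv_vars = [], []
        for p in branch_probs:
            losses.append(F.binary_cross_entropy(p, one_hot))              # per-branch loss
            inv_vars.append(1.0 / ((p - one_hot).var().detach() + 1e-6))   # deviation -> weight
        w = torch.stack(inv_vars)
        w = w / w.sum()                        # normalize so the weights sum to 1
        return (w * torch.stack(losses)).sum()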
(VI) Judging whether the model has finished training
The specific method for judging whether the model has finished training is as follows:
During the training of the neural network, whether training can be stopped is judged from the value of the loss function; training may be stopped when the loss function has dropped and remains essentially unchanged, as in the sketch below.
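One simple, assumed way to implement this stopping criterion is to compare the average loss over two consecutive windows of epochs and stop when the improvement falls below a small tolerance:

    def should_stop(loss_history, window=10, tol=1e-4):
        """Stop when the average improvement over the last `window` epochs is
        smaller than `tol`, i.e. the loss curve has become essentially flat."""
        if len(loss_history) < 2 * window:
            return False
        prev = sum(loss_history[-2 * window:-window]) / window
        curr = sum(loss_history[-window:]) / window
        return (prev - curr) < tol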

Claims (2)

1. A multi-view skeleton sequence fusion method based on graph convolution, comprising the following steps:
Step one, performing data enhancement and adjusting the data into multi-view skeleton sequences as the input of a multi-branch network:
(1) Combining the skeleton sequences of a single action sequence under all views into a multi-view skeleton sequence;
(2) Rearranging the obtained multi-view skeleton sequence at the view level;
(3) Using the view-level rearranged multi-view skeleton sequences as the input of the multi-branch network, so that each branch receives all views of a sample at different times;
Step two, extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view at each branch using a spatio-temporal graph convolutional network:
(1) Selecting a spatio-temporal graph convolutional network with the same structure for each branch;
(2) Extracting the time-domain graph integration feature X of the input view at each branch using the selected spatio-temporal graph convolutional network, namely: inputting at each branch the view-level rearranged multi-view skeleton sequence, a tensor in R^(c_1 × t × v), wherein c_1 is the number of channels of the skeleton sequence, t is the number of frames of the skeleton sequence, i.e. the time dimension, and v is the number of joint points of the skeleton sequence, i.e. the spatial dimension; averaging over the time dimension while keeping the spatial dimension, and extracting the time-domain graph integration features X_1, X_2, ..., X_n of each view, which form the time-domain graph integration features of the whole sequence, with X ∈ R^(c_2 × v), wherein c_2 is the number of channels of the time-domain graph integration feature X;
(3) Extracting partial end points and connection points from the human skeleton as joint points, and splitting along the spatial dimension to obtain the time-domain graph integration representation of each joint point;
Step three, constructing a multi-view fusion graph M by combining the natural topology of the human body, the correspondence of joint points between views, and the graph integration features of the joint points in the integration space:
(1) The connections between the joint points form the natural topological relation of the human body; an adjacency matrix A is constructed by combining the natural topology of the human body and the correspondence of joint points between views;
(2) A Laplace transform is applied to the adjacency matrix A to obtain the natural connection graph N;
(3) A corresponding similarity graph R is constructed from the extracted graph integration representation of each joint point;
(4) The natural connection graph N and the similarity graph R are stored in the form of adjacency matrices, whose element values represent the strength of the connection between nodes; matrix-level normalization is applied to N and R respectively, the elements at corresponding positions are then summed with weights, and the weighted fusion of N and R yields the multi-view fusion graph M;
Step four, performing graph convolution according to the multi-view fusion graph M and fusing the features X_1, X_2, ..., X_n to obtain the multi-view fused graph integration feature, namely:
(1) Splicing the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view along the spatial dimension to construct a feature vector X_{n+1} containing the joint points under all views;
(2) Inputting the feature vector X_{n+1} containing the joint points under all views, together with the multi-view fusion graph M, into a graph convolutional network to perform space-view graph convolution, obtaining the multi-view fused graph integration feature Y_{n+1};
Step five, performing multi-branch joint prediction, namely:
(1) Applying global average pooling over the spatial dimension to the obtained multi-view fused graph integration feature Y_{n+1} and to the extracted time-domain graph integration features X_1, X_2, ..., X_n of each view respectively, and feeding them into n+1 mutually independent fully-connected layers to obtain the preliminary classification predictions ŷ_1, ŷ_2, ..., ŷ_{n+1} ∈ R^class, wherein class is the number of classes of the sample;
(2) Mapping the data elements of the obtained preliminary classification predictions into (0, 1) using a sigmoid function;
(3) Performing decision-layer fusion on the mapped preliminary classification predictions by taking the maximum value as the final prediction result;
Step six, if the loss function corresponding to the end-to-end network framework has not converged, repeating steps two to five.
2. The multi-view skeleton sequence fusion method of claim 1, wherein the loss function corresponding to the end-to-end network framework is specifically:
(1) Using a cross entropy loss function to constrain the difference between the prediction result and the true value of each branch, taken as the loss function of that branch;
(2) Since the training samples of each branch are the same, the commonly used automatic loss weighting is not applicable; the reciprocal of each branch's prediction variance is taken as the weight of that branch's loss function;
(3) The loss function of the whole network is the weighted sum of the loss functions of all branches.
CN202211157830.6A (priority and filing date 2022-09-22) Multi-view framework sequence fusion method based on graph convolution; status: Pending; publication: CN115497164A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211157830.6A CN115497164A (en) 2022-09-22 2022-09-22 Multi-view framework sequence fusion method based on graph convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211157830.6A CN115497164A (en) 2022-09-22 2022-09-22 Multi-view framework sequence fusion method based on graph convolution

Publications (1)

Publication Number: CN115497164A (en)
Publication Date: 2022-12-20

Family

ID=84470862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211157830.6A Pending CN115497164A (en) 2022-09-22 2022-09-22 Multi-view framework sequence fusion method based on graph convolution

Country Status (1)

Country Link
CN (1) CN115497164A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524601A (en) * 2023-06-21 2023-08-01 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN116524601B (en) * 2023-06-21 2023-09-12 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot

Similar Documents

Publication Publication Date Title
WO2022036777A1 (en) Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111291809B (en) Processing device, method and storage medium
CN109753903A (en) A kind of unmanned plane detection method based on deep learning
Huang et al. Immature apple detection method based on improved Yolov3
CN112990211A (en) Neural network training method, image processing method and device
CN112560918B (en) Dish identification method based on improved YOLO v3
CN108171249B (en) RGBD data-based local descriptor learning method
CN111881840B (en) Multi-target tracking method based on graph network
CN110443849B (en) Target positioning method for double-current convolution neural network regression learning based on depth image
CN110222718A (en) The method and device of image procossing
CN108764244A (en) Potential target method for detecting area based on convolutional neural networks and condition random field
CN114998525A (en) Action identification method based on dynamic local-global graph convolutional neural network
CN115937774A (en) Security inspection contraband detection method based on feature fusion and semantic interaction
CN115359366A (en) Remote sensing image target detection method based on parameter optimization
CN115497164A (en) Multi-view framework sequence fusion method based on graph convolution
CN115561834A (en) Meteorological short-term and temporary forecasting all-in-one machine based on artificial intelligence
CN114724021B (en) Data identification method and device, storage medium and electronic device
CN115187786A (en) Rotation-based CenterNet2 target detection method
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN111914938A (en) Image attribute classification and identification method based on full convolution two-branch network
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN112508863B (en) Target detection method based on RGB image and MSR image double channels
CN113609904A (en) Single-target tracking algorithm based on dynamic global information modeling and twin network

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination