CN114036969B - 3D human body action recognition algorithm under multi-view condition - Google Patents


Info

Publication number
CN114036969B
CN114036969B (application CN202110280476.5A; published as CN114036969A)
Authority
CN
China
Prior art keywords
layer
joint
neural network
coordinates
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110280476.5A
Other languages
Chinese (zh)
Other versions
CN114036969A (en)
Inventor
石昕
邵慧杨
翟庆庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202110280476.5A
Publication of CN114036969A
Application granted
Publication of CN114036969B
Legal status: Active (granted)

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a 3D human body action recognition algorithm under multi-view conditions, divided into single-view 3D pose estimation and multi-view 3D pose estimation. Single-view 3D pose estimation falls into two subcategories: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a deep neural network; the second uses convolutional neural networks to infer 3D coordinates directly from images. Multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation: the 2D joint coordinates from all views are concatenated into one batch and fed to a fully connected network trained to predict global 3D joint coordinates. The advantage of the invention is that it provides a 3D human body action recognition algorithm under multi-view conditions that detects and recognizes human actions with a computer vision recognition algorithm and converts them into a data display the user can understand.

Description

3D human body action recognition algorithm under multi-view condition
Technical Field
The invention relates to the fields of computer vision recognition, real-time data visualization, and big-data parallel processing, in particular to a 3D human body action recognition algorithm under multi-view conditions.
Background
With the development and progress of society, human behavior recognition technology plays an increasingly important role and has a wide range of application scenarios. Three-dimensional human body model reconstruction and motion recognition are hot topics in current computer vision research. They aim to extract and analyze the motion in a video through various image-processing and recognition/classification techniques and to construct a complete three-dimensional human body model in order to judge the action performed by the person in the video, thereby obtaining useful information; the applications are very broad. Human behavior recognition technology can be applied to video surveillance (schools, canteens, companies, and similar environments), human-computer interaction (scenes such as train stations), automatic commentary for football or basketball, and other fields.
Furthermore, human pose recognition is a very important area of computer vision. Depending on the final goal and the assumptions made, several different research directions can be distinguished:
(1) Predicting two-dimensional or three-dimensional human motion.
(2) Predicting human motion from a single frame or from a sequence of frames in a video.
(3) Predicting human motion from a single camera or from multiple cameras.
In this invention we focus only on recognizing human actions in three-dimensional space, within a fixed frame range, under multi-camera conditions. From a broader perspective, the action detection framework provided by the invention can serve as a unified recognition framework that recognizes human actions in both 2D and 3D.
3D human motion recognition is a fundamental problem in computer vision, with applications in sports action recognition, computer-aided live broadcasting, human-computer interaction, special-effects production, and so on. Most conventional algorithms currently focus on 3D human motion prediction from a single view. Although scholars have recently produced much related work, recognition of human motion under multi-camera conditions is far from solved. Therefore, the invention provides a 3D human motion recognition algorithm under multi-view conditions.
Human action recognition under multi-view conditions has high research value for two reasons. First, in complex outdoor scenes, multi-view human motion recognition is indisputably the best motion recognition method, because competing technologies such as marker-based motion capture and visual-inertial methods have limitations, for example the inability to capture rich pose representations (estimating hand, face, and limb poses) among various other restrictions. A disadvantage of previous work is that datasets were built with multi-view triangulation relying on too many, almost impractical, views to obtain 3D ground-truth actions of sufficient quality. This makes collecting new datasets for 3D pose recognition very challenging, and there is an urgent need to reduce the number of views required for accurate triangulation. Second, in some cases the algorithm can directly use a human pose tracking algorithm to track the pose in real time and thereby achieve the final goal of recognizing the action, because multi-camera configurations are becoming increasingly available in applications such as sports or computer-assisted living. In such cases the accuracy of modern multi-view methods is comparable to that of well-developed monocular methods. Thus, improving the accuracy of multi-view pose estimation from few views is a significant challenge with direct practical applications.
Disclosure of Invention
The invention aims to provide a 3D human body action recognition algorithm under multi-view conditions that detects and recognizes human actions with a computer vision recognition algorithm and converts them into a data display the user can understand.
The technical scheme adopted by the invention is as follows: a 3D human motion recognition algorithm under multi-view conditions, characterized in that 3D pose estimation is performed by a multi-angle information aggregation method after multi-view 2D pose estimation.
Regarding single-view 3D pose estimation, two subcategories exist: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a deep neural network (fully connected, convolutional, or recurrent); the second uses deep convolutional neural networks to infer 3D coordinates directly from images. The 3D human motion recognition algorithm uses the first type of method as its main framework, with a deep convolutional neural network as the high-quality 2D pose estimation engine;
regarding multi-view 3D pose estimation, which aims to obtain ground-truth annotations for monocular 3D human pose estimation, the 2D joint coordinates from all views are concatenated into one batch and used as input to a fully connected network trained to predict global 3D joint coordinates. The method that concatenates the 2D coordinates in the same coordinate system is called the multi-angle information aggregation method.
A deep convolutional neural network is a feedforward neural network that involves convolution operations and has a deep multi-layer structure. Its input layer can process multidimensional data: the input layer of a one-dimensional convolutional neural network receives one- or two-dimensional arrays (or even three-dimensional data), where one-dimensional arrays are usually time series and two-dimensional arrays are mostly grayscale images; the input layer of a two-dimensional convolutional neural network receives the three-dimensional array of an RGB image;
the hidden layer of the deep convolutional neural network comprises a convolutional layer, a pooling layer and a full-connection layer 3 type structure; the convolution kernel in the convolution layer contains weight coefficients, the pooling layer does not contain weight coefficients, the function of the convolution layer is to perform feature extraction on input data, the convolution layer internally contains a plurality of convolution kernels, each element composing the convolution kernels corresponds to a weight coefficient and a deviation amount, and the convolution kernels are similar to neurons of a feedforward neural network; the algorithm of the convolution layer is as follows:
After feature extraction by the convolutional layer, the output feature map is passed to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function whose role is to replace the value at a single point of the feature map with a statistic of its neighboring region. The pooling layer selects pooling regions in the same way the convolution kernel scans the feature map, controlled by pooling size, stride, and padding. The general expression is:
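The pooling expression is likewise missing from this text. The standard Lp-pooling form that matches the description (pooling size f, stride s_0) is:

A_k^l(i, j) = [ Σ_{x=1..f} Σ_{y=1..f} A_k^l(s_0·i + x, s_0·j + y)^p ]^{1/p}

where p = 1 corresponds to average pooling (up to a constant factor) and p → ∞ to max pooling; again a standard reconstruction, not the patent's original figure.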
the output layer upstream of the convolutional neural network is usually a fully-connected layer, and the structure and the working principle of the convolutional neural network are the same as those of the output layer of the traditional feedforward neural network.
The multi-angle information aggregation method is a method for converting between multi-angle human body coordinate systems, whose concrete form is an algebraic triangulation transform. Each joint j is processed separately using the triangulation transform. The method builds on triangulation in 2D coordinates, where the human joint coordinate information comes from heat maps taken at different angles in the action recognition framework: H_{c,j} = h_θ(I_c)_j. To estimate the 2D joint position information, a softmax over the spatial axes is first computed:
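The softmax expression itself is not reproduced in this text. Consistent with the inverse temperature parameter α discussed below, the standard spatial softmax over an H×W heat map is:

H'_{c,j} = exp(α·H_{c,j}) / Σ_{r_x=1..W} Σ_{r_y=1..H} exp(α·H_{c,j}(r_x, r_y))

(a reconstruction under the stated assumptions, not the patent's original figure).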
Next, the center of mass of each joint's 2D position heat map is computed as that joint's position estimate; this operation is called soft-argmax.
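The soft-argmax expression is also missing; the standard center-of-mass form consistent with the description is:

x_{c,j} = Σ_{r_x=1..W} Σ_{r_y=1..H} r · H'_{c,j}(r), with r = (r_x, r_y),

i.e. the expected 2D coordinate under the softmax-normalized heat map.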
an important feature of Soft-argmax is that the index of the maximum feature is not obtained, and the heat map H is convenient c Carrying out gradient back propagation; the two-dimensional human body recognition frame uses Loss to pretrain, the joint heat in the graph is adjusted by multiplying the heat graph and the reverse heat parameter alpha, and the maximum possible position is output at the beginning stage of the training process of soft-argmax;
From the 2D joint position information x_{c,j}, three-dimensional joint position information is inferred using a linear triangulation method, which reduces the search space for the 3D coordinates of joint y_j by solving an overdetermined system of equations on the homogeneous 3D coordinate vector of the joint:

A_j y_j = 0

where A_j is the matrix composed from the components of x_{c,j} and the camera projection matrices.
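The description does not spell out how the overdetermined system is solved. In the standard direct linear transform (DLT) treatment, the solution is the unit vector minimizing ||A_j y||, i.e. the right singular vector of A_j associated with its smallest singular value:

y_j = argmin_{||y||=1} ||A_j y||_2, obtained as the last row of Vᵀ in the SVD A_j = U·Σ·Vᵀ.

This is a standard completion, not text taken from the patent.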
The linear triangulation method is as follows: assuming the joint coordinates of each view are independent of one another, they all contribute comparably to the triangulation; learnable weights w_c are therefore attached to the coefficient matrices under different angles:

w_j = (ω_{1,j}, ω_{2,j}, …, ω_{C,j}), giving the weighted system (w_j ∘ A_j) y_j = 0,

where the ∘ operator denotes the Hadamard product and the weight ω_{c,j} is the output of a convolutional neural network. The input to the method is a set of RGB images with known camera parameters. The 2D human recognition algorithm generates a heat map of each joint and a confidence for each camera's joints; by applying soft-argmax, the 2D position of the joint is inferred from the 2D joint heat map, and the 2D positions and confidences are passed together to an algebraic triangulation module that outputs the triangulated 3D pose. All modules allow gradients to be back-propagated, so the model can be trained end to end.
The advantages of the first class of single-view 3D pose estimation are: it is simple and fast, it can be trained (with skeleton/view augmentation) on motion-capture data, and the 2D skeleton can be swapped after training.
The advantages of multi-view 3D pose estimation include: the approach can effectively use information from different views and can be trained on motion-capture data.
In fact, few current mainstream studies use volumetric pose representations in a multi-view setup; in particular, they unproject 2D keypoint probability heat maps (obtained from a pre-trained 2D keypoint detector) into a volume and then aggregate them in a non-learnable way. Our work differs in two respects. First, we process the information within the volume in a learnable manner. Second, we train the network end to end, thereby tuning the 2D backbone and alleviating the need for interpretable 2D heat maps. This allows several self-consistent pose hypotheses to be transferred from the 2D detector to the volumetric aggregation stage, which previous designs could not do.
There have also been studies that use a multi-stage method to infer the 3D pose from 2D joint coordinates with an external 3D pose prior. In the first stage, the images from all views are passed through a deep convolutional neural network to obtain 2D joint heat maps. The maximum locations in the heat maps are used jointly to infer the 3D pose by optimizing latent coordinates in the 3D pose prior space. At each subsequent stage, the 3D pose is re-projected to all camera views and fused with the predictions of the previous stage (through a convolutional network). The 3D pose is then re-estimated from the heat-map maxima, and the process is repeated. This procedure allows the 2D joint heat-map predictions to be corrected through indirect global reasoning about the human pose. In contrast to our approach, no gradient flows from the 3D predictions to the 2D heat maps, so there is no direct signal to correct the 3D coordinate predictions.
A 3D human motion recognition algorithm under multi-view conditions recognizes human motion in three-dimensional space, within a fixed frame range, under multi-camera conditions. The action detection framework can serve as a unified recognition framework that recognizes human actions in 2D and 3D simultaneously, and 2D action recognition can be quickly extended to 3D action recognition through this framework. We use this framework to add human bones, joints, and various constraints from the pictures in three-dimensional space.
Regarding the action recognition framework, assume that C cameras have been synchronized to a unified global coordinate system using projection matrices, which facilitates obtaining human data in the scene. Our goal is to estimate the position y_{j,t} of the three-dimensional joint point j ∈ {1, …, J} of the human body at time t in the global coordinate system. For each frame we use an off-the-shelf 2D human detection algorithm, or the bounding box provided with the dataset, to crop the image. The cropped images I_c are then used as training data for the deep convolutional neural network framework.
The deep convolutional neural network framework consists of a ResNet-152 backbone (parameter weights θ, network output g_θ), a transposed convolution layer that outputs a series of intermediate heat maps (output f_θ), and a convolutional neural network with 1×1 kernels that converts the intermediate heat maps into interpretable joint heat maps (output h_θ, with as many output channels as there are joints).
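For illustration only, a minimal PyTorch sketch of such a 2D backbone is given below; the layer sizes, names, and the choice of two transposed-convolution stages are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn
import torchvision

class Backbone2D(nn.Module):
    """Hypothetical sketch: ResNet-152 trunk g_theta, transposed-conv
    head f_theta, and a 1x1 conv h_theta with one channel per joint."""
    def __init__(self, num_joints: int = 17):
        super().__init__()
        resnet = torchvision.models.resnet152(weights=None)
        # g_theta: keep everything up to the last residual stage (drop avgpool/fc)
        self.g_theta = nn.Sequential(*list(resnet.children())[:-2])
        # f_theta: upsample the 2048-channel features into intermediate heat maps
        self.f_theta = nn.Sequential(
            nn.ConvTranspose2d(2048, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # h_theta: 1x1 conv -> one interpretable heat map per joint
        self.h_theta = nn.Conv2d(256, num_joints, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.h_theta(self.f_theta(self.g_theta(image)))  # (B, J, H, W)
```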
The advantage of the invention is that it provides a 3D human body action recognition algorithm under multi-view conditions that detects and recognizes human actions with a computer vision recognition algorithm and converts them into a data display the user can understand.
Drawings
FIG. 1 is a schematic diagram of a method for recognizing human motion from multiple angles in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep convolutional neural network in accordance with one embodiment of the present invention;
fig. 3 is a schematic structural diagram of a multi-angle information aggregation method according to an embodiment of the present invention.
Detailed Description
The invention provides a 3D human motion recognition algorithm under multi-view conditions, characterized in that the human action recognition algorithm is divided into single-view 3D pose estimation and multi-view 3D pose estimation.
Regarding single-view 3D pose estimation, it can be divided into two subcategories: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a deep neural network (fully connected, convolutional, or recurrent); the second uses deep convolutional neural networks to infer 3D coordinates directly from images. The invention uses the first type of method as its main framework, with a deep convolutional neural network as the high-quality 2D pose estimation engine.
Deep convolutional neural network
A deep convolutional neural network is a feedforward neural network that involves convolution operations and has a deep multi-layer structure; it is one of the representative algorithms of deep learning. Deep convolutional neural networks have feature-learning capability and can perform translation-invariant classification of input information according to their hierarchical structure, so they are also called "translation-invariant artificial neural networks". In recent years, convolutional neural networks have achieved striking results on various image recognition tasks. Therefore, the invention uses a deep convolutional neural network as the 2D pose estimation engine; a structural diagram of the convolutional neural network is shown in FIG. 2.
The input layer of a deep convolutional neural network can process multidimensional data. Commonly, the input layer of a one-dimensional convolutional neural network receives one- or two-dimensional arrays (or even three-dimensional data), where one-dimensional arrays are usually time series and two-dimensional arrays are mostly grayscale images; the input layer of a two-dimensional convolutional neural network receives the three-dimensional array of an RGB image.
The hidden layers of the deep convolutional neural network comprise three common structures: convolutional layers, pooling layers, and fully connected layers. Among these, the convolutional layer and the pooling layer are specific to deep convolutional neural networks. The convolution kernels in a convolutional layer contain weight coefficients, while the pooling layer does not. The function of the convolutional layer is to extract features from the input data; it contains multiple convolution kernels, each element of which corresponds to a weight coefficient and a bias, analogous to a neuron of a feedforward neural network. The convolutional layer computes the convolution expression given above.
after the feature extraction is performed by the convolution layer, the output feature map is transferred to the pooling layer for feature selection and information filtering. The pooling layer contains a predefined pooling function that functions to replace the results of individual points in the feature map with the feature map statistics of its neighboring regions. The pooling layer selects pooling area and the step of the convolution kernel scanning characteristic diagram are the same, and the pooling area, step length and filling are controlled. The general expression form is:
the output layer upstream of the convolutional neural network is usually a fully-connected layer, so that the structure and the working principle of the convolutional neural network are the same as those of the output layer of the traditional feedforward neural network. For the human motion recognition problem, the output layer is a classification label of different motions, and the specific expression form is shown in fig. 2.
Regarding multi-view 3D pose estimation, which aims to obtain ground-truth annotations for monocular 3D human pose estimation, the 2D joint coordinates from all views are concatenated into one batch and used as input to a fully connected network trained to predict global 3D joint coordinates. The method that concatenates the 2D coordinates in the same coordinate system is called the multi-angle information aggregation method, a novel multi-angle human body coordinate system conversion method provided by the invention.
Multi-angle information aggregation method
The multi-angle information aggregation method takes the concrete form of an algebraic triangulation transform. We can use the triangulation transform to process each joint j individually. The method builds on triangulation in 2D coordinates, where the human joint coordinate information comes from heat maps taken at different angles in the action recognition framework: H_{c,j} = h_θ(I_c)_j. To estimate the 2D joint position information, we first compute the softmax over the spatial axes, i.e. the spatial softmax with inverse temperature α given above.
the parameter α will be discussed later, and then we calculate the center position of the 2D position information heat map of each node as the position estimate (called soft-argmax) for that node.
An important feature of soft-argmax is that it does not take the index of the maximum feature, which makes gradient back-propagation through the heat map H_c possible. Because the two-dimensional human recognition framework is pre-trained using a loss, we adjust the joint heat in the map by multiplying the heat map by the inverse temperature parameter α, so that soft-argmax outputs the most likely position from the start of the training process.
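The spatial softmax and soft-argmax just described can be written in a few lines of numpy; this is an illustrative sketch with assumed names and shapes, not the patent's code:

```python
import numpy as np

def soft_argmax(H: np.ndarray, alpha: float = 100.0) -> np.ndarray:
    """Differentiable 2D position estimate: spatial softmax over the heat map,
    followed by the center of mass of the resulting probability map."""
    # Spatial softmax with inverse temperature alpha (max subtracted for stability)
    e = np.exp(alpha * (H - H.max()))
    P = e / e.sum()
    # Expected (x, y) coordinate under P
    ys, xs = np.mgrid[0:H.shape[0], 0:H.shape[1]]
    return np.array([(xs * P).sum(), (ys * P).sum()])

# Example: a heat map peaked near (x=12, y=5) yields approximately (12, 5)
H = np.zeros((32, 32))
H[5, 12] = 10.0
print(soft_argmax(H))
```

Subtracting the maximum before exponentiation is a standard numerical-stability trick and does not change the softmax result.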
To infer three-dimensional joint position information from the 2D joint position information x_{c,j}, we use a linear triangulation method. The method reduces the search space for the 3D coordinates of joint y_j by solving an overdetermined system of equations on the homogeneous 3D coordinate vector of the joint:

A_j y_j = 0

where A_j is the matrix composed from the components of x_{c,j} and the camera projection matrices.
The naive triangulation algorithm assumes that the joint coordinates from each view are independent of one another and thus contribute comparably to the triangulation. On some views, however, the 2D joint position cannot be estimated reliably (e.g., due to occlusion), leading to unsatisfactory triangulation results. This greatly exacerbates the tendency of methods that optimize the algebraic reprojection error to produce unbalanced errors in different directions. The problem can be addressed by using RANSAC together with a Huber loss (to score the reprojection errors of the inliers). However, this has drawbacks of its own: using RANSAC, for example, may completely cut off the gradient flow to the excluded cameras. To address this, we add learnable weights w_c to the coefficient matrices of the different angles:
w_j = (ω_{1,j}, ω_{2,j}, …, ω_{C,j}), giving the weighted system (w_j ∘ A_j) y_j = 0, where the ∘ operator denotes the Hadamard product and the weight ω_{c,j} is the output of a convolutional neural network.
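A minimal numpy sketch of this confidence-weighted linear triangulation, assuming the usual direct-linear-transform construction of A_j (two rows per camera, built from the 3×4 projection matrix P_c and the 2D point x_{c,j}); passing all-ones weights recovers the naive triangulation:

```python
import numpy as np

def triangulate_joint(projections, points_2d, weights=None) -> np.ndarray:
    """Weighted DLT triangulation of one joint.

    projections: list of C projection matrices, each 3x4
    points_2d:   list of C (x, y) image coordinates of the joint
    weights:     list of C per-view confidences (default: all ones)
    Returns the 3D joint position in global coordinates.
    """
    C = len(projections)
    w = np.ones(C) if weights is None else np.asarray(weights, dtype=float)
    rows = []
    for c in range(C):
        P = np.asarray(projections[c])
        x, y = points_2d[c]
        # Standard DLT rows; each view's pair of rows is scaled by its weight
        rows.append(w[c] * (x * P[2] - P[0]))
        rows.append(w[c] * (y * P[2] - P[1]))
    A = np.stack(rows)                      # shape (2C, 4)
    # Solve min ||A y|| s.t. ||y|| = 1: last right singular vector of A
    _, _, Vt = np.linalg.svd(A)
    y_h = Vt[-1]
    return y_h[:3] / y_h[3]                 # dehomogenize
```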
Overview of the triangulation method based on learned confidences: the input is a set of RGB images with known camera parameters. The 2D human recognition algorithm produces joint heat maps and per-camera joint confidences. By applying soft-argmax, the 2D position of each joint is inferred from its 2D heat map. The 2D positions are passed together with the confidences to the algebraic triangulation module, which outputs the triangulated 3D pose. All modules allow gradients to be back-propagated, so the model can be trained end to end.
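Putting the pieces together, an illustrative end-to-end forward pass over the hypothetical helpers sketched above (a backbone assumed to return per-joint heat maps and confidences) might read:

```python
import numpy as np

def forward_pass(images, projections, backbone, num_joints):
    """Illustrative multi-view forward pass: per-view heat maps ->
    soft-argmax 2D joints plus confidences -> weighted triangulation."""
    pose_3d = np.zeros((num_joints, 3))
    # backbone(image) is assumed to return (heatmaps, confidences):
    # heatmaps of shape (J, H, W) and one scalar confidence per joint
    per_view = [backbone(img) for img in images]
    for j in range(num_joints):
        pts = [soft_argmax(hm[j]) for hm, _ in per_view]
        conf = [w[j] for _, w in per_view]
        pose_3d[j] = triangulate_joint(projections, pts, conf)
    return pose_3d
```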
Regarding the application scenario of the invention: with the rapid development of modern network and computer technology, people are gradually moving into an information-based and intelligent era. Human pose recognition technology uses a computer to process, analyze, and understand an input video or image sequence, finally obtaining a high-level semantic interpretation and automatic judgment of the human pose. It has broad application and development prospects in intelligent building monitoring, moving-object analysis, virtual reality, perceptual interfaces, action recording for film and games, military target recognition, and other fields. Human pose is identified based on human skeleton features; the skeleton is a topological description of an object and is widely applied in fields such as route querying, path planning, and feature recognition. The main working object and content of the invention is to find a framework that is easy to compute.
Regarding the connection between skeleton tracking principles and our research: common skeleton tracking simply uses the picture information of a single camera and fits it directly with an ordinary CNN, so its performance depends entirely on the richness of the dataset. Because of problems such as occlusion of human limbs, we adopt multiple cameras to solve the problem of recognizing invisible limbs, and we improve the accuracy of the recognition result by using high-accuracy 2D pose estimation converted into a 3D pose through the triangulation transform.
The invention introduces two novel approaches to multi-view 3D human pose estimation based on a learnable triangulation transform, which achieve state-of-the-art performance on the Human3.6M dataset. The proposed solution greatly reduces the number of views required for high accuracy and produces smooth pose sequences on the CMU Panoptic dataset without any temporal processing, which can potentially ease the labeling of new datasets. We conjecture that the approach is robust to occlusion and partial views of a person because of its ability to reason across views when learning the person's pose. Another important advantage is that it explicitly takes the camera parameters as independent inputs. Finally, if the approximate location of the person is known, the volumetric triangulation also generalizes to monocular images, producing results approaching the state of the art.

Claims (2)

1. A 3D human motion recognition algorithm under multi-view conditions, characterized in that:
multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation; the joint 2D coordinates from all views are concatenated into one batch as input to a fully connected network trained to predict global 3D joint coordinates; the method that concatenates the 2D coordinates in the same coordinate system is called the multi-angle information aggregation method;
the multi-angle information aggregation method is a multi-angle human body coordinate system conversion method, and the specific form is algebraic triangular transformation; processing each joint j separately using a trigonometric transformation; the method is established on the triangle transformation method in the 2D coordinates, wherein the information of the human joint coordinates comes from heat maps of different angles in the action recognition frame; h c,j =h θ (I c ) j
To estimate 2D joint position information, a softmax layer on the spatial axis is first calculated:
next, the center of mass of each joint's 2D position heat map is computed as that joint's position estimate, an operation called soft-argmax;
an important feature of soft-argmax is that it does not take the index of the maximum feature, which facilitates gradient back-propagation through the heat map; the two-dimensional human recognition framework is pre-trained using a loss, the joint heat in the map is adjusted by multiplying the heat map by the inverse temperature parameter α, and soft-argmax outputs the most likely position from the beginning of training;
three-dimensional joint position information is inferred from the 2D joint position information x_{c,j} using a linear triangulation method, which reduces the search space for the 3D coordinates of joint j by solving an overdetermined system of equations on the homogeneous 3D coordinate vector y_j of joint j:
A_j y_j = 0;
where A_j is the matrix composed from the components of x_{c,j} and the camera projection matrices;
the linear triangulation method is as follows: assuming the joint coordinates of each view are independent of one another, they all contribute comparably to the triangulation; learnable weights are attached to the coefficient matrices under different angles:
w_j = (ω_{1,j}, ω_{2,j}, …, ω_{C,j}), giving the weighted system (w_j ∘ A_j) y_j = 0, where the ∘ operator denotes the Hadamard product and the weight ω_{c,j} is the output of a convolutional neural network; the input to the method is a set of RGB images with known camera parameters; the 2D human recognition algorithm generates a heat map of each joint and a confidence for each camera's joints; by applying soft-argmax, the 2D position of the joint is inferred from the 2D joint heat map, and the 2D positions and confidences are passed together to an algebraic triangulation module that outputs the triangulated 3D pose; all modules allow back-propagation of gradients, so the model can be trained end to end.
2. The 3D human motion recognition algorithm under multi-view conditions according to claim 1, characterized in that:
the deep convolutional neural network is a feedforward neural network which comprises convolutional calculation in mathematics and has a multi-layer deep structure, multidimensional data can be used as input of an input layer of the deep convolutional neural network, one-dimensional data or two-dimensional data are used as input to be transmitted to the input layer of the deep convolutional neural network, and a one-dimensional array is usually time sequence data; the two-dimensional array is mostly a gray scale map; the input layer of the convolutional neural network adopted by the invention receives the three-dimensional array of the RGB image;
the hidden layer of the deep convolutional neural network comprises a convolutional layer, a pooling layer and a full-connection layer 3 type structure; the convolution kernel in the convolution layer contains weight coefficients, the pooling layer does not contain weight coefficients, the function of the convolution layer is to perform feature extraction on input data, the convolution layer internally contains a plurality of convolution kernels, each element composing the convolution kernels corresponds to a weight coefficient and a deviation amount, and the convolution kernels are similar to neurons of a feedforward neural network; the algorithm of the convolution layer is as follows:
after feature extraction by the convolutional layer, the output feature map is passed to the pooling layer for feature selection and information filtering; the pooling layer contains a preset pooling function whose role is to replace the value at a single point of the feature map with a statistic of its neighboring region; the pooling layer selects pooling regions in the same way the convolution kernel scans the feature map, controlled by pooling size, stride, and padding; the general expression is the pooling formula given in the description;
the output layer upstream of the convolutional neural network is usually a fully-connected layer, and the structure and the working principle of the convolutional neural network are the same as those of the output layer of the traditional feedforward neural network.
CN202110280476.5A 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition Active CN114036969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110280476.5A CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110280476.5A CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Publications (2)

Publication Number Publication Date
CN114036969A CN114036969A (en) 2022-02-11
CN114036969B (en) 2023-07-25

Family

ID=80134245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110280476.5A Active CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Country Status (1)

Country Link
CN (1) CN114036969B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863556A (en) * 2022-04-13 2022-08-05 上海大学 Multi-neural-network fusion continuous action recognition method based on skeleton posture
CN116310217B (en) * 2023-03-15 2024-01-30 精创石溪科技(成都)有限公司 Method for dynamically evaluating muscles in human body movement based on three-dimensional digital image correlation method
CN116403288A (en) * 2023-04-28 2023-07-07 中南大学 Motion gesture recognition method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945282A (en) * 2017-12-05 2018-04-20 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) The synthesis of quick multi-view angle three-dimensional and methods of exhibiting and device based on confrontation network
CN110543581A (en) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) Multi-view three-dimensional model retrieval method based on non-local graph convolution network
CN111382300A (en) * 2020-02-11 2020-07-07 山东师范大学 Multi-view three-dimensional model retrieval method and system based on group-to-depth feature learning
CN111815757A (en) * 2019-06-29 2020-10-23 浙江大学山东工业技术研究院 Three-dimensional reconstruction method for large component based on image sequence
US10853970B1 (en) * 2019-03-22 2020-12-01 Bartec Corporation System for estimating a three dimensional pose of one or more persons in a scene

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101587962B1 (en) * 2010-12-22 2016-01-28 한국전자통신연구원 Motion capture apparatus and method
US9058663B2 (en) * 2012-04-11 2015-06-16 Disney Enterprises, Inc. Modeling human-human interactions for monocular 3D pose estimation
RU2014111793A (en) * 2014-03-27 2015-10-10 LSI Corporation Image processor with static hand pose recognition using triangulation and contour smoothing
CN106780569A (en) * 2016-11-18 2017-05-31 深圳市唯特视科技有限公司 A kind of human body attitude estimates behavior analysis method
US10824862B2 (en) * 2017-11-14 2020-11-03 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN108460338B (en) * 2018-02-02 2020-12-11 北京市商汤科技开发有限公司 Human body posture estimation method and apparatus, electronic device, storage medium, and program
CN108389227A (en) * 2018-03-01 2018-08-10 深圳市唯特视科技有限公司 A kind of dimensional posture method of estimation based on multiple view depth perceptron frame
CN109087329B (en) * 2018-07-27 2021-10-15 中山大学 Human body three-dimensional joint point estimation framework based on depth network and positioning method thereof
US11783443B2 (en) * 2019-01-22 2023-10-10 Fyusion, Inc. Extraction of standardized images from a single view or multi-view capture
EP3731185A1 (en) * 2019-04-26 2020-10-28 Tata Consultancy Services Limited Weakly supervised learning of 3d human poses from 2d poses
CA3046612A1 (en) * 2019-06-14 2020-12-14 Wrnch Inc. Method and system for monocular depth estimation of persons
US11263443B2 (en) * 2019-07-19 2022-03-01 Sri International Centimeter human skeleton pose estimation
CN110427877B (en) * 2019-08-01 2022-10-25 大连海事大学 Human body three-dimensional posture estimation method based on structural information
CN110598590A (en) * 2019-08-28 2019-12-20 清华大学 Close interaction human body posture estimation method and device based on multi-view camera
CN110766746B (en) * 2019-09-05 2022-09-06 南京理工大学 3D driver posture estimation method based on combined 2D-3D neural network
CN111523377A (en) * 2020-03-10 2020-08-11 浙江工业大学 Multi-task human body posture estimation and behavior recognition method
CN111583386B (en) * 2020-04-20 2022-07-05 清华大学 Multi-view human body posture reconstruction method based on label propagation algorithm
CN111738220B (en) * 2020-07-27 2023-09-15 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium
CN112329513A (en) * 2020-08-24 2021-02-05 苏州荷露斯科技有限公司 High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945282A (en) * 2017-12-05 2018-04-20 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) The synthesis of quick multi-view angle three-dimensional and methods of exhibiting and device based on confrontation network
US10853970B1 (en) * 2019-03-22 2020-12-01 Bartec Corporation System for estimating a three dimensional pose of one or more persons in a scene
CN111815757A (en) * 2019-06-29 2020-10-23 浙江大学山东工业技术研究院 Three-dimensional reconstruction method for large component based on image sequence
CN110543581A (en) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) Multi-view three-dimensional model retrieval method based on non-local graph convolution network
CN111382300A (en) * 2020-02-11 2020-07-07 山东师范大学 Multi-view three-dimensional model retrieval method and system based on group-to-depth feature learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ross A. Clark et al., "Three-dimensional cameras and skeleton pose tracking for physical function assessment: A review of uses, validity, current developments and Kinect alternatives", Gait & Posture, vol. 68, pp. 193-200 *

Also Published As

Publication number Publication date
CN114036969A (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN114036969B (en) 3D human body action recognition algorithm under multi-view condition
CN107204010B (en) A kind of monocular image depth estimation method and system
CN109800689B (en) Target tracking method based on space-time feature fusion learning
WO2017133009A1 (en) Method for positioning human joint using depth image of convolutional neural network
CN111968129A (en) Instant positioning and map construction system and method with semantic perception
Liu et al. Improved human action recognition approach based on two-stream convolutional neural network model
CN107871106A (en) Face detection method and device
CN107563494A (en) A kind of the first visual angle Fingertip Detection based on convolutional neural networks and thermal map
CN109190508A (en) A kind of multi-cam data fusion method based on space coordinates
CN110399809A (en) The face critical point detection method and device of multiple features fusion
CN110472542A (en) A kind of infrared image pedestrian detection method and detection system based on deep learning
CN113205595B (en) Construction method and application of 3D human body posture estimation model
Zhou et al. Learning to estimate 3d human pose from point cloud
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
CN104794737A (en) Depth-information-aided particle filter tracking method
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN114724185A (en) Light-weight multi-person posture tracking method
CN111680560A (en) Pedestrian re-identification method based on space-time characteristics
CN114689038A (en) Fruit detection positioning and orchard map construction method based on machine vision
Chen et al. Improving registration of augmented reality by incorporating DCNNS into visual SLAM
Yang et al. Human action recognition based on skeleton and convolutional neural network
Zhou et al. Mh pose: 3d human pose estimation based on high-quality heatmap
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant