CN114036969A

CN114036969A - 3D human body action recognition algorithm under multi-view condition

Info

Publication number: CN114036969A
Application number: CN202110280476.5A
Authority: CN
Inventors: 石昕; 邵慧杨; 翟庆庆
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2021-03-16
Filing date: 2021-03-16
Publication date: 2022-02-11
Anticipated expiration: 2041-03-16
Also published as: CN114036969B

Abstract

The invention discloses a 3D human body action recognition algorithm under the condition of multiple visual angles, which is divided into single-view 3D posture estimation and multi-view 3D posture estimation; with respect to single-view 3D pose estimation, it can be divided into two subcategories, the first category uses a high-quality 2D pose estimation engine followed by lifting the 2D coordinates to 3D separately through a deep neural network; the second category infers 3D coordinates directly from the image using convolutional neural networks; with respect to multi-view 3D pose estimation, aiming to obtain true annotations for monocular 3D body pose estimation, joint 2D coordinates in all views are concatenated into one batch as input to a fully connected network that is trained to predict global 3D joint coordinates. The invention has the advantages that: the 3D human body action recognition algorithm under the condition of multiple visual angles is provided, and the actions related to the human body are detected and recognized by adopting a computer vision recognition algorithm and are converted into data which can be understood by a user for display.

Description

3D human body action recognition algorithm under multi-view condition

Technical Field

The invention relates to the fields of computer vision identification, real-time data visualization and big data parallel processing, in particular to a 3D human body action identification algorithm under the condition of multiple visual angles.

Background

With the development and progress of the society, the role of the human behavior recognition technology in the society is more and more important, and the human behavior recognition technology has wide application scenes. The three-dimensional human body model reconstruction and action recognition are a hot spot in the current research in the field of computer vision, and the aim is to extract and analyze actions in a video through various image processing and recognition classification technologies, reasonably construct a complete three-dimensional human body model and judge actions of people in the video, so that useful information is obtained, and the method has very wide application. The human behavior recognition technology can be applied to the fields of video monitoring (environments such as schools, canteens and companies), man-machine interaction (scenes such as railway stations), automatic explanation of football or basketball sports and the like.

Furthermore, human gesture recognition is a very important area in the field of computer vision. Many different directions can be extended according to the final target and the different established hypothesis rules;

(1) two-dimensional or three-dimensional motion of a human body is predicted.

(2) Human motion is predicted from a single sequence or frame in the video.

(3) The human body motion is predicted from a single or multiple cameras.

In the invention, only the recognition of human body actions in a three-dimensional space within a fixed frame range is focused on under the condition of multiple cameras. From a wider perspective, the motion detection framework provided by the invention can be used as a unified recognition framework to recognize human body motions in 2D and 3D simultaneously.

The 3D human body action recognition is a basic problem in computer vision, and is usually applied to sports action recognition, computer-aided live broadcast, man-machine interaction, special effect making and the like. Most conventional algorithms today focus on single view 3D body motion prediction. Although many related tasks have been done recently by scholars, recognition of human motion under multiple camera conditions is far from being solved. Therefore, the invention provides a 3D human body motion recognition algorithm under the condition of multiple visual angles.

The human body action recognition under the condition of multiple visual angles has high research value for two reasons: first, in an outdoor complex scene, multi-view human motion recognition is not arguably the best motion recognition mode. This is because competing technologies such as marker-based motion capture and visual inertial methods have certain limitations, such as the inability to capture rich gesture representations (e.g., estimating hand and facial and limb gestures), and various other limitations. A disadvantage of the previous approach is that this work uses multi-view triangulation to construct data sets that rely on an excessive, almost impractical number of views to obtain 3D realistic motion of sufficient quality. This makes the collection of new data sets for 3D pose recognition very challenging and there is a strong need to reduce the number of views required for accurate triangulation. Secondly, in some cases, the algorithm can be directly used for tracking the human body posture in real time so as to achieve the final aim of recognizing the action. This is because the configuration of multiple cameras is becoming increasingly available in the context of various applications such as sports or computer-assisted living. In this case, the accuracy of modern multiview methods is comparable to the developed monocular method. Therefore, improving the accuracy of multi-view pose estimation from few views is a significant challenge in direct practical applications.

Disclosure of Invention

The invention aims to provide a 3D human body action recognition algorithm under the condition of multiple visual angles, which detects and recognizes actions related to human bodies by adopting a computer vision recognition algorithm and converts the actions into data which can be understood by users for display.

The technical scheme adopted by the invention is as follows: A3D human body action recognition algorithm under the condition of multiple visual angles is characterized in that: and after 2D posture estimation under multiple views, performing 3D posture estimation by adopting a multi-angle information aggregation method.

With respect to single-view 3D pose estimation, divided into two subcategories, the first category uses a high-quality 2D pose estimation engine followed by lifting of the 2D coordinates to 3D separately by a deep neural network (fully connected, convolved or recursive); the second category infers 3D coordinates directly from the image using a deep convolutional neural network; the 3D human body action recognition algorithm uses a first method as a main frame and uses a deep convolutional neural network as a high-quality 2D attitude estimation engine;

with respect to multi-view 3D pose estimation, aiming to obtain true annotations for monocular 3D body pose estimation, joint 2D coordinates in all views are concatenated into one batch as input to a fully connected network that is trained to predict global 3D joint coordinates; the method of concatenating 2D coordinates into the same coordinate system is called a multi-angle information aggregation method.

The deep convolutional neural network is a feedforward neural network which comprises convolution calculation in mathematics and has a multilayer deep structure, an input layer of the deep convolutional neural network can process multidimensional data, and an input layer of the one-dimensional convolutional neural network receives one-dimensional or two-dimensional arrays or even three-dimensional data, wherein the one-dimensional arrays are usually time sequence data; most of the two-dimensional arrays are gray level images; an input layer of the two-dimensional convolution neural network receives a three-dimensional array of RGB images;

the hidden layer of the deep convolutional neural network comprises a convolutional layer, a pooling layer and a full-connection layer 3 type structure; convolution kernels in the convolution layers comprise weight coefficients, the pooling layer does not comprise the weight coefficients, the convolution layers have the function of carrying out feature extraction on input data and comprise a plurality of convolution kernels, each element forming the convolution kernels corresponds to one weight coefficient and one deviation value and is similar to a neuron of a feedforward neural network; the convolution layer algorithm is as follows:

after the feature extraction is carried out on the convolutional layer, the output feature graph is transmitted to the pooling layer for feature selection and information filtering; the pooling layer comprises a preset pooling function, and the function of the pooling layer is to replace the result of a single point in the feature map with the feature map statistic of an adjacent area; the step of selecting a pooling area by the pooling layer is the same as the step of scanning the characteristic diagram by the convolution kernel, and the pooling size, the step length and the filling are controlled; it is generally represented in the form:

the output layer in the convolutional neural network is usually a fully-connected layer upstream, and the structure and the working principle of the fully-connected layer are the same as those of the output layer in the traditional feedforward neural network.

The multi-angle information aggregation method is a multi-view human body coordinate system conversion method, and the specific form is algebraic trigonometric transformation; each joint j is processed separately using a trigonometric transformation; the method is established on a trigonometric transformation method in 2D coordinates, wherein the information of the human joint coordinates is from heat maps of different angles in a motion recognition frame; h_c,j＝h_θ(I_c)_jTo estimate the 2D joint position information, the softmax layer on the spatial axis is first computed:

secondly, calculating the central position of the 2D position information heat map of each node as the position estimation of the node, wherein the central position is called soft-argmax;

an important feature of Soft-argmax is that the index of the maximum feature is not obtained, which is convenient for the heat map H_cCarrying out gradient back propagation; the two-dimensional human body recognition framework uses Loss for pre-training, joint heat in the graph is adjusted by multiplying the heat graph by a reverse heat parameter alpha, and the maximum possible position is output at the beginning stage of the training process of soft-argmax;

from 2D joint position information x_c,jInferring three-dimensional joint position information using a linear trigonometric transformation method that reduces the number of pairs of joints y_jThe search amount of the 3D coordinates of (2) solves the overdetermined equation set on the homogeneous 3D coordinate vector of the joint y:

A_jy_j＝0

wherein

Is x_c,jThe projection matrix of (2).

The linear triangular transformation method comprises the following steps: the joint coordinates of each view are assumed to be independent of each other and therefore all contribute comparably to the triangular variation; learnable weights w of corresponding coefficient matrices at different angles_c；

w_j＝(ω_1,j,ω_2,j,…,ω_C,j) (ii) a Operator represents Hadamard product, weight omega_c,jIs a convolutional neural network

The output result is:

the input to the method is a set of RGB images with known camera parameters; the 2D human recognition algorithm generates a heat map of the joints and confidence of the camera joints, from which 2D positions of the joints can be deduced by applying soft-argmax, the 2D positions and confidence together being passed to an algebraic trigonometric module which outputs a triangulated 3D pose, all modules allowing back-propagation gradients, so that the model can be trained end-to-end.

The advantages of the first category of single-view 3D pose estimation are: simple, fast, can train on motion capture data (with skeleton/view enhancement), and can switch 2D skeletons after training;

among the advantages of multi-view 3D pose estimation are: this approach may efficiently use information from different views and may train on motion capture data.

In fact, few studies currently use volumetric pose representations in multi-view settings. In particular, the non-projection into the volume and subsequent non-learnable aggregation of 2D keypoint probability heatmaps (obtained from pre-trained 2D keypoint detectors) are utilized. Our work differs in two ways. First, we process the information within the volume in a learnable way. Second, we train the network end-to-end, tuning the 2D backbone and alleviating the need for 2D heatmap interpretability. This allows several self-consistent posture hypotheses to be transferred from the 2D detector to the volume aggregation stage (not possible with previous designs).

There have also been studies using a multi-stage approach to extrapolate 3D poses from the coordinates of the 2D joints prior to the external 3D poses. In the first stage, the images of all views are passed through a deep convolutional neural network to obtain a heat map of the 2D joint. The locations of maxima in the heatmap are used together to infer the 3D pose by optimizing the potential coordinates in the 3D pose prior space. At each subsequent stage, the 3D pose is re-projected to all camera views and fused with the prediction from the previous layer (via the convolutional network). Next, the 3D pose is re-estimated based on the position of the heat map maxima, and the process is repeated. Such a procedure allows for correcting predictions of 2D joint heat maps through indirect global reasoning on human body posture. In contrast to our approach, there are studies that do not have a gradient flow from 3D prediction to 2D heatmaps, and therefore there is no direct signal to correct the prediction of 3D coordinates.

A3D human body action recognition algorithm under the condition of multiple visual angles is used for recognizing human body actions in a three-dimensional space within a fixed frame range under the condition of multiple cameras, an action detection frame can be used as a uniform recognition frame for recognizing human body actions in 2D and 3D at the same time, and 2D action recognition can be rapidly expanded to 3D action recognition through the frame. We use this framework to add human bones, joints, and various constraints in three-dimensional space, which are obtained from the picture.

Regarding the action recognition framework, suppose that the C cameras are synchronized to a unified global coordinate system by using a projection matrix, so that human body data in a scene can be conveniently acquired; our goal is to estimate the position y of the human body's three-dimensional joint point at time t for joint J e (1 …, J) in the global coordinate system_j,t. For each frame, we use the ready 2D bodyA detection algorithm or bounding box in the data set itself to crop the image. Then we use the cropped image I_cPassed as training data to the deep convolutional neural network framework.

The framework for the deep convolutional neural network is represented by ResNet-152 (parameter weight is theta, network output is g)_θ) A series of transposed convolution layers outputting intermediate heat maps (output f)_θ) And a convolutional neural network (output h) that converts the intermediate heat map into an interpretable joint heat map using a kernel of 1 × 1 size_θThe output dimensions and number of joints are the same).

The invention has the advantages that: the 3D human body action recognition algorithm under the condition of multiple visual angles is provided, and the actions related to the human body are detected and recognized by adopting a computer vision recognition algorithm and are converted into data which can be understood by a user for display.

Drawings

FIG. 1 is a diagram illustrating a method for recognizing human body motion from multiple angles according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a deep convolutional neural network according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a multi-angle information aggregation method according to an embodiment of the present invention.

Detailed Description

The invention provides a 3D human body action recognition algorithm under the condition of multiple visual angles, which is characterized in that: the human body action recognition algorithm is divided into single-view 3D posture estimation and multi-view 3D posture estimation;

with respect to single-view 3D pose estimation, it can be divided into two subcategories, the first category uses a high-quality 2D pose estimation engine followed by lifting the 2D coordinates to 3D separately by a deep neural network (fully connected, convolved or recursive); the second category infers 3D coordinates directly from the image using a deep convolutional neural network; the present invention uses a first type of method as a main frame, using a deep convolutional neural network as a high quality 2D pose estimation engine.

Deep convolutional neural network

The deep convolutional neural network is a feedforward neural network which comprises convolution calculation in mathematics and has a multilayer deep structure, and is one of representative algorithms of deep learning. The deep convolutional neural network has a characterization learning capability, and can perform translation invariant classification on input information according to a hierarchical structure thereof, so that the deep convolutional neural network is also called a 'translation invariant artificial neural network'. In recent years, convolutional neural networks have been shown to be highly distinctive over individual image recognition tournaments. Therefore, the present invention uses a deep convolutional neural network as a 2D pose estimation engine, and a structural diagram of the convolutional neural network is shown in fig. 2.

The input layer of the deep convolutional neural network can process multidimensional data, and the input layer of the one-dimensional convolutional neural network receives one-dimensional or two-dimensional arrays or even three-dimensional data, wherein the one-dimensional arrays may be time sequence data; most of the two-dimensional arrays are gray level images; the input layer of the two-dimensional convolutional neural network receives a three-dimensional array of RGB images.

The hidden layers of the deep convolutional neural network comprise convolutional layers, pooling layers and full-link layer 3 common structures. In a common architecture, convolutional and pooling layers are characteristic of deep convolutional neural networks. The convolution kernels in the convolutional layers contain weight coefficients, while the pooling layers do not. The function of the convolutional layer is to extract the characteristics of input data, the convolutional layer internally comprises a plurality of convolutional kernels, and each element forming the convolutional kernels corresponds to a weight coefficient and a deviation amount, and is similar to a neuron of a feedforward neural network. The convolution layer algorithm is as follows:

after the feature extraction is performed on the convolutional layer, the output feature map is transmitted to the pooling layer for feature selection and information filtering. The pooling layer contains a pre-set pooling function whose function is to replace the result of a single point in the feature map with the feature map statistics of its neighboring regions. The step of selecting the pooling area by the pooling layer is the same as the step of scanning the characteristic diagram by the convolution kernel, and the pooling size, the step length and the filling are controlled. It is generally represented in the form:

the convolutional neural network is usually a fully-connected layer upstream of the output layer, and thus has the same structure and operation principle as the output layer in the conventional feedforward neural network. For the human body action recognition problem, the output layer is the classification labels of different actions, and the concrete expression form is shown in fig. 2.

With respect to multi-view 3D pose estimation, aiming to obtain true annotations for monocular 3D body pose estimation, joint 2D coordinates in all views are concatenated into one batch as input to a fully connected network that is trained to predict global 3D joint coordinates. The method of concatenating 2D coordinates into the same coordinate system is called a multi-angle information aggregation method. The invention provides a multi-angle information aggregation method, which is a novel multi-view human body coordinate system conversion method.

Multi-angle information aggregation method

The specific form of the multi-angle information aggregation method is algebraic triangle transformation. We can use the trigonometric transformation to treat each joint j separately. The method is built on a trigonometric transformation method in 2D coordinates, wherein the information of the human joint coordinates is from heat maps of different angles in a motion recognition framework. H_c,j＝h_θ(I_c)_jTo estimate 2D joint position information, we first compute the softmax layer on the spatial axis:

the parameter α will be discussed later, and then we calculate the central position of the 2D position information heat map of each node as the position estimate (so called soft-argmax) of the node.

An important feature of Soft-argmax is thatThe index of the maximum feature is not obtained, thus facilitating the heat map H_cGradient back propagation is performed. Since the two-dimensional body recognition framework is pre-trained using Loss. We adjust the joint heat in the map by multiplying the heat map by the inverse heat parameter α, so the start phase of the training process for soft-argmax outputs the maximum possible position.

To derive 2D joint position information x_c,jTo infer three-dimensional joint position information, we use a linear trigonometric transformation method. The method reduces the pair joint y_jThe search volume of 3D coordinates of the joint y, thereby solving the overdetermined system of equations on the homogeneous 3D coordinate vector of the joint y:

A_jy_j＝0

wherein

Is x_c,jThe projection matrix of (2).

The naive trigonometric transformation algorithm assumes that the joint coordinates of each view are independent of each other and therefore all contribute comparably to the trigonometric variation. However, on some views, the 2D position of the joint cannot be reliably estimated (e.g., due to joint occlusion), resulting in an unsatisfactory final triangulation result. This greatly exacerbates the tendency of methods that optimize algebraic reprojection errors to tend to be unbalanced in different directions. This problem can be solved by using RANSAC together with the Huber loss (for scoring the reprojection error corresponding to the internal error). However, this has a relative disadvantage. For example, using RANSAC may completely shut off the gradient flow to the exclusion of the camera. To solve the above-mentioned general knowledge, we add learnable weights w of the corresponding coefficient matrices at different angles_c。

w_j＝(ω_1,j,ω_2,j,…,ω_C,j) (ii) a The operator represents the Hadamard product. Weight ω_c,jIs a convolutional neural networkCollaterals of kidney meridian

The output result.

An overview of the method based on trigonometric transformation with learning confidence. The input to the method is a set of RGB images with known camera parameters. The 2D human recognition algorithm produces a heat map of the joints and confidence levels for the camera joints. By applying soft-argmax, the 2D position of the joint can be inferred from the 2D joint heat map. The 2D position and confidence are passed together to an algebraic triangulation module, which outputs the triangulated 3D pose. All modules allow the gradient to be propagated backwards so the model can be trained end-to-end.

With the rapid development of modern network technology and computer technology, people gradually move to the information and intelligent era. The human body posture recognition technology is a process of processing, analyzing and understanding an input video or image sequence by using a computer to finally obtain a high-level semantic interpretation and automatic judgment result of the human body posture. The human body posture recognition technology has wide application and development prospects in multiple fields of intelligent building monitoring, moving object analysis, virtual reality, perception interfaces, film and game action recording, military target recognition and the like. The human body posture is recognized based on the human body skeleton characteristics, the skeleton is a topological structure description mode of an object, and the human body posture recognition method is widely applied to the fields of road inquiry, path planning, characteristic recognition and the like. The main work target and work content of the invention are to find a framework which is simple and convenient to calculate. With the rapid development of modern network technology and computer technology, people gradually move to the information and intelligence era. The human body posture recognition technology is a process of processing, analyzing and understanding an input video or image sequence by using a computer to finally obtain a high-level semantic interpretation and automatic judgment result of the human body posture. The human body posture recognition technology has wide application and development prospects in multiple fields of intelligent building monitoring, moving object analysis, virtual reality, perception interfaces, military target recognition and the like.

Regarding the connection between the skeletal tracking principle and our research, the common skeletal tracking principle simply uses the picture information of a single camera, and adopts a common CNN network to directly fit the picture information, and the effect completely depends on the richness of the data set. Because human limbs shelter from the scheduling problem, we have adopted many cameras to solve unseen limbs recognition problem, have improved the accuracy of recognition result through adopting high-accuracy 2D posture estimation and having converted into 3D's posture through triangle transform.

The present invention introduces two novel methods for multi-view 3D human pose estimation based on learnable trigonometric transformations, which achieve the most advanced performance on the human3.6m dataset. The proposed solution greatly reduces the number of views required to obtain high accuracy and generates a smooth gesture sequence on the CMU Panoptic dataset without any time processing, which can potentially improve the labeling problem of the target dataset. We speculate that this method is robust to occlusion and partial views of a person, since it has perspective capabilities in learning the person's pose. Another important advantage of this method is that it explicitly takes the camera parameters as an independent input. Finally, if the approximate location of the human is known, volume triangulation can also be generalized to monocular images, yielding results close to the latest techniques.

Claims

1. A3D human body action recognition algorithm under the condition of multiple visual angles is characterized in that: the human body action recognition algorithm is divided into single-view 3D posture estimation and multi-view 3D posture estimation:

with respect to single-view 3D pose estimation, divided into two subcategories, the first category uses a high-quality 2D pose estimation engine followed by lifting the 2D coordinates to 3D separately through a fully connected, convolved, or recursive deep neural network; the second category infers 3D coordinates directly from the image using a deep convolutional neural network; the 3D human body action recognition algorithm uses a first method as a main frame and uses a deep convolutional neural network as a high-quality 2D attitude estimation engine;

2. The 3D human motion recognition algorithm under the multi-view condition according to claim 1, characterized in that:

the deep convolutional neural network is a feedforward neural network which comprises convolution calculation in mathematics and has a multilayer depth structure, multidimensional data can be used as input of an input layer of the deep convolutional neural network, one-dimensional data or two-dimensional data is used as input and is transmitted to the input layer of the deep convolutional neural network, and a one-dimensional array is usually time sequence data; most of the two-dimensional arrays are gray level images; the input layer of the convolutional neural network is used for receiving the three-dimensional array of the RGB image;

3. The 3D human motion recognition algorithm under the multi-view condition according to claim 1, characterized in that:

an important feature of Soft-argmax is that the index of the maximum feature is not obtained, which is convenient for the heat map H_cCarrying out gradient back propagation; the two-dimensional human body recognition framework uses Loss for pre-training, and joint heat in the graph is adjusted by multiplying the heat graph and the inverse heat parameter alphaIn degrees, the maximum possible position is output in the initial stage of the training process of soft-argmax;

A_jy_j＝0

wherein

Is x_c,jThe projection matrix of (2).

4. The 3D human motion recognition algorithm under the multi-view condition according to claim 1, characterized in that:

w_j＝(ω_1,j,ω_2,j,…,ω_C,j)；

The operator represents the Hadamard product, weight ω_c,jIs a convolutional neural network

The output result is:

the input to the method is a set of RGB images with known camera parameters; 2D human recognition algorithms generate heat maps of joints and confidence levels of camera jointsBy applying soft-argmax, the 2D position of the joint can be inferred from the 2D joint heat map, and the 2D position together with the confidence is passed to an algebraic triangulation module which outputs the triangulated 3D pose, all modules allowing back-propagation gradients, so the model can be trained end-to-end.