CN114036969A - 3D human body action recognition algorithm under multi-view condition - Google Patents

3D human body action recognition algorithm under multi-view condition

Publication number
CN114036969A
Authority
CN
China
Prior art keywords
neural network
layer
joint
coordinates
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110280476.5A
Other languages
Chinese (zh)
Other versions
CN114036969B (en)
Inventor
石昕
邵慧杨
翟庆庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202110280476.5A
Publication of CN114036969A
Application granted
Publication of CN114036969B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D human body action recognition algorithm under multi-view conditions. The algorithm is divided into single-view 3D pose estimation and multi-view 3D pose estimation. Single-view 3D pose estimation falls into two subcategories: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a separate deep neural network; the second infers 3D coordinates directly from the image using a convolutional neural network. Multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation: the 2D joint coordinates from all views are concatenated into one batch and fed to a fully connected network that is trained to predict global 3D joint coordinates. The advantage of the invention is that it provides a 3D human body action recognition algorithm under multi-view conditions, which detects and recognizes human body actions with a computer vision recognition algorithm and converts them into data that a user can understand and view.

Description

3D human body action recognition algorithm under multi-view condition
Technical Field
The invention relates to the fields of computer vision recognition, real-time data visualization, and parallel processing of big data, and in particular to a 3D human body action recognition algorithm under multi-view conditions.
Background
With the development of society, human behavior recognition technology plays an increasingly important role and has a wide range of application scenarios. Three-dimensional human body model reconstruction and action recognition are a current research hot spot in computer vision. The aim is to extract and analyze the actions in a video through image processing and recognition/classification techniques, construct a complete three-dimensional human body model, and determine the actions of the people in the video, thereby obtaining useful information. Human behavior recognition can be applied to video surveillance (in environments such as schools, canteens, and companies), human-computer interaction (in scenes such as railway stations), automatic commentary of football or basketball games, and similar fields.
Furthermore, human pose recognition is a very important area of computer vision. Depending on the final goal and the assumptions made, the problem can be posed in several different ways:
(1) Predicting human motion in two or three dimensions.
(2) Predicting human motion from a single frame or from a sequence of video frames.
(3) Predicting human motion from a single camera or from multiple cameras.
The invention focuses only on recognizing human body actions in three-dimensional space, within a fixed frame range, under multi-camera conditions. More broadly, the proposed action detection framework can serve as a unified framework for recognizing human actions in both 2D and 3D.
3D human body action recognition is a fundamental problem in computer vision, with applications in sports action recognition, computer-assisted live broadcasting, human-computer interaction, special effects production, and so on. Most conventional algorithms focus on single-view 3D human motion prediction. Although researchers have recently done much related work, recognition of human motion under multi-camera conditions is far from solved. The invention therefore provides a 3D human body action recognition algorithm under multi-view conditions.
Human body action recognition under multi-view conditions has high research value for two reasons. First, in complex outdoor scenes, multi-view human motion recognition is arguably the best way to recognize motion. Competing technologies, such as marker-based motion capture and visual-inertial methods, have certain limitations, for example the inability to capture rich pose representations (such as jointly estimating hand, face, and limb poses). A drawback of previous work is that it builds data sets by multi-view triangulation and relies on an excessive, almost impractical number of views to obtain 3D ground-truth motion of sufficient quality. This makes collecting new data sets for 3D pose recognition very challenging, and there is a strong need to reduce the number of views required for accurate triangulation. Second, in some cases the algorithm can be used directly to track human pose in real time, with action recognition as the final goal, because multi-camera setups are becoming increasingly available in applications such as sports or computer-assisted living. In these settings, the accuracy of modern multi-view methods is comparable to that of well-developed monocular methods. Improving the accuracy of multi-view pose estimation from few views is therefore an important challenge with direct practical applications.
Disclosure of Invention
The invention aims to provide a 3D human body action recognition algorithm under the condition of multiple visual angles, which detects and recognizes actions related to human bodies by adopting a computer vision recognition algorithm and converts the actions into data which can be understood by users for display.
The technical scheme adopted by the invention is as follows: a 3D human body action recognition algorithm under multi-view conditions, characterized in that, after 2D pose estimation in multiple views, 3D pose estimation is performed using a multi-angle information aggregation method.
Single-view 3D pose estimation is divided into two subcategories: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a separate deep neural network (fully connected, convolutional, or recurrent); the second infers 3D coordinates directly from the image using a deep convolutional neural network. The 3D human body action recognition algorithm uses the first category as its main framework and uses a deep convolutional neural network as the high-quality 2D pose estimation engine;
multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation: the 2D joint coordinates from all views are concatenated into one batch and fed to a fully connected network that is trained to predict global 3D joint coordinates. The method of bringing the 2D coordinates into the same coordinate system is called the multi-angle information aggregation method.
The deep convolutional neural network is a feedforward neural network that involves convolution computations and has a deep multilayer structure. Its input layer can process multidimensional data: the input layer of a one-dimensional convolutional neural network receives one-dimensional or two-dimensional arrays, or even three-dimensional data, where the one-dimensional arrays are usually time-series data and most two-dimensional arrays are grayscale images; the input layer of a two-dimensional convolutional neural network receives the three-dimensional array of an RGB image;
the hidden layers of the deep convolutional neural network comprise three types of structure: convolutional layers, pooling layers, and fully connected layers. The convolution kernels in convolutional layers contain weight coefficients, while pooling layers contain none. The function of a convolutional layer is to extract features from the input data; it contains several convolution kernels, and each element of a kernel has a weight coefficient and a bias, similar to a neuron of a feedforward neural network. The convolutional layer is computed as follows:
Figure BDA0002978076000000031
After the convolutional layer extracts features, the output feature map is passed to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function that replaces the value at a single point of the feature map with a statistic of its neighboring region. The pooling layer selects pooling regions in the same way that a convolution kernel scans the feature map, controlled by the pooling size, stride, and padding. Pooling is generally expressed in the form:
Figure BDA0002978076000000032
In a convolutional neural network, the layer upstream of the output layer is usually a fully connected layer, so the output layer has the same structure and working principle as the output layer of a conventional feedforward neural network.
The multi-angle information aggregation method is a multi-view human body coordinate system conversion method whose specific form is algebraic triangulation. Each joint j is triangulated separately. The method builds on triangulation of 2D coordinates, where the human joint coordinates come from the heat maps of the different views in the action recognition framework: $H_{c,j} = h^\theta(I_c)_j$. To estimate the 2D joint positions, a softmax over the spatial axes is first computed:
Figure BDA0002978076000000033
Next, the center of mass of each joint's 2D heat map is calculated as the position estimate of that joint, an operation called soft-argmax:
Figure BDA0002978076000000034
An important feature of soft-argmax is that it does not take the index of the maximum, which allows gradients to be back-propagated to the heat map $H_c$. Because the 2D human recognition framework is pre-trained with a separate loss, the heat maps are multiplied by an inverse temperature parameter α, so that at the start of training soft-argmax outputs a position close to the heat-map maximum.
To infer the 3D joint positions from the 2D joint positions $x_{c,j}$, a linear algebraic triangulation method is used. The method reduces the search for the 3D coordinates of joint $y_j$ to solving an overdetermined system of equations on the homogeneous 3D coordinate vector of the joint:
$A_j y_j = 0$
where
Figure BDA0002978076000000041
is the matrix built from the camera projection matrices and $x_{c,j}$.
The plain linear triangulation method assumes that the joint coordinates from each view are independent of one another and therefore all contribute comparably to the triangulation. The invention instead adds learnable weights $w_c$ to the rows of the coefficient matrix that correspond to the different views:
Figure BDA0002978076000000042
where $w_j = (\omega_{1,j}, \omega_{2,j}, \dots, \omega_{C,j})$ and the operator $\circ$ denotes the Hadamard product. The weight $\omega_{c,j}$ is the output of the convolutional neural network
Figure BDA0002978076000000043
given by:
Figure BDA0002978076000000044
The input to the method is a set of RGB images with known camera parameters. The 2D human recognition algorithm generates joint heat maps and per-camera joint confidences; the 2D joint positions are inferred from the heat maps by applying soft-argmax. The 2D positions and confidences are passed together to an algebraic triangulation module, which outputs the triangulated 3D pose. All modules allow gradients to be back-propagated, so the model can be trained end-to-end.
The advantages of the first category of single-view 3D pose estimation are that it is simple and fast, can be trained on motion capture data (with skeleton/view augmentation), and allows the 2D backbone to be switched after training.
The advantages of multi-view 3D pose estimation are that it can use information from different views efficiently and can be trained on motion capture data.
In fact, few studies currently use volumetric pose representations in multi-view settings. In particular, one earlier approach unprojects 2D keypoint probability heat maps (obtained from a pre-trained 2D keypoint detector) into a volume and then aggregates them in a non-learnable way. Our work differs in two ways. First, we process the information within the volume in a learnable way. Second, we train the network end-to-end, which fine-tunes the 2D backbone and removes the need for the 2D heat maps to be interpretable. This allows several self-consistent pose hypotheses to be passed from the 2D detector to the volumetric aggregation stage, which was not possible with previous designs.
Other studies use a multi-stage approach that infers the 3D pose from the 2D joint coordinates with the help of an external 3D pose prior. In the first stage, the images from all views are passed through a deep convolutional neural network to obtain 2D joint heat maps. The locations of the heat-map maxima are used jointly to infer the 3D pose by optimizing latent coordinates in the 3D pose prior space. At each subsequent stage, the 3D pose is re-projected into all camera views and fused (via a convolutional network) with the predictions from the previous stage. The 3D pose is then re-estimated from the positions of the heat-map maxima, and the process is repeated. Such a procedure corrects the predicted 2D joint heat maps through indirect global reasoning about the human pose. In contrast to our approach, these studies have no gradient flow from the 3D prediction to the 2D heat maps, and therefore no direct signal to correct the prediction of the 3D coordinates.
The 3D human body action recognition algorithm under multi-view conditions recognizes human body actions in three-dimensional space, within a fixed frame range, under multi-camera conditions. The action detection framework can serve as a unified framework for recognizing human actions in both 2D and 3D, so 2D action recognition can be quickly extended to 3D. We use this framework to add human bones, joints, and various constraints, all obtained from the images, in three-dimensional space.
Regarding the action recognition framework, suppose that C cameras with known projection matrices are synchronized in a unified global coordinate system, so that human body data in the scene can be acquired conveniently. Our goal is to estimate the three-dimensional position $y_{j,t}$ of each human joint $j \in (1, \dots, J)$ in the global coordinate system at time t. For each frame, we crop the images using an off-the-shelf 2D human detection algorithm or the bounding boxes provided with the data set. The cropped images $I_c$ are then passed as input data to the deep convolutional neural network framework.
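As an illustration of the camera setup described above, the following minimal sketch projects a 3D point from the global coordinate system into one view using a 3 × 4 projection matrix. The function name `project`, the intrinsic matrix `K`, and the example camera pose are assumptions made for illustration; the text only states that each camera has a known projection matrix.

```python
import numpy as np

def project(P, y):
    """Project a 3D point y (global coordinates) with a 3x4 projection matrix P."""
    y_h = np.append(y, 1.0)      # homogeneous 3D coordinates
    u = P @ y_h                  # homogeneous 2D image coordinates
    return u[:2] / u[2]          # pixel coordinates after perspective division

# Hypothetical camera: focal length 1000 px, principal point (500, 500),
# no rotation, global origin 5 units in front of the camera.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])
P = K @ Rt

x = project(P, np.array([0.1, -0.2, 0.0]))
```

With C such matrices, every camera observes the same global point at a different pixel, which is the information the later triangulation step relies on.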
The deep convolutional neural network framework consists of a ResNet-152 network (with weights θ and output $g^\theta$), followed by a series of transposed convolutional layers that output intermediate heat maps (output $f^\theta$), and a convolutional neural network with a kernel of size 1 × 1 that converts the intermediate heat maps into interpretable joint heat maps (output $h^\theta$; the number of output channels equals the number of joints).
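The final 1 × 1 convolution acts as a per-pixel linear map from intermediate channels to joint channels. The sketch below shows only that last step in NumPy; the sizes (8 intermediate channels, 17 joints, 64 × 64 maps) and the random weights are hypothetical stand-ins for trained values, and the ResNet-152 and transposed-convolution stages are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
K_ch, J, H, W = 8, 17, 64, 64                   # hypothetical sizes

f_out = rng.standard_normal((K_ch, H, W))       # intermediate heat maps
weight = rng.standard_normal((J, K_ch))         # 1x1 kernel: one row per joint
bias = np.zeros(J)

# A 1x1 convolution mixes channels independently at every pixel.
h_out = np.einsum("jk,khw->jhw", weight, f_out) + bias[:, None, None]
```

The output has one heat map per joint, which matches the statement that the output dimension equals the number of joints.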
The invention has the advantages that: the 3D human body action recognition algorithm under the condition of multiple visual angles is provided, and the actions related to the human body are detected and recognized by adopting a computer vision recognition algorithm and are converted into data which can be understood by a user for display.
Drawings
FIG. 1 is a diagram illustrating a method for recognizing human body motion from multiple angles according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a multi-angle information aggregation method according to an embodiment of the present invention.
Detailed Description
The invention provides a 3D human body action recognition algorithm under the condition of multiple visual angles, which is characterized in that: the human body action recognition algorithm is divided into single-view 3D posture estimation and multi-view 3D posture estimation;
Single-view 3D pose estimation can be divided into two subcategories: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a separate deep neural network (fully connected, convolutional, or recurrent); the second infers 3D coordinates directly from the image using a deep convolutional neural network. The present invention uses the first category as its main framework and uses a deep convolutional neural network as the high-quality 2D pose estimation engine.
Deep convolutional neural network
The deep convolutional neural network is a feedforward neural network that involves convolution computations and has a deep multilayer structure; it is one of the representative algorithms of deep learning. It has representation learning capability and can classify its input in a translation-invariant way according to its hierarchical structure, so it is also called a "translation-invariant artificial neural network". In recent years, convolutional neural networks have performed outstandingly in many image recognition competitions. The invention therefore uses a deep convolutional neural network as the 2D pose estimation engine; its structure is shown in FIG. 2.
The input layer of a deep convolutional neural network can process multidimensional data. The input layer of a one-dimensional convolutional neural network receives one-dimensional or two-dimensional arrays, or even three-dimensional data; the one-dimensional arrays may be time-series data, and most two-dimensional arrays are grayscale images. The input layer of a two-dimensional convolutional neural network receives the three-dimensional array of an RGB image.
The hidden layers of the deep convolutional neural network comprise three common types of structure: convolutional layers, pooling layers, and fully connected layers. Among these, convolutional and pooling layers are specific to convolutional neural networks. The convolution kernels in convolutional layers contain weight coefficients, while pooling layers contain none. The function of a convolutional layer is to extract features from the input data. It contains several convolution kernels, and each element of a kernel has a weight coefficient and a bias, similar to a neuron of a feedforward neural network. The convolutional layer is computed as follows:
Figure BDA0002978076000000061
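The formula image is not reproduced in this text. As a hedged illustration, the sketch below implements the textbook form of a single-channel 2D convolutional layer (a strided cross-correlation plus a bias); the function name `conv2d` and its arguments are assumptions and may not match the notation of the original figure.

```python
import numpy as np

def conv2d(z, w, b=0.0, stride=1):
    """Valid cross-correlation of a 2D input z with kernel w, plus bias b."""
    kh, kw = w.shape
    out_h = (z.shape[0] - kh) // stride + 1
    out_w = (z.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = z[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * w) + b   # weighted sum over the window
    return out

z = np.arange(16, dtype=float).reshape(4, 4)    # toy 4x4 input
w = np.ones((2, 2))                             # toy 2x2 kernel
feature_map = conv2d(z, w, b=1.0)
```

Each output element is the sum of the kernel weights times the input values under the kernel, plus the bias, which is the role of the weight coefficient and bias described above.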
After the convolutional layer extracts features, the output feature map is passed to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function that replaces the value at a single point of the feature map with a statistic of its neighboring region. The pooling layer selects pooling regions in the same way that a convolution kernel scans the feature map, controlled by the pooling size, stride, and padding. Pooling is generally expressed in the form:
Figure BDA0002978076000000071
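The formula image is not reproduced in this text. The following minimal sketch shows the general idea of pooling under the assumption of no padding: each window of the feature map is replaced by a summary statistic. The function name `pool2d` and the choice of max and mean as statistics are illustrative assumptions.

```python
import numpy as np

def pool2d(a, size=2, stride=2, stat=np.max):
    """Replace each window of the feature map a with a summary statistic."""
    out_h = (a.shape[0] - size) // stride + 1
    out_w = (a.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = a[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = stat(window)
    return out

a = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(a)                  # max pooling
avg_pooled = pool2d(a, stat=np.mean)    # average pooling
```

Unlike the convolutional layer, this operation has no weight coefficients to learn; only the window size and stride are configured.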
In a convolutional neural network, the layer upstream of the output layer is usually a fully connected layer, so the output layer has the same structure and working principle as the output layer of a conventional feedforward neural network. For the human body action recognition problem, the output layer gives the classification labels of the different actions, as shown in FIG. 2.
Multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation: the 2D joint coordinates from all views are concatenated into one batch and fed to a fully connected network that is trained to predict global 3D joint coordinates. The method of bringing the 2D coordinates into the same coordinate system is called the multi-angle information aggregation method. The multi-angle information aggregation method provided by the invention is a novel multi-view human body coordinate system conversion method.
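The concatenation step described above can be sketched as follows. This is only a shape-level illustration: the number of views and joints, the single linear layer, and its random untrained weights are all hypothetical, and a real network would have several trained layers.

```python
import numpy as np

rng = np.random.default_rng(0)
C, J = 4, 17                                     # hypothetical: 4 views, 17 joints

joints_2d = rng.uniform(0, 256, size=(C, J, 2))  # per-view 2D joint coordinates
x = joints_2d.reshape(-1)                        # concatenate all views: 2*C*J values

# One fully connected layer with untrained, random weights (illustrative only).
W_fc = rng.standard_normal((3 * J, x.size)) * 0.01
b_fc = np.zeros(3 * J)
joints_3d = (W_fc @ x + b_fc).reshape(J, 3)      # predicted global 3D coordinates
```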
Multi-angle information aggregation method
The specific form of the multi-angle information aggregation method is algebraic triangulation. We triangulate each joint j separately. The method builds on triangulation of 2D coordinates, where the human joint coordinates come from the heat maps of the different views in the action recognition framework: $H_{c,j} = h^\theta(I_c)_j$. To estimate the 2D joint positions, we first compute a softmax over the spatial axes:
Figure BDA0002978076000000072
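The formula image is not reproduced in this text. Assuming the standard softmax taken over all pixel positions $\mathbf{r} = (r_x, r_y)$ of a heat map with $N_x \times N_y$ pixels and scaled by the parameter α, the step can plausibly be written as follows; the symbols $H'_{c,j}$, $N_x$, and $N_y$ are introduced here for illustration.

```latex
H'_{c,j}(\mathbf{r}) =
  \frac{\exp\bigl(\alpha H_{c,j}(\mathbf{r})\bigr)}
       {\sum_{q_x=1}^{N_x} \sum_{q_y=1}^{N_y} \exp\bigl(\alpha H_{c,j}(\mathbf{q})\bigr)}
```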
The parameter α is discussed below. We then calculate the center of mass of each joint's 2D heat map as the position estimate of that joint (the so-called soft-argmax):
Figure BDA0002978076000000073
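The formula image is not reproduced in this text. Assuming the usual definition of soft-argmax as the center of mass of a normalized heat map, and writing $H'_{c,j}$ for the heat map after the spatial softmax and $N_x \times N_y$ for its size in pixels (both symbols are illustrative assumptions), the estimate can plausibly be written as:

```latex
x_{c,j} = \sum_{r_x=1}^{N_x} \sum_{r_y=1}^{N_y} \mathbf{r} \, H'_{c,j}(\mathbf{r}),
\qquad \mathbf{r} = (r_x, r_y)
```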
An important feature of soft-argmax is that it does not take the index of the maximum, which allows gradients to be back-propagated to the heat map $H_c$. Because the 2D human recognition framework is pre-trained with a separate loss, we multiply the heat maps by an inverse temperature parameter α, so that at the start of training soft-argmax outputs a position close to the heat-map maximum.
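A minimal NumPy sketch of the spatial softmax followed by soft-argmax is given below, to show the effect of α. The function name `soft_argmax`, the toy 5 × 5 heat map, and the specific α values are assumptions chosen for illustration; no particular value of α is stated in the text.

```python
import numpy as np

def soft_argmax(heatmap, alpha=1.0):
    """Position estimate: center of mass of softmax(alpha * heatmap)."""
    scaled = alpha * heatmap
    weights = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    weights /= weights.sum()                  # spatial softmax
    rows, cols = np.indices(heatmap.shape)
    return np.array([np.sum(cols * weights), np.sum(rows * weights)])  # (x, y)

heatmap = np.zeros((5, 5))
heatmap[1, 3] = 1.0                           # peak at x=3, y=1

x_soft = soft_argmax(heatmap, alpha=1.0)      # pulled toward the map center
x_sharp = soft_argmax(heatmap, alpha=100.0)   # close to the argmax (3, 1)
```

With a small α the estimate is a blurred average over the whole map; with a large α it approaches the location of the maximum while remaining a smooth function of the heat-map values.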
To infer the 3D joint positions from the 2D joint positions $x_{c,j}$, we use a linear algebraic triangulation method. The method reduces the search for the 3D coordinates of joint $y_j$ to solving an overdetermined system of equations on the homogeneous 3D coordinate vector of the joint:
$A_j y_j = 0$
where
Figure BDA0002978076000000081
is the matrix built from the camera projection matrices and $x_{c,j}$.
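The homogeneous system above can be solved with a singular value decomposition. The sketch below uses the standard direct linear transformation construction, in which each view contributes two rows to the matrix; the function name `triangulate`, the row layout, and the two example cameras are assumptions for illustration and may differ from the matrix in the original figure.

```python
import numpy as np

def triangulate(proj_mats, points_2d):
    """Linear triangulation of one joint from C views by solving A y = 0."""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])     # constraint from the x image coordinate
        rows.append(v * P[2] - P[1])     # constraint from the y image coordinate
    A = np.stack(rows)                   # shape (2C, 4)
    _, _, vt = np.linalg.svd(A)
    y_h = vt[-1]                         # singular vector of the smallest singular value
    return y_h[:3] / y_h[3]              # back from homogeneous coordinates

# Two hypothetical cameras observing the point (0.1, -0.2, 0.3).
K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), [[0.0], [0.0], [5.0]]])
P2 = K @ np.hstack([np.eye(3), [[-1.0], [0.0], [5.0]]])
y_true = np.array([0.1, -0.2, 0.3, 1.0])
obs = [(P @ y_true)[:2] / (P @ y_true)[2] for P in (P1, P2)]

y_est = triangulate([P1, P2], obs)
```

With noise-free observations the recovered point matches the true one; with noisy observations the same code returns the least-squares solution of the overdetermined system.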
The naive triangulation algorithm assumes that the joint coordinates from each view are independent of one another and therefore all contribute comparably to the triangulation. However, in some views the 2D position of a joint cannot be estimated reliably (for example, because the joint is occluded), which degrades the final triangulation result. The problem is aggravated because the method optimizes the algebraic reprojection error, which is poorly balanced across views. It can be addressed by using RANSAC together with the Huber loss (used to score the reprojection errors of the inliers). However, this has drawbacks of its own; for example, RANSAC may completely cut off the gradient flow to the excluded cameras. To solve these problems, we add learnable weights $w_c$ to the rows of the coefficient matrix that correspond to the different views:
Figure BDA0002978076000000082
where $w_j = (\omega_{1,j}, \omega_{2,j}, \dots, \omega_{C,j})$ and the operator $\circ$ denotes the Hadamard product. The weight $\omega_{c,j}$ is the output of the convolutional neural network
Figure BDA0002978076000000083
given by:
Figure BDA0002978076000000084
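To illustrate how per-view weights change the triangulation result, the sketch below scales the rows of the coefficient matrix by a confidence per camera before solving the system. It assumes that each view contributes two rows and that one weight applies to both of them; the confidences are set by hand here, whereas in the method they are produced by a convolutional network. The function name and camera setup are hypothetical.

```python
import numpy as np

def weighted_triangulate(proj_mats, points_2d, confidences):
    """Triangulate one joint, scaling each view's two rows of A by its confidence."""
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, confidences):
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    _, _, vt = np.linalg.svd(np.stack(rows))
    y_h = vt[-1]
    return y_h[:3] / y_h[3]

K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]])
shifts = ([0.0, 0.0, 5.0], [-1.0, 0.0, 5.0], [0.0, -1.0, 5.0])
cams = [K @ np.hstack([np.eye(3), np.array(t).reshape(3, 1)]) for t in shifts]

y_true = np.array([0.1, -0.2, 0.3])
obs = [(P @ np.append(y_true, 1.0)) for P in cams]
obs = [o[:2] / o[2] for o in obs]
obs[2] = obs[2] + 50.0                   # third view is badly wrong, e.g. occluded

y_equal = weighted_triangulate(cams, obs, [1.0, 1.0, 1.0])
y_conf = weighted_triangulate(cams, obs, [1.0, 1.0, 1e-3])
err_equal = np.linalg.norm(y_equal - y_true)
err_conf = np.linalg.norm(y_conf - y_true)
```

Down-weighting the unreliable view reduces the error while keeping the operation differentiable with respect to the weights, which is the stated motivation for preferring learned weights over RANSAC.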
Overview of the method based on algebraic triangulation with learned confidences. The input to the method is a set of RGB images with known camera parameters. The 2D human recognition algorithm produces joint heat maps and per-camera joint confidences. By applying soft-argmax, the 2D position of each joint is inferred from its 2D heat map. The 2D positions and confidences are passed together to an algebraic triangulation module, which outputs the triangulated 3D pose. All modules allow gradients to be back-propagated, so the model can be trained end-to-end.
With the rapid development of modern network and computer technology, society is gradually moving into the age of information and intelligence. Human pose recognition is the process of using a computer to process, analyze, and understand an input video or image sequence in order to obtain a high-level semantic interpretation and an automatic judgment of the human pose. The technology has broad application and development prospects in many fields, including intelligent building monitoring, moving object analysis, virtual reality, perceptual interfaces, film and game motion recording, and military target recognition. Human pose is recognized on the basis of human skeleton features; a skeleton is a topological description of an object and is widely used in fields such as road inquiry, path planning, and feature recognition. The main goal and content of the invention is to find a framework that is simple and convenient to compute.
Regarding the connection between the skeleton tracking principle and our research: common skeleton tracking simply uses the image information of a single camera and fits it directly with an ordinary CNN, so its performance depends entirely on the richness of the data set. To deal with problems such as occlusion of human limbs, we use multiple cameras to handle limbs that are not visible in a given view, and we improve the accuracy of the recognition result by using high-accuracy 2D pose estimation and converting the result into a 3D pose through triangulation.
The present invention introduces two novel methods for multi-view 3D human pose estimation based on learnable triangulation, which achieve state-of-the-art performance on the Human3.6M dataset. The proposed solution greatly reduces the number of views required for high accuracy and produces smooth pose sequences on the CMU Panoptic dataset without any temporal processing, which could potentially improve the annotation of that dataset. We speculate that the method is robust to occlusion and to partial views of a person because of what it learns about human pose. Another important advantage of the method is that it explicitly takes the camera parameters as an independent input. Finally, if the approximate location of the person is known, volumetric triangulation can also be generalized to monocular images, yielding results close to the state of the art.

Claims (4)

1. A 3D human body action recognition algorithm under multi-view conditions, characterized in that the algorithm is divided into single-view 3D pose estimation and multi-view 3D pose estimation:
single-view 3D pose estimation is divided into two subcategories: the first category uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D separately through a fully connected, convolutional, or recurrent deep neural network; the second category infers the 3D coordinates directly from the image using a deep convolutional neural network; the 3D human body action recognition algorithm uses the first category as its main framework and uses a deep convolutional neural network as the high-quality 2D pose estimation engine;
multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation: the 2D joint coordinates from all views are concatenated into one batch and fed to a fully connected network that is trained to predict the global 3D joint coordinates; the method of bringing the 2D coordinates into the same coordinate system is called the multi-angle information aggregation method.
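As a rough illustration of the concatenation step in claim 1, the sketch below flattens the 2D joints of all views into one vector and passes it through a small fully connected network. The layer sizes, the ReLU activation, the random (untrained) weights, and the function name `lift_multiview_2d_to_3d` are assumptions made for this example; the claim does not specify them.

```python
import numpy as np

def lift_multiview_2d_to_3d(joints_2d, w1, b1, w2, b2):
    """Map 2D joints from all views to global 3D joints.

    joints_2d: array of shape (C, J, 2), C views and J joints.
    Returns an array of shape (J, 3).
    """
    x = joints_2d.reshape(-1)               # concatenate all views into one vector
    hidden = np.maximum(0.0, w1 @ x + b1)   # hidden layer with ReLU (assumed)
    return (w2 @ hidden + b2).reshape(-1, 3)

# Hypothetical sizes: 4 cameras, 17 joints, 64 hidden units.
rng = np.random.default_rng(0)
C, J, H = 4, 17, 64
w1 = rng.normal(scale=0.1, size=(H, C * J * 2))
b1 = np.zeros(H)
w2 = rng.normal(scale=0.1, size=(J * 3, H))
b2 = np.zeros(J * 3)

pose_3d = lift_multiview_2d_to_3d(rng.uniform(size=(C, J, 2)), w1, b1, w2, b2)
print(pose_3d.shape)
```

With untrained weights the output is meaningless as a pose; the sketch only shows how the per-view 2D coordinates are aggregated into a single input and how the output is reshaped into per-joint 3D coordinates.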
2. The 3D human motion recognition algorithm under the multi-view condition according to claim 1, characterized in that:
the deep convolutional neural network is a feedforward neural network that involves convolution computations and has a deep multilayer structure; its input layer can accept multidimensional data: a one-dimensional array is usually time-series data, a two-dimensional array is in most cases a grayscale image, and the input layer of the convolutional neural network used here receives the three-dimensional array of an RGB image;
the hidden layers of the deep convolutional neural network comprise three types of structure: convolutional layers, pooling layers, and fully connected layers; the convolution kernels in a convolutional layer contain weight coefficients, whereas a pooling layer contains none; the function of a convolutional layer is to extract features from the input data, and it contains a plurality of convolution kernels, each element of which corresponds to one weight coefficient and one bias value, similar to a neuron of a feedforward neural network; the convolutional layer is computed as follows:
Figure FDA0002978075990000011
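The convolution step described in this claim can be sketched as follows. This is a minimal NumPy illustration rather than the formula from the figure: it assumes square kernels, zero padding, and one bias per output channel, and the name `conv2d` is hypothetical.

```python
import numpy as np

def conv2d(feature_map, kernels, biases, stride=1, padding=0):
    """Convolution (cross-correlation) as used in CNN layers.

    feature_map: (K_in, H, W) input channels.
    kernels:     (K_out, K_in, f, f) weight coefficients.
    biases:      (K_out,) one bias per output channel (assumption).
    """
    k_out, _, f, _ = kernels.shape
    # Zero-pad only the two spatial axes.
    x = np.pad(feature_map, ((0, 0), (padding, padding), (padding, padding)))
    h_out = (x.shape[1] - f) // stride + 1
    w_out = (x.shape[2] - f) // stride + 1
    out = np.zeros((k_out, h_out, w_out))
    for o in range(k_out):
        for i in range(h_out):
            for j in range(w_out):
                r, c = i * stride, j * stride
                patch = x[:, r:r + f, c:c + f]
                # Weighted sum over channels and the f x f window, plus bias.
                out[o, i, j] = np.sum(patch * kernels[o]) + biases[o]
    return out
```

The explicit loops are slow but make the roles of kernel size, stride, and padding visible; a real network would use an optimized library routine.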
after the convolutional layer has extracted features, the output feature map is passed to the pooling layer for feature selection and information filtering; the pooling layer contains a preset pooling function whose role is to replace the value at a single point of the feature map with a statistic of the feature map over the neighboring region; the pooling layer selects pooling regions in the same way that a convolution kernel scans the feature map, controlled by the pooling size, the stride, and the padding; pooling is generally expressed in the following form:
Figure FDA0002978075990000021
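The pooling operation can be illustrated with a short sketch. It covers only max and average pooling and omits padding for brevity; these choices and the name `pool2d` are assumptions for the example, not the general form shown in the figure.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Replace each window of the feature map with a single statistic.

    feature_map: (K, H, W); pooling is applied to each channel independently.
    mode: "max" or "mean". Padding is omitted in this sketch.
    """
    k, h, w = feature_map.shape
    h_out = (h - size) // stride + 1
    w_out = (w - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((k, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            r, c = i * stride, j * stride
            window = feature_map[:, r:r + size, c:c + size]
            # Reduce over the two spatial axes; no weights are involved.
            out[:, i, j] = reduce_fn(window, axis=(1, 2))
    return out
```

Unlike the convolutional layer, this function has no learnable parameters, which matches the statement that a pooling layer contains no weight coefficients.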
the layer immediately upstream of the output layer in the convolutional neural network is usually a fully connected layer, and the structure and working principle of the output layer are the same as those of the output layer in a traditional feedforward neural network.
3. The 3D human motion recognition algorithm under the multi-view condition according to claim 1, characterized in that:
the multi-angle information aggregation method is a method for converting human body coordinates across multiple views, and its specific form is algebraic triangulation; each joint j is processed separately; the method builds on triangulation of 2D coordinates, where the human joint coordinates come from the heat maps produced for the different views by the 2D recognition framework, $H_{c,j} = h_\theta(I_c)_j$; to estimate the 2D joint positions, a softmax over the spatial axes is first computed:
Figure FDA0002978075990000022
second, the center of mass of each joint's 2D heat map is computed as the position estimate of that joint; this operation is called soft-argmax:
Figure FDA0002978075990000023
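The two steps above (spatial softmax followed by the center of mass) can be sketched as below for a single joint heat map. The function name `soft_argmax_2d` and the (x, y) output order are assumptions; `alpha` plays the role of a multiplier applied to the heat map before the softmax.

```python
import numpy as np

def soft_argmax_2d(heatmap, alpha=1.0):
    """Differentiable position estimate from one joint heat map.

    heatmap: (H, W) array of raw scores.
    Returns the expected (x, y) pixel position under the spatial softmax.
    """
    h, w = heatmap.shape
    logits = alpha * heatmap
    logits = logits - logits.max()      # subtract max for numerical stability
    prob = np.exp(logits)
    prob /= prob.sum()                  # softmax over both spatial axes
    ys, xs = np.mgrid[0:h, 0:w]         # pixel row and column indices
    # Center of mass of the normalized heat map.
    return np.array([np.sum(xs * prob), np.sum(ys * prob)])
```

Because the result is a weighted average rather than an index lookup, it is a smooth function of every heat map value. A large `alpha` sharpens the softmax so that the estimate approaches the location of the maximum, while a flat heat map yields the image center.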
an important feature of soft-argmax is that it does not take the index of the maximum, which allows gradients to be back-propagated to the heat maps $H_c$; because the two-dimensional human body recognition framework is pre-trained with a loss function, the joint heat maps are rescaled by multiplying them by an inverse temperature parameter $\alpha$, so that soft-argmax outputs the most probable position in the initial stage of training;
the three-dimensional joint positions are inferred from the 2D joint positions $x_{c,j}$ using a linear algebraic triangulation method, which reduces the search for the 3D coordinates of a joint $y_j$ to solving an overdetermined system of equations on the homogeneous 3D coordinate vector $\tilde{y}_j$ of the joint:
$A_j \tilde{y}_j = 0$
wherein
Figure FDA0002978075990000024
is the projection matrix associated with $x_{c,j}$.
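A common way to solve such a homogeneous overdetermined system is the direct linear transform with a singular value decomposition. The sketch below assumes that each camera contributes two rows built from its 3x4 projection matrix and the 2D observation; the helper names `project`, `build_system`, and `triangulate` are hypothetical, and the SVD solver is a standard choice rather than something the claim specifies.

```python
import numpy as np

def project(P, point_3d):
    """Project a 3D point with a 3x4 projection matrix to pixel coordinates."""
    xh = P @ np.append(point_3d, 1.0)
    return xh[:2] / xh[2]

def build_system(projections, points_2d):
    """Stack two equations per camera into a (2C, 4) matrix A."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    return np.stack(rows)

def triangulate(projections, points_2d):
    """Least-squares solution of A y = 0 with unit-norm y, via SVD."""
    A = build_system(projections, points_2d)
    _, _, vt = np.linalg.svd(A)
    y_h = vt[-1]                 # right singular vector of the smallest singular value
    return y_h[:3] / y_h[3]      # back from homogeneous to 3D coordinates

# Two hypothetical cameras one unit apart along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
joint = np.array([0.5, 0.2, 4.0])
estimate = triangulate([P1, P2], [project(P1, joint), project(P2, joint)])
print(estimate)
```

With noise-free observations the estimate coincides with the original point up to floating-point error; with noisy 2D detections the SVD returns the least-squares solution.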
4. The 3D human motion recognition algorithm under the multi-view condition according to claim 1, characterized in that:
the linear triangulation method is as follows: the joint coordinates of each view are assumed to be independent of each other and therefore all contribute comparably to the triangulation; learnable weights $w_c$ are added to the coefficients of the matrix corresponding to the different views:
Figure FDA0002978075990000031
$w_j = (\omega_{1,j}, \omega_{2,j}, \ldots, \omega_{C,j})$;
Figure FDA0002978075990000032
The operator $\circ$ denotes the Hadamard product; the weight $\omega_{c,j}$ is estimated by the convolutional neural network
Figure FDA0002978075990000033
whose output is:
Figure FDA0002978075990000034
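The effect of per-view weights can be illustrated by scaling the rows of the triangulation system before solving it. In this sketch the weights are given by hand rather than predicted by a network, and both rows belonging to camera c are multiplied by the same weight; both points, and the name `weighted_triangulate`, are assumptions for the example.

```python
import numpy as np

def weighted_triangulate(projections, points_2d, weights):
    """Solve (w o A) y = 0, where each camera's two rows share one weight."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                                   # (2C, 4)
    w = np.repeat(np.asarray(weights, dtype=float), 2)   # one weight per row
    _, _, vt = np.linalg.svd(w[:, None] * A)             # row-wise scaling of A
    y_h = vt[-1]
    return y_h[:3] / y_h[3]

def _project(P, X):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

# Three hypothetical cameras; the third 2D detection is deliberately corrupted.
cams = [np.hstack([np.eye(3), np.array(t, dtype=float).reshape(3, 1)])
        for t in ([0, 0, 0], [-1, 0, 0], [0, -1, 0])]
joint = np.array([0.5, 0.2, 4.0])
obs = [_project(P, joint) for P in cams]
obs[2] = obs[2] + 0.1                                    # simulated bad detection

uniform = weighted_triangulate(cams, obs, [1.0, 1.0, 1.0])
downweighted = weighted_triangulate(cams, obs, [1.0, 1.0, 0.0])
print(np.linalg.norm(uniform - joint), np.linalg.norm(downweighted - joint))
```

Setting the weight of the unreliable view to zero removes its influence and recovers the joint from the remaining two views, whereas uniform weights let the corrupted detection pull the estimate away. In the claimed method the weights would come from the network output rather than being fixed manually.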
the input to the method is a set of RGB images with known camera parameters; the 2D human recognition algorithm generates joint heat maps and camera-joint confidences. By applying soft-argmax, the 2D positions of the joints are inferred from the 2D joint heat maps; the 2D positions, together with the confidences, are passed to an algebraic triangulation module, which outputs the triangulated 3D pose. All modules allow gradients to be back-propagated, so the model can be trained end-to-end.
CN202110280476.5A 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition Active CN114036969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110280476.5A CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110280476.5A CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Publications (2)

Publication Number Publication Date
CN114036969A CN114036969A (en) 2022-02-11
CN114036969B CN114036969B (en) 2023-07-25

Family

ID=80134245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110280476.5A Active CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Country Status (1)

Country Link
CN (1) CN114036969B (en)

Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120163675A1 (en) * 2010-12-22 2012-06-28 Electronics And Telecommunications Research Institute Motion capture apparatus and method
US20130271458A1 (en) * 2012-04-11 2013-10-17 Disney Enterprises, Inc. Modeling human-human interactions for monocular 3d pose estimation
US20150278589A1 (en) * 2014-03-27 2015-10-01 Avago Technologies General Ip (Singapore) Pte. Ltd. Image Processor with Static Hand Pose Recognition Utilizing Contour Triangulation and Flattening
CN106780569A (en) * 2016-11-18 2017-05-31 深圳市唯特视科技有限公司 A kind of human body attitude estimates behavior analysis method
CN107945282A (en) * 2017-12-05 2018-04-20 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) The synthesis of quick multi-view angle three-dimensional and methods of exhibiting and device based on confrontation network
CN108389227A (en) * 2018-03-01 2018-08-10 深圳市唯特视科技有限公司 A kind of dimensional posture method of estimation based on multiple view depth perceptron frame
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Estimation method of human posture and device, electronic equipment, storage medium, program
CN109087329A (en) * 2018-07-27 2018-12-25 中山大学 Human body three-dimensional joint point estimation frame and its localization method based on depth network
US20190147245A1 (en) * 2017-11-14 2019-05-16 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN110427877A (en) * 2019-08-01 2019-11-08 大连海事大学 A method of the human body three-dimensional posture estimation based on structural information
CN110543581A (en) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) Multi-view three-dimensional model retrieval method based on non-local graph convolution network
CN110598590A (en) * 2019-08-28 2019-12-20 清华大学 Close interaction human body posture estimation method and device based on multi-view camera
CN110766746A (en) * 2019-09-05 2020-02-07 南京理工大学 3D driver posture estimation method based on combined 2D-3D neural network
CN111382300A (en) * 2020-02-11 2020-07-07 山东师范大学 Multi-view three-dimensional model retrieval method and system based on group-to-depth feature learning
US20200234398A1 (en) * 2019-01-22 2020-07-23 Fyusion, Inc Extraction of standardized images from a single view or multi-view capture
CN111523377A (en) * 2020-03-10 2020-08-11 浙江工业大学 Multi-task human body posture estimation and behavior recognition method
CN111583386A (en) * 2020-04-20 2020-08-25 清华大学 Multi-view human body posture reconstruction method based on label propagation algorithm
CN111738220A (en) * 2020-07-27 2020-10-02 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium
CN111815757A (en) * 2019-06-29 2020-10-23 浙江大学山东工业技术研究院 Three-dimensional reconstruction method for large component based on image sequence
US20200342270A1 (en) * 2019-04-26 2020-10-29 Tata Consultancy Services Limited Weakly supervised learning of 3d human poses from 2d poses
US10853970B1 (en) * 2019-03-22 2020-12-01 Bartec Corporation System for estimating a three dimensional pose of one or more persons in a scene
WO2020250046A1 (en) * 2019-06-14 2020-12-17 Wrnch Inc. Method and system for monocular depth estimation of persons
US20210019507A1 (en) * 2019-07-19 2021-01-21 Sri International Centimeter human skeleton pose estimation
CN112329513A (en) * 2020-08-24 2021-02-05 苏州荷露斯科技有限公司 High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ROSS A. CLARK ET.AL: "Three-dimensional cameras and skeleton pose tracking for physical function assessment: A review of uses, validity, current developments and Kinect alternatives", 《GAIT & POSTURE》, vol. 68, pages 193 - 200 *
曹明伟: "数据驱动的多视图三维重建", 《中国博士学位论文全文数据库 信息科技辑》 *
陈秋敏: "基于深度学习的多视图物体三维重建研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863556A (en) * 2022-04-13 2022-08-05 上海大学 Multi-neural-network fusion continuous action recognition method based on skeleton posture
CN116310217A (en) * 2023-03-15 2023-06-23 精创石溪科技(成都)有限公司 Method for dynamically evaluating muscles in human body movement based on three-dimensional digital image correlation method
CN116310217B (en) * 2023-03-15 2024-01-30 精创石溪科技(成都)有限公司 Method for dynamically evaluating muscles in human body movement based on three-dimensional digital image correlation method
CN116403288A (en) * 2023-04-28 2023-07-07 中南大学 Motion gesture recognition method and device and electronic equipment

Also Published As

Publication number Publication date
CN114036969B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN110135375B (en) Multi-person attitude estimation method based on global information integration
CN107204010B (en) A kind of monocular image depth estimation method and system
CN111968217B (en) SMPL parameter prediction and human body model generation method based on picture
CN106780543B (en) A kind of double frame estimating depths and movement technique based on convolutional neural networks
CN114036969B (en) 3D human body action recognition algorithm under multi-view condition
WO2017133009A1 (en) Method for positioning human joint using depth image of convolutional neural network
CN113205595B (en) Construction method and application of 3D human body posture estimation model
CN111062326B (en) Self-supervision human body 3D gesture estimation network training method based on geometric driving
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN110399809A (en) The face critical point detection method and device of multiple features fusion
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
CN111199207B (en) Two-dimensional multi-human body posture estimation method based on depth residual error neural network
CN113989928B (en) Motion capturing and redirecting method
CN111598995B (en) Prototype analysis-based self-supervision multi-view three-dimensional human body posture estimation method
CN112258555A (en) Real-time attitude estimation motion analysis method, system, computer equipment and storage medium
CN111191630A (en) Performance action identification method suitable for intelligent interactive viewing scene
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
Liu Aerobics posture recognition based on neural network and sensors
CN114882493A (en) Three-dimensional hand posture estimation and recognition method based on image sequence
Yang et al. Human action recognition based on skeleton and convolutional neural network
Kurmankhojayev et al. Monocular pose capture with a depth camera using a Sums-of-Gaussians body model
CN115810219A (en) Three-dimensional gesture tracking method based on RGB camera
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN112419387B (en) Unsupervised depth estimation method for solar greenhouse tomato plant image
CN114548224A (en) 2D human body pose generation method and device for strong interaction human body motion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant