CN114036969B - 3D human body action recognition algorithm under multi-view condition - Google Patents


Info

Publication number
CN114036969B
CN114036969B (application CN202110280476.5A; published as CN114036969A)
Authority
CN
China
Prior art keywords
layer
joint
neural network
coordinates
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110280476.5A
Other languages
Chinese (zh)
Other versions
CN114036969A (en)
Inventor
石昕
邵慧杨
翟庆庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202110280476.5A
Publication of CN114036969A
Application granted
Publication of CN114036969B
Legal status: Active (granted)

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a 3D human body action recognition algorithm under multi-view conditions, divided into single-view 3D pose estimation and multi-view 3D pose estimation. Single-view 3D pose estimation falls into two subcategories: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a deep neural network; the second uses convolutional neural networks to infer 3D coordinates directly from images. Multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation: the 2D joint coordinates from all views are concatenated into one batch and fed to a fully connected network trained to predict global 3D joint coordinates. The advantage of the invention is that it provides a 3D human body action recognition algorithm under multi-view conditions that detects and recognizes human actions with a computer vision recognition algorithm and converts them into a data display the user can understand.

Description

3D human body action recognition algorithm under multi-view condition
Technical Field
The invention relates to the fields of computer vision recognition, real-time data visualization, and big-data parallel processing, in particular to a 3D human body action recognition algorithm under multi-view conditions.
Background
With the development and progress of society, human behavior recognition technology plays an increasingly important role and has a wide range of application scenarios. Three-dimensional human body model reconstruction and motion recognition are hot topics in current computer vision research. They aim to extract and analyze the motion in a video through various image-processing and recognition/classification techniques and to construct a complete three-dimensional human body model in order to judge the action performed by the person in the video, thereby obtaining useful information; the applications are very broad. Human behavior recognition technology can be applied to video surveillance (schools, canteens, companies, and similar environments), human-computer interaction (scenes such as train stations), automatic commentary for football or basketball, and other fields.
Furthermore, human pose recognition is a very important area of computer vision. Depending on the final goal and the assumptions made, several different research directions can be distinguished:
(1) Predicting two-dimensional or three-dimensional human motion.
(2) Predicting human motion from a single frame or from a sequence of frames in a video.
(3) Predicting human motion from a single camera or from multiple cameras.
In this invention we focus only on recognizing human actions in three-dimensional space, within a fixed frame range, under multi-camera conditions. From a broader perspective, the action detection framework provided by the invention can serve as a unified recognition framework that recognizes human actions in both 2D and 3D.
3D human motion recognition is a fundamental problem in computer vision, with applications in sports action recognition, computer-aided live broadcasting, human-computer interaction, special-effects production, and so on. Most conventional algorithms currently focus on 3D human motion prediction from a single view. Although scholars have recently produced much related work, recognition of human motion under multi-camera conditions is far from solved. Therefore, the invention provides a 3D human motion recognition algorithm under multi-view conditions.
Human action recognition under multi-view conditions has high research value for two reasons. First, in complex outdoor scenes, multi-view human motion recognition is indisputably the best motion recognition method, because competing technologies such as marker-based motion capture and visual-inertial methods have limitations, for example the inability to capture rich pose representations (estimating hand, face, and limb poses) among various other restrictions. A disadvantage of previous work is that datasets were built with multi-view triangulation relying on too many, almost impractical, views to obtain 3D ground-truth actions of sufficient quality. This makes collecting new datasets for 3D pose recognition very challenging, and there is an urgent need to reduce the number of views required for accurate triangulation. Second, in some cases the algorithm can directly use a human pose tracking algorithm to track the pose in real time and thereby achieve the final goal of recognizing the action, because multi-camera configurations are becoming increasingly available in applications such as sports or computer-assisted living. In such cases the accuracy of modern multi-view methods is comparable to that of well-developed monocular methods. Thus, improving the accuracy of multi-view pose estimation from few views is a significant challenge with direct practical applications.
Disclosure of Invention
The invention aims to provide a 3D human body action recognition algorithm under multi-view conditions that detects and recognizes human actions with a computer vision recognition algorithm and converts them into a data display the user can understand.
The technical scheme adopted by the invention is as follows: a 3D human motion recognition algorithm under multi-view conditions, characterized in that 3D pose estimation is performed by a multi-angle information aggregation method after multi-view 2D pose estimation.
Regarding single-view 3D pose estimation, two subcategories exist: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a deep neural network (fully connected, convolutional, or recurrent); the second uses deep convolutional neural networks to infer 3D coordinates directly from images. The 3D human motion recognition algorithm uses the first type of method as its main framework, with a deep convolutional neural network as the high-quality 2D pose estimation engine;
regarding multi-view 3D pose estimation, which aims to obtain ground-truth annotations for monocular 3D human pose estimation, the 2D joint coordinates from all views are concatenated into one batch and used as input to a fully connected network trained to predict global 3D joint coordinates. The method that concatenates the 2D coordinates in the same coordinate system is called the multi-angle information aggregation method.
A deep convolutional neural network is a feedforward neural network that involves convolution operations and has a deep multi-layer structure. Its input layer can process multidimensional data: the input layer of a one-dimensional convolutional neural network receives one- or two-dimensional arrays (or even three-dimensional data), where one-dimensional arrays are usually time series and two-dimensional arrays are mostly grayscale images; the input layer of a two-dimensional convolutional neural network receives the three-dimensional array of an RGB image;
the hidden layer of the deep convolutional neural network comprises a convolutional layer, a pooling layer and a full-connection layer 3 type structure; the convolution kernel in the convolution layer contains weight coefficients, the pooling layer does not contain weight coefficients, the function of the convolution layer is to perform feature extraction on input data, the convolution layer internally contains a plurality of convolution kernels, each element composing the convolution kernels corresponds to a weight coefficient and a deviation amount, and the convolution kernels are similar to neurons of a feedforward neural network; the algorithm of the convolution layer is as follows:
After feature extraction by the convolutional layer, the output feature map is passed to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function whose role is to replace the value at a single point of the feature map with a statistic of its neighboring region. The pooling layer selects pooling regions in the same way the convolution kernel scans the feature map, controlled by pooling size, stride, and padding. The general expression is:
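The pooling expression is likewise missing from this text. The standard Lp-pooling form that matches the description (pooling size f, stride s_0) is:

A_k^l(i, j) = [ Σ_{x=1..f} Σ_{y=1..f} A_k^l(s_0·i + x, s_0·j + y)^p ]^{1/p}

where p = 1 corresponds to average pooling (up to a constant factor) and p → ∞ to max pooling; again a standard reconstruction, not the patent's original figure.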
the output layer upstream of the convolutional neural network is usually a fully-connected layer, and the structure and the working principle of the convolutional neural network are the same as those of the output layer of the traditional feedforward neural network.
The multi-angle information aggregation method is a method for converting between multi-angle human body coordinate systems, whose concrete form is an algebraic triangulation transform. Each joint j is processed separately using the triangulation transform. The method builds on triangulation in 2D coordinates, where the human joint coordinate information comes from heat maps taken at different angles in the action recognition framework: H_{c,j} = h_θ(I_c)_j. To estimate the 2D joint position information, a softmax over the spatial axes is first computed:
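The softmax expression itself is not reproduced in this text. Consistent with the inverse temperature parameter α discussed below, the standard spatial softmax over an H×W heat map is:

H'_{c,j} = exp(α·H_{c,j}) / Σ_{r_x=1..W} Σ_{r_y=1..H} exp(α·H_{c,j}(r_x, r_y))

(a reconstruction under the stated assumptions, not the patent's original figure).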
Next, the center of mass of each joint's 2D position heat map is computed as that joint's position estimate; this operation is called soft-argmax.
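The soft-argmax expression is also missing; the standard center-of-mass form consistent with the description is:

x_{c,j} = Σ_{r_x=1..W} Σ_{r_y=1..H} r · H'_{c,j}(r), with r = (r_x, r_y),

i.e. the expected 2D coordinate under the softmax-normalized heat map.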
an important feature of Soft-argmax is that the index of the maximum feature is not obtained, and the heat map H is convenient c Carrying out gradient back propagation; the two-dimensional human body recognition frame uses Loss to pretrain, the joint heat in the graph is adjusted by multiplying the heat graph and the reverse heat parameter alpha, and the maximum possible position is output at the beginning stage of the training process of soft-argmax;
From the 2D joint position information x_{c,j}, three-dimensional joint position information is inferred using a linear triangulation method, which reduces the search space for the 3D coordinates of joint y_j by solving an overdetermined system of equations on the homogeneous 3D coordinate vector of the joint:

A_j y_j = 0

where A_j is the matrix composed from the components of x_{c,j} and the camera projection matrices.
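The description does not spell out how the overdetermined system is solved. In the standard direct linear transform (DLT) treatment, the solution is the unit vector minimizing ||A_j y||, i.e. the right singular vector of A_j associated with its smallest singular value:

y_j = argmin_{||y||=1} ||A_j y||_2, obtained as the last row of Vᵀ in the SVD A_j = U·Σ·Vᵀ.

This is a standard completion, not text taken from the patent.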
The linear triangulation method is as follows: assuming the joint coordinates of each view are independent of one another, they all contribute comparably to the triangulation; learnable weights w_c are therefore attached to the coefficient matrices under different angles:

w_j = (ω_{1,j}, ω_{2,j}, …, ω_{C,j}), giving the weighted system (w_j ∘ A_j) y_j = 0,

where the ∘ operator denotes the Hadamard product and the weight ω_{c,j} is the output of a convolutional neural network. The input to the method is a set of RGB images with known camera parameters. The 2D human recognition algorithm generates a heat map of each joint and a confidence for each camera's joints; by applying soft-argmax, the 2D position of the joint is inferred from the 2D joint heat map, and the 2D positions and confidences are passed together to an algebraic triangulation module that outputs the triangulated 3D pose. All modules allow gradients to be back-propagated, so the model can be trained end to end.
The advantages of the first class of single-view 3D pose estimation are: it is simple and fast, it can be trained (with skeleton/view augmentation) on motion-capture data, and the 2D skeleton can be swapped after training.
The advantages of multi-view 3D pose estimation include: the approach can effectively use information from different views and can be trained on motion-capture data.
In fact, few current mainstream studies use volumetric pose representations in a multi-view setup; in particular, they unproject 2D keypoint probability heat maps (obtained from a pre-trained 2D keypoint detector) into a volume and then aggregate them in a non-learnable way. Our work differs in two respects. First, we process the information within the volume in a learnable manner. Second, we train the network end to end, thereby tuning the 2D backbone and alleviating the need for interpretable 2D heat maps. This allows several self-consistent pose hypotheses to be transferred from the 2D detector to the volumetric aggregation stage, which previous designs could not do.
There have also been studies that use a multi-stage method to infer the 3D pose from 2D joint coordinates with an external 3D pose prior. In the first stage, the images from all views are passed through a deep convolutional neural network to obtain 2D joint heat maps. The maximum locations in the heat maps are used jointly to infer the 3D pose by optimizing latent coordinates in the 3D pose prior space. At each subsequent stage, the 3D pose is re-projected to all camera views and fused with the predictions of the previous stage (through a convolutional network). The 3D pose is then re-estimated from the heat-map maxima, and the process is repeated. This procedure allows the 2D joint heat-map predictions to be corrected through indirect global reasoning about the human pose. In contrast to our approach, no gradient flows from the 3D predictions to the 2D heat maps, so there is no direct signal to correct the 3D coordinate predictions.
A 3D human motion recognition algorithm under multi-view conditions recognizes human motion in three-dimensional space, within a fixed frame range, under multi-camera conditions. The action detection framework can serve as a unified recognition framework that recognizes human actions in 2D and 3D simultaneously, and 2D action recognition can be quickly extended to 3D action recognition through this framework. We use this framework to add human bones, joints, and various constraints from the pictures in three-dimensional space.
Regarding the action recognition framework, assume that C cameras have been synchronized to a unified global coordinate system using projection matrices, which facilitates obtaining human data in the scene. Our goal is to estimate the position y_{j,t} of the three-dimensional joint point j ∈ {1, …, J} of the human body at time t in the global coordinate system. For each frame we use an off-the-shelf 2D human detection algorithm, or the bounding box provided with the dataset, to crop the image. The cropped images I_c are then used as training data for the deep convolutional neural network framework.
The deep convolutional neural network framework consists of a ResNet-152 backbone (parameter weights θ, network output g_θ), a transposed convolution layer that outputs a series of intermediate heat maps (output f_θ), and a convolutional neural network with 1×1 kernels that converts the intermediate heat maps into interpretable joint heat maps (output h_θ, with as many output channels as there are joints).
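For illustration only, a minimal PyTorch sketch of such a 2D backbone is given below; the layer sizes, names, and the choice of two transposed-convolution stages are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn
import torchvision

class Backbone2D(nn.Module):
    """Hypothetical sketch: ResNet-152 trunk g_theta, transposed-conv
    head f_theta, and a 1x1 conv h_theta with one channel per joint."""
    def __init__(self, num_joints: int = 17):
        super().__init__()
        resnet = torchvision.models.resnet152(weights=None)
        # g_theta: keep everything up to the last residual stage (drop avgpool/fc)
        self.g_theta = nn.Sequential(*list(resnet.children())[:-2])
        # f_theta: upsample the 2048-channel features into intermediate heat maps
        self.f_theta = nn.Sequential(
            nn.ConvTranspose2d(2048, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # h_theta: 1x1 conv -> one interpretable heat map per joint
        self.h_theta = nn.Conv2d(256, num_joints, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.h_theta(self.f_theta(self.g_theta(image)))  # (B, J, H, W)
```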
The advantage of the invention is that it provides a 3D human body action recognition algorithm under multi-view conditions that detects and recognizes human actions with a computer vision recognition algorithm and converts them into a data display the user can understand.
Drawings
FIG. 1 is a schematic diagram of a method for recognizing human motion from multiple angles in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep convolutional neural network in accordance with one embodiment of the present invention;
fig. 3 is a schematic structural diagram of a multi-angle information aggregation method according to an embodiment of the present invention.
Detailed Description
The invention provides a 3D human motion recognition algorithm under multi-view conditions, characterized in that the human action recognition algorithm is divided into single-view 3D pose estimation and multi-view 3D pose estimation.
Regarding single-view 3D pose estimation, it can be divided into two subcategories: the first uses a high-quality 2D pose estimation engine and then lifts the 2D coordinates to 3D with a deep neural network (fully connected, convolutional, or recurrent); the second uses deep convolutional neural networks to infer 3D coordinates directly from images. The invention uses the first type of method as its main framework, with a deep convolutional neural network as the high-quality 2D pose estimation engine.
Deep convolutional neural network
A deep convolutional neural network is a feedforward neural network that involves convolution operations and has a deep multi-layer structure; it is one of the representative algorithms of deep learning. Deep convolutional neural networks have feature-learning capability and can perform translation-invariant classification of input information according to their hierarchical structure, so they are also called "translation-invariant artificial neural networks". In recent years, convolutional neural networks have achieved striking results on various image recognition tasks. Therefore, the invention uses a deep convolutional neural network as the 2D pose estimation engine; a structural diagram of the convolutional neural network is shown in FIG. 2.
The input layer of a deep convolutional neural network can process multidimensional data. Commonly, the input layer of a one-dimensional convolutional neural network receives one- or two-dimensional arrays (or even three-dimensional data), where one-dimensional arrays are usually time series and two-dimensional arrays are mostly grayscale images; the input layer of a two-dimensional convolutional neural network receives the three-dimensional array of an RGB image.
The hidden layers of the deep convolutional neural network comprise three common structures: convolutional layers, pooling layers, and fully connected layers. Among these, the convolutional layer and the pooling layer are specific to deep convolutional neural networks. The convolution kernels in a convolutional layer contain weight coefficients, while the pooling layer does not. The function of the convolutional layer is to extract features from the input data; it contains multiple convolution kernels, each element of which corresponds to a weight coefficient and a bias, analogous to a neuron of a feedforward neural network. The convolutional layer computes the convolution expression given above.
after the feature extraction is performed by the convolution layer, the output feature map is transferred to the pooling layer for feature selection and information filtering. The pooling layer contains a predefined pooling function that functions to replace the results of individual points in the feature map with the feature map statistics of its neighboring regions. The pooling layer selects pooling area and the step of the convolution kernel scanning characteristic diagram are the same, and the pooling area, step length and filling are controlled. The general expression form is:
the output layer upstream of the convolutional neural network is usually a fully-connected layer, so that the structure and the working principle of the convolutional neural network are the same as those of the output layer of the traditional feedforward neural network. For the human motion recognition problem, the output layer is a classification label of different motions, and the specific expression form is shown in fig. 2.
Regarding multi-view 3D pose estimation, which aims to obtain ground-truth annotations for monocular 3D human pose estimation, the 2D joint coordinates from all views are concatenated into one batch and used as input to a fully connected network trained to predict global 3D joint coordinates. The method that concatenates the 2D coordinates in the same coordinate system is called the multi-angle information aggregation method, a novel multi-angle human body coordinate system conversion method provided by the invention.
Multi-angle information aggregation method
The multi-angle information aggregation method takes the concrete form of an algebraic triangulation transform. We can use the triangulation transform to process each joint j individually. The method builds on triangulation in 2D coordinates, where the human joint coordinate information comes from heat maps taken at different angles in the action recognition framework: H_{c,j} = h_θ(I_c)_j. To estimate the 2D joint position information, we first compute the softmax over the spatial axes, i.e. the spatial softmax with inverse temperature α given above.
the parameter α will be discussed later, and then we calculate the center position of the 2D position information heat map of each node as the position estimate (called soft-argmax) for that node.
An important feature of soft-argmax is that it does not take the index of the maximum feature, which makes gradient back-propagation through the heat map H_c possible. Because the two-dimensional human recognition framework is pre-trained using a loss, we adjust the joint heat in the map by multiplying the heat map by the inverse temperature parameter α, so that soft-argmax outputs the most likely position from the start of the training process.
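The spatial softmax and soft-argmax just described can be written in a few lines of numpy; this is an illustrative sketch with assumed names and shapes, not the patent's code:

```python
import numpy as np

def soft_argmax(H: np.ndarray, alpha: float = 100.0) -> np.ndarray:
    """Differentiable 2D position estimate: spatial softmax over the heat map,
    followed by the center of mass of the resulting probability map."""
    # Spatial softmax with inverse temperature alpha (max subtracted for stability)
    e = np.exp(alpha * (H - H.max()))
    P = e / e.sum()
    # Expected (x, y) coordinate under P
    ys, xs = np.mgrid[0:H.shape[0], 0:H.shape[1]]
    return np.array([(xs * P).sum(), (ys * P).sum()])

# Example: a heat map peaked near (x=12, y=5) yields approximately (12, 5)
H = np.zeros((32, 32))
H[5, 12] = 10.0
print(soft_argmax(H))
```

Subtracting the maximum before exponentiation is a standard numerical-stability trick and does not change the softmax result.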
To infer three-dimensional joint position information from the 2D joint position information x_{c,j}, we use a linear triangulation method. The method reduces the search space for the 3D coordinates of joint y_j by solving an overdetermined system of equations on the homogeneous 3D coordinate vector of the joint:

A_j y_j = 0

where A_j is the matrix composed from the components of x_{c,j} and the camera projection matrices.
The naive triangulation algorithm assumes that the joint coordinates from each view are independent of one another and thus contribute comparably to the triangulation. On some views, however, the 2D joint position cannot be estimated reliably (e.g., due to occlusion), leading to unsatisfactory triangulation results. This greatly exacerbates the tendency of methods that optimize the algebraic reprojection error to produce unbalanced errors in different directions. The problem can be addressed by using RANSAC together with a Huber loss (to score the reprojection errors of the inliers). However, this has drawbacks of its own: using RANSAC, for example, may completely cut off the gradient flow to the excluded cameras. To address this, we add learnable weights w_c to the coefficient matrices of the different angles:
w_j = (ω_{1,j}, ω_{2,j}, …, ω_{C,j}), giving the weighted system (w_j ∘ A_j) y_j = 0, where the ∘ operator denotes the Hadamard product and the weight ω_{c,j} is the output of a convolutional neural network.
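A minimal numpy sketch of this confidence-weighted linear triangulation, assuming the usual direct-linear-transform construction of A_j (two rows per camera, built from the 3×4 projection matrix P_c and the 2D point x_{c,j}); passing all-ones weights recovers the naive triangulation:

```python
import numpy as np

def triangulate_joint(projections, points_2d, weights=None) -> np.ndarray:
    """Weighted DLT triangulation of one joint.

    projections: list of C projection matrices, each 3x4
    points_2d:   list of C (x, y) image coordinates of the joint
    weights:     list of C per-view confidences (default: all ones)
    Returns the 3D joint position in global coordinates.
    """
    C = len(projections)
    w = np.ones(C) if weights is None else np.asarray(weights, dtype=float)
    rows = []
    for c in range(C):
        P = np.asarray(projections[c])
        x, y = points_2d[c]
        # Standard DLT rows; each view's pair of rows is scaled by its weight
        rows.append(w[c] * (x * P[2] - P[0]))
        rows.append(w[c] * (y * P[2] - P[1]))
    A = np.stack(rows)                      # shape (2C, 4)
    # Solve min ||A y|| s.t. ||y|| = 1: last right singular vector of A
    _, _, Vt = np.linalg.svd(A)
    y_h = Vt[-1]
    return y_h[:3] / y_h[3]                 # dehomogenize
```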
Overview of the triangulation method based on learned confidences: the input is a set of RGB images with known camera parameters. The 2D human recognition algorithm produces joint heat maps and per-camera joint confidences. By applying soft-argmax, the 2D position of each joint is inferred from its 2D heat map. The 2D positions are passed together with the confidences to the algebraic triangulation module, which outputs the triangulated 3D pose. All modules allow gradients to be back-propagated, so the model can be trained end to end.
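Putting the pieces together, an illustrative end-to-end forward pass over the hypothetical helpers sketched above (a backbone assumed to return per-joint heat maps and confidences) might read:

```python
import numpy as np

def forward_pass(images, projections, backbone, num_joints):
    """Illustrative multi-view forward pass: per-view heat maps ->
    soft-argmax 2D joints plus confidences -> weighted triangulation."""
    pose_3d = np.zeros((num_joints, 3))
    # backbone(image) is assumed to return (heatmaps, confidences):
    # heatmaps of shape (J, H, W) and one scalar confidence per joint
    per_view = [backbone(img) for img in images]
    for j in range(num_joints):
        pts = [soft_argmax(hm[j]) for hm, _ in per_view]
        conf = [w[j] for _, w in per_view]
        pose_3d[j] = triangulate_joint(projections, pts, conf)
    return pose_3d
```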
Regarding the application scenario of the invention: with the rapid development of modern network and computer technology, people are gradually moving into an information-based and intelligent era. Human pose recognition technology uses a computer to process, analyze, and understand an input video or image sequence, finally obtaining a high-level semantic interpretation and automatic judgment of the human pose. It has broad application and development prospects in intelligent building monitoring, moving-object analysis, virtual reality, perceptual interfaces, action recording for film and games, military target recognition, and other fields. Human pose is identified based on human skeleton features; the skeleton is a topological description of an object and is widely applied in fields such as route querying, path planning, and feature recognition. The main working object and content of the invention is to find a framework that is easy to compute.
Regarding the connection between skeleton tracking principles and our research: common skeleton tracking simply uses the picture information of a single camera and fits it directly with an ordinary CNN, so its performance depends entirely on the richness of the dataset. Because of problems such as occlusion of human limbs, we adopt multiple cameras to solve the problem of recognizing invisible limbs, and we improve the accuracy of the recognition result by using high-accuracy 2D pose estimation converted into a 3D pose through the triangulation transform.
The invention introduces two novel approaches to multi-view 3D human pose estimation based on a learnable triangulation transform, which achieve state-of-the-art performance on the Human3.6M dataset. The proposed solution greatly reduces the number of views required for high accuracy and produces smooth pose sequences on the CMU Panoptic dataset without any temporal processing, which can potentially ease the labeling of new datasets. We conjecture that the approach is robust to occlusion and partial views of a person because of its ability to reason across views when learning the person's pose. Another important advantage is that it explicitly takes the camera parameters as independent inputs. Finally, if the approximate location of the person is known, the volumetric triangulation also generalizes to monocular images, producing results approaching the state of the art.

Claims (2)

1. A 3D human motion recognition algorithm under multi-view conditions, characterized in that:
multi-view 3D pose estimation aims to obtain ground-truth annotations for monocular 3D human pose estimation; the joint 2D coordinates from all views are concatenated into one batch as input to a fully connected network trained to predict global 3D joint coordinates; the method that concatenates the 2D coordinates in the same coordinate system is called the multi-angle information aggregation method;
the multi-angle information aggregation method is a multi-angle human body coordinate system conversion method, and the specific form is algebraic triangular transformation; processing each joint j separately using a trigonometric transformation; the method is established on the triangle transformation method in the 2D coordinates, wherein the information of the human joint coordinates comes from heat maps of different angles in the action recognition frame; h c,j =h θ (I c ) j
To estimate 2D joint position information, a softmax layer on the spatial axis is first calculated:
next, the center of mass of each joint's 2D position heat map is computed as that joint's position estimate, an operation called soft-argmax;
an important feature of soft-argmax is that it does not take the index of the maximum feature, which facilitates gradient back-propagation through the heat map; the two-dimensional human recognition framework is pre-trained using a loss, the joint heat in the map is adjusted by multiplying the heat map by the inverse temperature parameter α, and soft-argmax outputs the most likely position from the beginning of training;
three-dimensional joint position information is inferred from the 2D joint position information x_{c,j} using a linear triangulation method, which reduces the search space for the 3D coordinates of joint j by solving an overdetermined system of equations on the homogeneous 3D coordinate vector y_j of joint j:
A_j y_j = 0;
where A_j is the matrix composed from the components of x_{c,j} and the camera projection matrices;
the linear triangulation method is as follows: assuming the joint coordinates of each view are independent of one another, they all contribute comparably to the triangulation; learnable weights are attached to the coefficient matrices under different angles:
w_j = (ω_{1,j}, ω_{2,j}, …, ω_{C,j}), giving the weighted system (w_j ∘ A_j) y_j = 0, where the ∘ operator denotes the Hadamard product and the weight ω_{c,j} is the output of a convolutional neural network; the input to the method is a set of RGB images with known camera parameters; the 2D human recognition algorithm generates a heat map of each joint and a confidence for each camera's joints; by applying soft-argmax, the 2D position of the joint is inferred from the 2D joint heat map, and the 2D positions and confidences are passed together to an algebraic triangulation module that outputs the triangulated 3D pose; all modules allow back-propagation of gradients, so the model can be trained end to end.
2. The 3D human motion recognition algorithm under multi-view conditions according to claim 1, characterized in that:
the deep convolutional neural network is a feedforward neural network which comprises convolutional calculation in mathematics and has a multi-layer deep structure, multidimensional data can be used as input of an input layer of the deep convolutional neural network, one-dimensional data or two-dimensional data are used as input to be transmitted to the input layer of the deep convolutional neural network, and a one-dimensional array is usually time sequence data; the two-dimensional array is mostly a gray scale map; the input layer of the convolutional neural network adopted by the invention receives the three-dimensional array of the RGB image;
the hidden layer of the deep convolutional neural network comprises a convolutional layer, a pooling layer and a full-connection layer 3 type structure; the convolution kernel in the convolution layer contains weight coefficients, the pooling layer does not contain weight coefficients, the function of the convolution layer is to perform feature extraction on input data, the convolution layer internally contains a plurality of convolution kernels, each element composing the convolution kernels corresponds to a weight coefficient and a deviation amount, and the convolution kernels are similar to neurons of a feedforward neural network; the algorithm of the convolution layer is as follows:
after feature extraction by the convolutional layer, the output feature map is passed to the pooling layer for feature selection and information filtering; the pooling layer contains a preset pooling function whose role is to replace the value at a single point of the feature map with a statistic of its neighboring region; the pooling layer selects pooling regions in the same way the convolution kernel scans the feature map, controlled by pooling size, stride, and padding; the general expression is the pooling formula given in the description;
the output layer upstream of the convolutional neural network is usually a fully-connected layer, and the structure and the working principle of the convolutional neural network are the same as those of the output layer of the traditional feedforward neural network.
CN202110280476.5A 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition Active CN114036969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110280476.5A CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110280476.5A CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Publications (2)

Publication Number Publication Date
CN114036969A CN114036969A (en) 2022-02-11
CN114036969B (en) 2023-07-25

Family

ID=80134245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110280476.5A Active CN114036969B (en) 2021-03-16 2021-03-16 3D human body action recognition algorithm under multi-view condition

Country Status (1)

Country Link
CN (1) CN114036969B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863556A (en) * 2022-04-13 2022-08-05 上海大学 Multi-neural-network fusion continuous action recognition method based on skeleton posture
CN116310217B (en) * 2023-03-15 2024-01-30 精创石溪科技(成都)有限公司 Method for dynamically evaluating muscles in human body movement based on three-dimensional digital image correlation method
CN116403288A (en) * 2023-04-28 2023-07-07 中南大学 Motion gesture recognition method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945282A (en) * 2017-12-05 2018-04-20 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) The synthesis of quick multi-view angle three-dimensional and methods of exhibiting and device based on confrontation network
CN110543581A (en) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) Multi-view three-dimensional model retrieval method based on non-local graph convolution network
CN111382300A (en) * 2020-02-11 2020-07-07 山东师范大学 Multi-view three-dimensional model retrieval method and system based on group-to-depth feature learning
CN111815757A (en) * 2019-06-29 2020-10-23 浙江大学山东工业技术研究院 Three-dimensional reconstruction method for large component based on image sequence
US10853970B1 (en) * 2019-03-22 2020-12-01 Bartec Corporation System for estimating a three dimensional pose of one or more persons in a scene

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101587962B1 (en) * 2010-12-22 2016-01-28 한국전자통신연구원 Motion capture apparatus and method
US9058663B2 (en) * 2012-04-11 2015-06-16 Disney Enterprises, Inc. Modeling human-human interactions for monocular 3D pose estimation
RU2014111793A (en) * 2014-03-27 2015-10-10 LSI Corporation Image processor with static hand pose recognition using triangulation and contour smoothing
CN106780569A (en) * 2016-11-18 2017-05-31 深圳市唯特视科技有限公司 A kind of human body attitude estimates behavior analysis method
US10824862B2 (en) * 2017-11-14 2020-11-03 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN108460338B (en) * 2018-02-02 2020-12-11 北京市商汤科技开发有限公司 Human body posture estimation method and apparatus, electronic device, storage medium, and program
CN108389227A (en) * 2018-03-01 2018-08-10 深圳市唯特视科技有限公司 A kind of dimensional posture method of estimation based on multiple view depth perceptron frame
CN109087329B (en) * 2018-07-27 2021-10-15 中山大学 Human body three-dimensional joint point estimation framework based on depth network and positioning method thereof
US11783443B2 (en) * 2019-01-22 2023-10-10 Fyusion, Inc. Extraction of standardized images from a single view or multi-view capture
EP3731185A1 (en) * 2019-04-26 2020-10-28 Tata Consultancy Services Limited Weakly supervised learning of 3d human poses from 2d poses
CA3046612A1 (en) * 2019-06-14 2020-12-14 Wrnch Inc. Method and system for monocular depth estimation of persons
US11263443B2 (en) * 2019-07-19 2022-03-01 Sri International Centimeter human skeleton pose estimation
CN110427877B (en) * 2019-08-01 2022-10-25 大连海事大学 Human body three-dimensional posture estimation method based on structural information
CN110598590A (en) * 2019-08-28 2019-12-20 清华大学 Close interaction human body posture estimation method and device based on multi-view camera
CN110766746B (en) * 2019-09-05 2022-09-06 南京理工大学 3D driver posture estimation method based on combined 2D-3D neural network
CN111523377A (en) * 2020-03-10 2020-08-11 浙江工业大学 Multi-task human body posture estimation and behavior recognition method
CN111583386B (en) * 2020-04-20 2022-07-05 清华大学 Multi-view human body posture reconstruction method based on label propagation algorithm
CN111738220B (en) * 2020-07-27 2023-09-15 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium
CN112329513A (en) * 2020-08-24 2021-02-05 苏州荷露斯科技有限公司 High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945282A (en) * 2017-12-05 2018-04-20 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) The synthesis of quick multi-view angle three-dimensional and methods of exhibiting and device based on confrontation network
US10853970B1 (en) * 2019-03-22 2020-12-01 Bartec Corporation System for estimating a three dimensional pose of one or more persons in a scene
CN111815757A (en) * 2019-06-29 2020-10-23 浙江大学山东工业技术研究院 Three-dimensional reconstruction method for large component based on image sequence
CN110543581A (en) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) Multi-view three-dimensional model retrieval method based on non-local graph convolution network
CN111382300A (en) * 2020-02-11 2020-07-07 山东师范大学 Multi-view three-dimensional model retrieval method and system based on group-to-depth feature learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ross A. Clark et al., "Three-dimensional cameras and skeleton pose tracking for physical function assessment: A review of uses, validity, current developments and Kinect alternatives", Gait & Posture, vol. 68, pp. 193-200 *

Also Published As

Publication number Publication date
CN114036969A (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN114036969B (en) 3D human body action recognition algorithm under multi-view condition
CN107204010B (en) A kind of monocular image depth estimation method and system
CN109800689B (en) Target tracking method based on space-time feature fusion learning
WO2017133009A1 (en) Method for positioning human joint using depth image of convolutional neural network
CN111968129A (en) Instant positioning and map construction system and method with semantic perception
Liu et al. Improved human action recognition approach based on two-stream convolutional neural network model
CN107871106A (en) Face detection method and device
CN107563494A (en) A kind of the first visual angle Fingertip Detection based on convolutional neural networks and thermal map
CN109190508A (en) A kind of multi-cam data fusion method based on space coordinates
CN110399809A (en) The face critical point detection method and device of multiple features fusion
CN110472542A (en) A kind of infrared image pedestrian detection method and detection system based on deep learning
CN113205595B (en) Construction method and application of 3D human body posture estimation model
Zhou et al. Learning to estimate 3d human pose from point cloud
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
CN104794737A (en) Depth-information-aided particle filter tracking method
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN114724185A (en) Light-weight multi-person posture tracking method
CN111680560A (en) Pedestrian re-identification method based on space-time characteristics
CN114689038A (en) Fruit detection positioning and orchard map construction method based on machine vision
Chen et al. Improving registration of augmented reality by incorporating DCNNS into visual SLAM
Yang et al. Human action recognition based on skeleton and convolutional neural network
Zhou et al. Mh pose: 3d human pose estimation based on high-quality heatmap
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant