CN116403243A - Multi-view multi-person 3D gesture estimation method, device and equipment - Google Patents

Multi-view multi-person 3D gesture estimation method, device and equipment Download PDF

Info

Publication number
CN116403243A
Authority
CN
China
Prior art keywords
point cloud
boundary
human body
features
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310434486.9A
Other languages
Chinese (zh)
Inventor
徐枫
王利猛
滕达
王文通
贺京杰
刘强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Petrochemical Technology
Original Assignee
Beijing Institute of Petrochemical Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Petrochemical Technology filed Critical Beijing Institute of Petrochemical Technology
Priority to CN202310434486.9A priority Critical patent/CN116403243A/en
Publication of CN116403243A publication Critical patent/CN116403243A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects
    • G06V20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The invention relates to the technical field of pose estimation, and in particular to a multi-view multi-person 3D pose estimation method, apparatus and device. The method comprises the following steps: acquiring image information and depth information of a shooting target using a multi-view acquisition technology; performing single-view person segmentation according to the image information and extracting 2D semantic features; performing point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system; performing point cloud segmentation targeting different human bodies according to the 3D point cloud and computing a point cloud boundary as a human body boundary; mapping the 2D semantic features to the 3D point cloud to obtain 3D features; and classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries. With this technical scheme, point cloud segmentation targeting different human bodies can be performed on the 3D point cloud, the joint points of different people can be accurately delimited, and different people can be distinguished and matched; at the same time, because the method is driven by discrete point cloud data, it has better flexibility and fault tolerance.

Description

Multi-view multi-person 3D pose estimation method, device and equipment
Technical Field
The invention relates to the technical field of pose estimation, and in particular to a multi-view multi-person 3D pose estimation method, device and equipment.
Background
The main task of 3D human pose estimation is to accurately predict the 3D structure of the human body in three-dimensional space, which is typically achieved by introducing depth information on top of a 2D pose estimation result. The 3D human pose is an accurate and intuitive representation and therefore has very high research value. However, real scenes are often crowded with multiple people and objects: besides the self-occlusion caused by the mutual influence of the joints within one body, there is also mutual occlusion caused by interaction with the environment. To reduce the strong influence of occlusion on pose estimation as much as possible, a common approach is to cover the scene from all directions with multiple cameras and use the information acquired by the other cameras to compensate for the information missing from the current camera. Therefore, the effective and reasonable integration of the information acquired from multiple views is a key issue in 3D multi-person pose estimation.
However, because of the occlusion problem, existing 3D multi-person pose estimation methods delimit the joint positions of different human bodies with insufficient accuracy and distinguish and match different people with insufficient precision; moreover, conventional 3D multi-person pose estimation methods rely on mathematical formulas for matching, and are therefore insufficiently flexible and have low fault tolerance.
Hence, the existing 3D multi-person pose estimation methods suffer from insufficient accuracy, insufficient flexibility and low fault tolerance.
Disclosure of Invention
In view of the above, the present invention aims to provide a multi-view multi-person 3D pose estimation method, apparatus and device, so as to solve the problems of insufficient accuracy, insufficient flexibility and low fault tolerance of 3D multi-person pose estimation methods in the prior art.
According to a first aspect of an embodiment of the present invention, there is provided a multi-view multi-person 3D pose estimation method, including:
acquiring image information and depth information of a shooting target by utilizing a multi-view acquisition technology;
performing single-view person segmentation according to the image information, and extracting 2D semantic features;
performing point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system;
performing point cloud segmentation aiming at different human bodies according to the 3D point cloud, and calculating to obtain a point cloud boundary as a human body boundary;
mapping the 2D semantic features to the 3D point cloud to obtain 3D features;
and classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries.
Preferably, performing single-view person segmentation and extracting 2D semantic features includes:
identifying detection frames of different human bodies for the image information of each visual angle;
estimating a 2D thermodynamic diagram for each detection frame using a 2D pose estimation network;
obtaining a 2D coordinate of each joint node and a probability map thereof according to the 2D thermodynamic diagram;
and obtaining 2D semantic features by using a deep convolutional neural network according to the probability map.
Preferably, performing point cloud segmentation targeting different human bodies according to the 3D point cloud and computing a point cloud boundary as a human body boundary includes:
processing the 3D point cloud using the core attention mechanism of a Transformer architecture, segmenting the point cloud with different human bodies as targets, and computing a point cloud boundary as a human body boundary.
Preferably, mapping the 2D semantic features to the 3D point cloud to obtain 3D features includes:
mapping all joint nodes to the 3D point cloud using the epipolar geometry principle, according to the 2D coordinates of each joint node at each view, the 2D semantic features, and the pre-stored camera parameter and position information, to obtain the 3D features of each view;
and fusing the joint nodes in the 3D features of all view angles to obtain the 3D features including all joint node information at the current moment and the prior connection relation of the joint nodes.
Preferably, fusing the joint nodes in the 3D features of all views includes:
fusing the joint nodes of the same human body under each view according to a preset radius threshold, using a radius-based spatial linear interpolation fusion algorithm.
Preferably, classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries includes:
dividing the information of all joint nodes according to the human body boundaries to obtain the joint nodes within each human body boundary;
and connecting the joint nodes within each human body boundary according to the prior connection relation to construct a 3D pose.
Preferably, the computing of the point cloud boundary includes:
calculating, from the segmented 3D point cloud, the surface boundary of the N clusters of point cloud within a distance X of the joint node, and taking this surface boundary as the human body boundary.
Preferably, after the 3D pose is constructed, the method further includes:
correcting the constructed 3D pose based on the principle of human inverse kinematics so that the 3D pose conforms to ergonomic constraints.
According to a second aspect of an embodiment of the present invention, there is provided a multi-view multi-person 3D pose estimation apparatus including:
the information acquisition module is used for acquiring image information and depth information of a shooting target by utilizing a multi-view acquisition technology;
the semantic feature module is used for performing single-view person segmentation according to the image information and extracting 2D semantic features;
the point cloud construction module is used for carrying out point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system;
the point cloud segmentation module is used for carrying out point cloud segmentation aiming at different human bodies according to the 3D point cloud, and calculating to obtain a point cloud boundary as a human body boundary;
the feature mapping module is used for mapping the 2D semantic features to the 3D point cloud to obtain 3D features;
and the 3D pose construction module is used for classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries.
According to a third aspect of an embodiment of the present invention, there is provided a multi-view multi-person 3D pose estimation apparatus including:
a master controller and a memory connected with the master controller;
the memory, in which program instructions are stored;
the master controller is configured to execute the program instructions stored in the memory and to perform any of the methods described above.
The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:
it can be understood that the technical scheme provided by the invention utilizes the multi-view acquisition technology to acquire the image information and depth information of the shooting target; according to the image information, character segmentation of a single view is carried out, and 2D semantic features are extracted; carrying out point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system; performing point cloud segmentation aiming at different human bodies according to the 3D point cloud, and calculating to obtain a point cloud boundary as a human body boundary; mapping the 2D semantic features to a 3D point cloud to obtain 3D features; and classifying different human bodies and constructing 3D gestures according to the 3D characteristics and the human body boundaries. According to the technical scheme, point cloud segmentation aiming at different human bodies can be carried out according to the 3D point cloud, the joint points of different people can be accurately defined, the different people can be distinguished and matched, and meanwhile, based on discrete point cloud data, the method has better flexibility and fault tolerance.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram illustrating steps of a multi-view multi-person 3D pose estimation method according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a semantic feature extraction flow according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating semantic segmentation according to an example embodiment;
FIG. 4 is a schematic diagram of a 3D pose construction flow shown in accordance with an exemplary embodiment;
fig. 5 is a schematic block diagram of a multi-view multi-person 3D pose estimation apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
At present, multi-view multi-person 3D pose estimation methods are emerging constantly. The mainstream methods below are briefly described for comparison, and the shortcomings of each are analyzed.
The first method is matching plus triangulation reconstruction: this method first obtains the 2D poses of multiple people under different views, then performs matching across frames by computing a correlation matrix for all people between adjacent frames; the correlation matrix represents the probability that people seen under different views are the same person. Finally, the 3D pose is inferred by triangulation or by the 3D pictorial structures (3DPS) method. The disadvantage of this approach is that it depends too much on the 2D detector: the accuracy of the detector directly affects the probability computation of the subsequent correlation matrix. Moreover, the correlation between people is calculated rather than data driven, so the result is relatively prone to chance.
The second method is spatial voxelization: this method first predicts keypoint heatmaps from the 2D images of multiple views, then converts the heatmaps into voxel feature volumes in 3D space and locates the center points of all people in the scene from these volumes. Finally, the positions of the other body joints are predicted from each center point. The drawback is that the precision of spatial voxelization is limited by the grid size, and the computational complexity grows with the cube of the space size. Furthermore, complex occlusion scenes cause very large disturbances to the localization of the center points.
The third method is the Transformer architecture: this method avoids complex computational pipelines, reduces the computational load, enables real-time inference, and regresses the spatial positions of the joints directly. However, the Transformer architecture has a common problem: it requires a large amount of data for training. Since the applicable datasets are limited, a self-supervised approach is preferably introduced as an improvement.
Example 1
Fig. 1 is a schematic step diagram of a multi-view multi-person 3D pose estimation method according to an exemplary embodiment, and referring to fig. 1, a multi-view multi-person 3D pose estimation method is provided, including:
step S11, acquiring image information and depth information of a shooting target by utilizing a multi-view acquisition technology;
step S12, performing single-view person segmentation according to the image information, and extracting 2D semantic features;
step S13, carrying out point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system;
step S14, performing point cloud segmentation aiming at different human bodies according to the 3D point cloud, and calculating to obtain a point cloud boundary as a human body boundary;
step S15, mapping the 2D semantic features to the 3D point cloud to obtain 3D features;
and step S16, classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries.
It can be understood that the technical solution provided in this embodiment acquires image information and depth information of a shooting target using a multi-view acquisition technology; performs single-view person segmentation according to the image information and extracts 2D semantic features; performs point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system; performs point cloud segmentation targeting different human bodies according to the 3D point cloud and computes a point cloud boundary as a human body boundary; maps the 2D semantic features to the 3D point cloud to obtain 3D features; and classifies different human bodies and constructs 3D poses according to the 3D features and the human body boundaries. With this technical solution, point cloud segmentation targeting different human bodies can be performed on the 3D point cloud, the joint points of different people can be accurately delimited, and different people can be distinguished and matched; at the same time, because the method is driven by discrete point cloud data, it has better flexibility and fault tolerance.
First, single-view 2D multi-person target detection and joint semantic feature extraction are performed, specifically as follows:
In step S12, performing single-view person segmentation and extracting 2D semantic features includes:
identifying detection frames of different human bodies for the image information of each visual angle;
estimating a 2D thermodynamic diagram for each detection frame using a 2D pose estimation network;
obtaining a 2D coordinate of each joint node and a probability map thereof according to the 2D thermodynamic diagram;
and obtaining 2D semantic features by using a deep convolutional neural network according to the probability map.
Preferably, the acquired image information may be an RGB image.
In practice, FIG. 2 is a schematic diagram of a semantic feature extraction flow according to an exemplary embodiment; see FIG. 2. Single-view person segmentation and semantic feature extraction are performed on the acquired RGB images based on existing models. In recent years, YOLO-series object detectors have been widely used as the backbone network for pose estimation because of their light weight and accurate detection. The model can be trained from source code, and its structure can be slightly modified during use to approach the ideal result more closely.
The specific method is as follows: for the RGB image of each view, a YOLO detector is used as the backbone network to identify the detection boxes of different human bodies. A 2D thermodynamic diagram (heatmap) is then estimated for each detection box using a 2D pose estimation network. The heatmap represents the coordinates of each joint node with a probability map whose resolution is usually a proportionally scaled version of the original image (typically 64×48), and the number of skeleton nodes is equal to the number of joint nodes. Finally, the semantic information of each node of the 2D pose is estimated using a DCNN (deep convolutional neural network). The 2D joint points and their semantics include, for example, head, hand-1 (the hand numbered 1), foot-1 (the foot numbered 1), and so on.
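By way of illustration only, the sketch below shows how a 2D coordinate and its probability can be read out of such a heatmap. The heatmap layout (one channel per joint, resolution 64×48) follows the description above, while the function name and the scaling back into the detection box are illustrative assumptions rather than part of the patented method.

import numpy as np

def heatmaps_to_joints(heatmaps, box_xywh):
    """Convert per-joint heatmaps (J, 64, 48) into image-space 2D coordinates.

    heatmaps : np.ndarray of shape (J, H, W), one probability map per joint node.
    box_xywh : (x, y, w, h) of the detection box in the original image.
    Returns (J, 2) pixel coordinates and (J,) confidence values.
    """
    bx, by, bw, bh = box_xywh
    J, H, W = heatmaps.shape
    coords = np.zeros((J, 2), dtype=np.float32)
    conf = np.zeros(J, dtype=np.float32)
    for j in range(J):
        flat = np.argmax(heatmaps[j])           # peak of the probability map
        v, u = np.unravel_index(flat, (H, W))   # row (y) and column (x) in heatmap space
        coords[j, 0] = bx + (u + 0.5) / W * bw  # scale back into the detection box
        coords[j, 1] = by + (v + 0.5) / H * bh
        conf[j] = heatmaps[j, v, u]
    return coords, conf

# Toy usage: a single fake heatmap peaked at one pixel.
hm = np.zeros((1, 64, 48), dtype=np.float32)
hm[0, 30, 20] = 1.0
print(heatmaps_to_joints(hm, (100, 50, 96, 128)))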
Next, multi-view 3D point cloud fusion and human instance segmentation are performed, specifically as follows:
in step S13, according to the depth information, a point cloud reconstruction is performed to obtain a 3D point cloud under a preset world coordinate system.
In step S14, performing point cloud segmentation targeting different human bodies according to the 3D point cloud and computing a point cloud boundary as a human body boundary includes:
processing the 3D point cloud using the core attention mechanism of a Transformer architecture, segmenting the point cloud with different human bodies as targets, and computing a point cloud boundary as a human body boundary.
In practice, to address the person matching problem, point cloud reconstruction is performed on the depth information acquired from multiple views to build a 3D point cloud, and the point cloud is then segmented with different human bodies as targets. The purpose is to find the boundaries of the different body models as confidence intervals for the body poses. In this way, matching the same person according to appearance features and a shape-feature network is avoided, and the different people are represented by the segmented point clouds.
FIG. 3 is a flowchart illustrating semantic segmentation according to an exemplary embodiment; see FIG. 3. For the depth image of each view, the depth information of the multiple views is mapped and converted into a 3D point cloud in the world coordinate system. Preferably, the point cloud can be preprocessed, including point cloud fusion, outlier rejection and similar operations. The most important step is how to segment human instances in the point cloud. Mainstream point cloud segmentation networks such as PointNet and DGCNN produce indistinct segmentation edges; if they were used as the point cloud segmentation model of this embodiment, the human instances might not be segmented clearly, and the computed point cloud boundaries would naturally be poor. Because point cloud data is irregular and unordered, point clouds could not previously be processed directly by convolutional neural networks, which is very inconvenient when deep learning is to be applied to point cloud tasks. This embodiment uses a Transformer architecture, mainly its core attention mechanism, to process the point cloud. Point cloud processing requires an operator that is permutation invariant and does not depend on the connectivity between points; the attention mechanism itself is exactly such an operator. Human point cloud instance segmentation using a Transformer architecture may perform better than conventional segmentation networks. Human body point cloud segmentation is therefore performed based on a Transformer architecture.
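To make the depth-to-point-cloud step concrete, the minimal sketch below back-projects a depth image into the preset world coordinate system using pinhole intrinsics K and a camera-to-world pose (R, t). This is a standard back-projection under assumed parameter names (K, R, t, depth_scale); the patent does not prescribe a particular implementation, and the Transformer-based instance segmentation itself is not reproduced here.

import numpy as np

def depth_to_world_points(depth, K, R, t, depth_scale=1.0):
    """Back-project a depth map (H, W) into 3D points in world coordinates.

    K : (3, 3) camera intrinsics; R, t : camera-to-world rotation and translation.
    Returns an (N, 3) array of world-space points for valid depth pixels.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.astype(np.float32) * depth_scale
    valid = z > 0
    x = (us - cx) / fx * z                    # camera-space X
    y = (vs - cy) / fy * z                    # camera-space Y
    pts_cam = np.stack([x[valid], y[valid], z[valid]], axis=1)
    return pts_cam @ R.T + t                  # transform into the world coordinate system

# Point clouds from all views land in one world frame and can then be concatenated,
# e.g. cloud = np.concatenate([depth_to_world_points(d_i, K_i, R_i, t_i) for each view]).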
Further, 2D feature mapping and human skeleton reconstruction in a 3D point cloud boundary are performed, specifically as follows:
in step S15, mapping the 2D semantic feature to the 3D point cloud to obtain a 3D feature includes:
according to the 2D coordinates, the 2D semantic features and the pre-stored camera parameter position information of each joint node of each view angle, mapping all joint nodes to the 3D point cloud by utilizing a epipolar geometry principle to obtain the 3D features of each view angle;
and fusing the joint nodes in the 3D features of all view angles to obtain the 3D features including all joint node information at the current moment and the prior connection relation of the joint nodes.
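One simple way to realize the per-view lifting just described, sketched below, is to cast the camera ray through each 2D joint and pick the reconstructed point cloud point closest to that ray. The ray-versus-cloud test is an illustrative assumption standing in for the epipolar-geometry mapping of this step, and the parameter names (K, R, t, max_dist) are not taken from the patent.

import numpy as np

def lift_joint_to_cloud(uv, K, R, t, cloud, max_dist=0.05):
    """Map a 2D joint (u, v) onto the 3D point cloud of the scene.

    K, R, t : intrinsics and camera-to-world pose of this view.
    cloud   : (N, 3) array of world-space points.
    Returns the cloud point nearest to the joint's viewing ray, or None.
    """
    # Viewing ray in world coordinates: origin at the camera centre, direction from (u, v).
    d_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    d = R @ (d_cam / np.linalg.norm(d_cam))
    o = np.asarray(t, dtype=float)
    rel = cloud - o
    along = rel @ d                                            # distance along the ray
    perp = np.linalg.norm(rel - np.outer(along, d), axis=1)    # distance to the ray
    ok = (along > 0) & (perp < max_dist)
    if not ok.any():
        return None
    return cloud[ok][np.argmin(along[ok])]                     # nearest valid point along the ray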
In step S16, the classifying and 3D pose construction of different human bodies according to the 3D features and the human body boundaries includes:
dividing the information of all joint nodes according to the human body boundaries to obtain joint nodes in each human body boundary;
and connecting the joint nodes within each human body boundary according to the prior connection relation to construct a 3D pose.
In practice, the 2D features are mapped into the 3D point cloud space to obtain 3D features, and different human bodies are classified and their 3D skeletons reconstructed according to the 3D features and the point cloud boundaries. The segmented point cloud boundary represents the human body boundary, the features contain the joint point information at the current moment, and the 3D pose is obtained by connecting, inside each point cloud boundary, all the 3D feature points that carry the prior connection relations.
It should be noted that fusing the joint nodes in the 3D features of all views includes:
fusing the joint nodes of the same human body under each view according to a preset radius threshold, using a radius-based spatial linear interpolation fusion algorithm.
FIG. 4 is a schematic diagram of a 3D pose construction flow according to an exemplary embodiment; see FIG. 4. Denote view i as V_i, person j as P_j, and 3D joint point k as G_k. Because of the limited shooting range, P_j may or may not appear under V_i, and considering occlusion, some G_k of P_j may be missing. Both effects cause the number of G_k in the final point cloud space to be less than N×M (N is the number of persons P_j, M is the number of cameras). Only in the ideal case, i.e. when no person is occluded and every person appears in every camera view, does the number of G_k equal N×M. First, the G_k of the same person seen from multiple views are fused according to proximity. Assuming that after epipolar mapping the G_k of the same person are clearly closer to each other than the G_k of different people, and inspired by the radius outlier rejection algorithm, a radius threshold R can be preset: for the G_k of P_j at each view, a sphere of radius R is defined, and the nodes with the same semantics inside the sphere are fused (for example by spatial linear interpolation). With a suitable threshold R, good results can be obtained. The G_k under all views V_i are fused by radius-based spatial linear interpolation to generate G'_k:
G'_k = f(V_i, G_k), 0 < i ≤ M
where f is the radius-based linear interpolation fusion algorithm and M is the number of cameras.
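A minimal sketch of such a radius-based fusion f is given below. The per-view joint candidates are assumed to carry a semantic label and a confidence (from the probability map), and nodes with the same semantics that fall inside a sphere of radius R are merged by confidence-weighted linear interpolation; the exact weighting is an assumption, since the text above only requires spatial linear interpolation within the radius threshold.

import numpy as np

def fuse_joints_by_radius(candidates, radius=0.15):
    """Fuse per-view 3D joint candidates G_k into G'_k.

    candidates : list of (label, xyz, confidence) tuples gathered over all views V_i.
    radius     : preset radius threshold R (metres assumed).
    Returns a list of (label, fused_xyz) tuples.
    """
    by_label = {}
    for label, xyz, conf in candidates:
        by_label.setdefault(label, []).append((np.asarray(xyz, dtype=float), conf))
    fused = []
    for label, items in by_label.items():
        used = [False] * len(items)
        for i, (p, _) in enumerate(items):
            if used[i]:
                continue
            # collect all same-semantic nodes inside the sphere of radius R around p
            group = [j for j, (q, _) in enumerate(items)
                     if not used[j] and np.linalg.norm(q - p) <= radius]
            pts = np.array([items[j][0] for j in group])
            w = np.array([items[j][1] for j in group], dtype=float)
            w = w / w.sum() if w.sum() > 0 else np.full(len(group), 1.0 / len(group))
            fused.append((label, (pts * w[:, None]).sum(axis=0)))  # weighted interpolation
            for j in group:
                used[j] = True
    return fused

# Toy usage: two views of the same head, plus one distant hand.
cands = [("head", (0.00, 0.0, 1.70), 0.9), ("head", (0.02, 0.0, 1.71), 0.8),
         ("hand-1", (0.50, 0.1, 1.20), 0.7)]
print(fuse_joints_by_radius(cands, radius=0.15))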
At this point the 3D joints G'_k carry only semantic information, i.e. only labels such as head, neck and hand. Next, the G'_k belonging to different people must be separated. A point cloud boundary is computed from the segmented 3D point cloud (using the BPA algorithm or Poisson reconstruction), and the G'_k inside a given boundary belong to the corresponding person P_j. Finally, the joint points are connected according to their semantic relations (head to neck, hand to elbow, and so on) to obtain the 3D pose (3D skeleton).
It should be noted that the calculating to obtain the point cloud boundary includes:
calculating, from the segmented 3D point cloud, the surface boundary of the N clusters of point cloud within a distance X of the joint node, and taking this surface boundary as the human body boundary.
In practice, since computing a complete point cloud boundary consumes a large amount of computation and not all boundaries are useful, the boundary computation can be optimized by reconstructing only the N clusters of segmented point cloud near each G'_k (the number of point cloud clusters kept per node is specified as N, and each cluster consists of the n points of the segmented cloud nearest to G'_k), and then determining whether G'_k lies inside or outside the resulting surface, which decides which person P_j the joint G'_k belongs to.
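The effect of keeping only the segmented points nearest to each fused joint can be illustrated with a simple proximity test. In the sketch below, each fused joint G'_k is assigned to the person whose segmented point cloud has the smallest mean distance over the joint's n nearest points; this nearest-cluster vote is a simplification for illustration, whereas the text above describes testing whether the joint lies inside the surface boundary reconstructed from those clusters (e.g. by BPA or Poisson reconstruction).

import numpy as np

def assign_joints_to_persons(fused_joints, person_clouds, n_nearest=50):
    """Assign each fused joint G'_k to a person P_j via its local point cloud neighbourhood.

    fused_joints  : list of (label, xyz) pairs from the radius fusion step.
    person_clouds : dict person_id -> (Ni, 3) array of segmented world-space points.
    Returns dict person_id -> list of (label, xyz).
    """
    skeletons = {pid: [] for pid in person_clouds}
    for label, xyz in fused_joints:
        xyz = np.asarray(xyz, dtype=float)
        best_pid, best_dist = None, np.inf
        for pid, pts in person_clouds.items():
            d = np.linalg.norm(pts - xyz, axis=1)
            # mean distance of the n nearest segmented points stands in for the
            # inside/outside surface test described in the text
            local = np.sort(d)[:min(n_nearest, len(d))].mean()
            if local < best_dist:
                best_pid, best_dist = pid, local
        skeletons[best_pid].append((label, xyz))
    return skeletons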
It should be noted that, after the 3D pose is constructed, the method further includes:
correcting the constructed 3D pose based on the principle of human inverse kinematics so that the 3D pose conforms to ergonomic constraints.
After the 3D pose has been constructed, it can be further refined by introducing the principle of human inverse kinematics to correct it. The 3D skeleton can be parameterized using human inverse kinematics, and the skeleton parameters can be optimized according to the corresponding formulas, so that the final skeleton model better conforms to ergonomic principles and looks more intuitive and natural.
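As a minimal illustration of this kind of correction, the sketch below enforces one simple ergonomic constraint, fixed bone lengths taken from a prior skeleton, by rescaling each bone along its current direction. This is only an illustrative stand-in: the text above refers generally to parameterizing the skeleton with inverse kinematics and optimizing its parameters, without fixing a specific formula.

import numpy as np

def enforce_bone_lengths(joints, bones, target_lengths):
    """Rescale each (parent, child) bone to a prior target length.

    joints         : dict label -> (3,) joint position from the skeleton reconstruction.
    bones          : list of (parent_label, child_label) pairs (the prior connection relation).
    target_lengths : dict (parent, child) -> length in metres (assumed prior skeleton).
    """
    corrected = {k: np.asarray(v, dtype=float) for k, v in joints.items()}
    for parent, child in bones:                       # bones assumed ordered root to leaf
        v = corrected[child] - corrected[parent]
        norm = np.linalg.norm(v)
        if norm > 1e-8:
            corrected[child] = corrected[parent] + v / norm * target_lengths[(parent, child)]
    return corrected

# Toy usage: pull a too-long forearm back to 0.25 m.
j = {"elbow-1": (0.0, 0.0, 1.3), "hand-1": (0.6, 0.0, 1.3)}
print(enforce_bone_lengths(j, [("elbow-1", "hand-1")], {("elbow-1", "hand-1"): 0.25}))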
The greatest challenge in multi-view multi-person 3D pose estimation is the multi-person matching problem: the joint information lost when people occlude each other because of the shooting angle must be reasonably compensated. For this core problem, the invention uses a point cloud instance segmentation algorithm and delimits the joint points of different people with point cloud boundaries, thereby distinguishing and matching different people. Compared with traditional matching methods that rely on mathematical formulas, this approach, driven by discrete point cloud data, has greater flexibility and fault tolerance. In addition, the optimization of the point cloud boundary computation deliberately reduces the required computation and guarantees real-time performance to a certain extent. The method does not require additional model training and does not depend on large datasets; it can directly use the latest object detectors as the raw input of the whole pipeline and solves the multi-person matching problem through the algorithm described above.
The invention uses a bottom-up framework with a simple pipeline that is easy to run inference with. With multiple views, the depth information of the other views compensates when one view is occluded. The invention only uses the point cloud to delimit a boundary and obtain a prior, rather than using the point cloud boundary directly as the final output, so occlusion is handled relatively well. In the last step, the semantic information of the 3D joint points is used to reconstruct the three-dimensional human skeleton connections directly, which is very fast. To output a proper skeleton pose, a physical model is introduced to correct it, making it conform better to ergonomic principles.
Example two
Fig. 5 is a schematic block diagram of a multi-view multi-person 3D pose estimation apparatus according to an exemplary embodiment, referring to fig. 5, there is provided a multi-view multi-person 3D pose estimation apparatus including:
an information acquisition module 101, configured to acquire image information and depth information of a shooting target by using a multi-view acquisition technology;
the semantic feature module 102 is configured to perform single-view person segmentation according to the image information, and extract 2D semantic features;
the point cloud construction module 103 is configured to perform point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system;
the point cloud segmentation module 104 is configured to perform point cloud segmentation for different human bodies according to the 3D point cloud, and calculate a point cloud boundary as a human body boundary;
a feature mapping module 105, configured to map the 2D semantic feature to the 3D point cloud, to obtain a 3D feature;
and the 3D pose construction module 106 is configured to classify different human bodies and construct a 3D pose according to the 3D features and the human body boundary.
It can be understood that, in the technical solution provided in this embodiment, the information acquisition module 101 acquires image information and depth information of a shooting target using a multi-view acquisition technology; the semantic feature module 102 performs single-view person segmentation according to the image information and extracts 2D semantic features; the point cloud construction module 103 performs point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system; the point cloud segmentation module 104 performs point cloud segmentation targeting different human bodies according to the 3D point cloud and computes a point cloud boundary as a human body boundary; the feature mapping module 105 maps the 2D semantic features to the 3D point cloud to obtain 3D features; and the 3D pose construction module 106 classifies different human bodies and constructs 3D poses according to the 3D features and the human body boundaries. With this technical solution, point cloud segmentation targeting different human bodies can be performed on the 3D point cloud, the joint points of different people can be accurately delimited, and different people can be distinguished and matched; at the same time, because the method is driven by discrete point cloud data, it has better flexibility and fault tolerance.
Example III
Provided is a multi-view multi-person 3D pose estimation device including:
a master controller and a memory connected with the master controller;
the memory, in which program instructions are stored;
the master controller is configured to execute the program instructions stored in the memory and to perform any of the methods described above.
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
It should be noted that in the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. A multi-view multi-person 3D pose estimation method, comprising:
acquiring image information and depth information of a shooting target by utilizing a multi-view acquisition technology;
performing single-view person segmentation according to the image information, and extracting 2D semantic features;
performing point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system;
performing point cloud segmentation aiming at different human bodies according to the 3D point cloud, and calculating to obtain a point cloud boundary as a human body boundary;
mapping the 2D semantic features to the 3D point cloud to obtain 3D features;
and classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries.
2. The method of claim 1, wherein performing single-view person segmentation and extracting 2D semantic features comprises:
identifying detection frames of different human bodies for the image information of each visual angle;
estimating a 2D thermodynamic diagram for each detection frame using a 2D pose estimation network;
obtaining a 2D coordinate of each joint node and a probability map thereof according to the 2D thermodynamic diagram;
and obtaining 2D semantic features by using a deep convolutional neural network according to the probability map.
3. The method according to claim 2, wherein the performing point cloud segmentation for different human bodies according to the 3D point cloud, and calculating a point cloud boundary as a human body boundary includes:
processing the 3D point cloud using the core attention mechanism of a Transformer architecture, segmenting the point cloud with different human bodies as targets, and computing a point cloud boundary as a human body boundary.
4. The method according to claim 3, wherein mapping the 2D semantic features to the 3D point cloud to obtain 3D features comprises:
mapping all joint nodes to the 3D point cloud using the epipolar geometry principle, according to the 2D coordinates of each joint node at each view, the 2D semantic features, and the pre-stored camera parameter and position information, to obtain the 3D features of each view;
and fusing the joint nodes in the 3D features of all view angles to obtain the 3D features including all joint node information at the current moment and the prior connection relation of the joint nodes.
5. The method of claim 4, wherein fusing joint nodes in the 3D features of all perspectives comprises:
fusing the joint nodes of the same human body under each view according to a preset radius threshold, using a radius-based spatial linear interpolation fusion algorithm.
6. The method of claim 4, wherein classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries comprises:
dividing the information of all joint nodes according to the human body boundaries to obtain the joint nodes within each human body boundary;
and connecting the joint nodes within each human body boundary according to the prior connection relation to construct a 3D pose.
7. The method of claim 5, wherein the computing of the point cloud boundary comprises:
calculating, from the segmented 3D point cloud, the surface boundary of the N clusters of point cloud within a distance X of the joint node, and taking this surface boundary as the human body boundary.
8. The method of claim 1, further comprising, after the 3D pose construction:
correcting the constructed 3D pose based on the principle of human inverse kinematics so that the 3D pose conforms to ergonomic constraints.
9. A multi-view multi-person 3D pose estimation device, comprising:
the information acquisition module is used for acquiring image information and depth information of a shooting target by utilizing a multi-view acquisition technology;
the semantic feature module is used for performing single-view person segmentation according to the image information and extracting 2D semantic features;
the point cloud construction module is used for carrying out point cloud reconstruction according to the depth information to obtain a 3D point cloud under a preset world coordinate system;
the point cloud segmentation module is used for carrying out point cloud segmentation aiming at different human bodies according to the 3D point cloud, and calculating to obtain a point cloud boundary as a human body boundary;
the feature mapping module is used for mapping the 2D semantic features to the 3D point cloud to obtain 3D features;
and the 3D pose construction module is used for classifying different human bodies and constructing 3D poses according to the 3D features and the human body boundaries.
10. A multi-view multi-person 3D pose estimation apparatus, comprising:
a master controller and a memory connected with the master controller;
the memory, in which program instructions are stored;
the master controller is configured to execute the program instructions stored in the memory and to perform the method of any one of claims 1 to 8.
CN202310434486.9A 2023-04-21 2023-04-21 Multi-view multi-person 3D gesture estimation method, device and equipment Pending CN116403243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310434486.9A CN116403243A (en) 2023-04-21 2023-04-21 Multi-view multi-person 3D gesture estimation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310434486.9A CN116403243A (en) 2023-04-21 2023-04-21 Multi-view multi-person 3D gesture estimation method, device and equipment

Publications (1)

Publication Number Publication Date
CN116403243A true CN116403243A (en) 2023-07-07

Family

ID=87012246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310434486.9A Pending CN116403243A (en) 2023-04-21 2023-04-21 Multi-view multi-person 3D gesture estimation method, device and equipment

Country Status (1)

Country Link
CN (1) CN116403243A (en)

Similar Documents

Publication Publication Date Title
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN106803267B (en) Kinect-based indoor scene three-dimensional reconstruction method
Li et al. Monocular real-time volumetric performance capture
Zheng et al. Hybridfusion: Real-time performance capture using a single depth sensor and sparse imus
CN107392964B (en) The indoor SLAM method combined based on indoor characteristic point and structure lines
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
Joo et al. Panoptic studio: A massively multiview system for social motion capture
Stoll et al. Fast articulated motion tracking using a sums of gaussians body model
CN109544677A (en) Indoor scene main structure method for reconstructing and system based on depth image key frame
CN112991413A (en) Self-supervision depth estimation method and system
CN111160164A (en) Action recognition method based on human body skeleton and image fusion
CN109829972B (en) Three-dimensional human standard skeleton extraction method for continuous frame point cloud
Krejov et al. Combining discriminative and model based approaches for hand pose estimation
WO2021063271A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
Alexiadis et al. Fast deformable model-based human performance capture and FVV using consumer-grade RGB-D sensors
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
Chen et al. 3D reconstruction of unstructured objects using information from multiple sensors
Pham et al. Robust real-time performance-driven 3D face tracking
Xu et al. 3D joints estimation of the human body in single-frame point cloud
CN113255514B (en) Behavior identification method based on local scene perception graph convolutional network
Dong et al. Learning stratified 3D reconstruction
CN116403243A (en) Multi-view multi-person 3D gesture estimation method, device and equipment
CN114663917A (en) Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device
Zhang et al. Object detection based on deep learning and b-spline level set in color images
Jian et al. Realistic face animation generation from videos

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination