CN115131504A - Multi-person three-dimensional reconstruction method under wide-field-of-view large scene - Google Patents
- Publication number
- CN115131504A (application number CN202210778162.2A)
- Authority
- CN
- China
- Prior art keywords
- person
- scene
- image
- ground
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T7/00—Image analysis; G06T7/70—Determining position or orientation of objects or cameras
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING; G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data; G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Abstract
The invention discloses a multi-person three-dimensional reconstruction method under a wide-field-of-view large scene, in the technical field of three-dimensional vision. The method is built on an end-to-end single-image multi-person three-dimensional reconstruction framework for large scenes. For gigapixel-level large-scene images, a human-centered, scale-adaptive hierarchical representation scheme is designed; scene-level camera intrinsics and a common ground plane are estimated from the 2D joint points; a ground-guided progressive localization method is proposed that converts scene-level global 3D localization into local 2D localization plus a 3D offset, achieving accurate global spatial localization of multiple people in the scene and resolving the depth ambiguity inherent in monocular color-camera capture; several branch networks produce the SMPL parameters, 2D positions, and 3D offsets required for estimating human shape and position; and scene-level fine-tuning at test time effectively improves the accuracy of position prediction for people in new scenes.
Description
Technical Field
The invention belongs to the technical field of three-dimensional vision, and relates to a multi-person three-dimensional reconstruction method under a wide-field-of-view large scene.
Background
Three-dimensional human reconstruction refers to recovering human pose and shape from an input image or video; the resulting geometric and motion information has wide application in games and film. With advances in deep learning, computer vision, and computer graphics, image-based three-dimensional human reconstruction has become a research hotspot in computer vision. Because 3D annotation is difficult, datasets for multi-person three-dimensional reconstruction are mostly produced by data synthesis or captured in laboratory environments. Existing research is therefore based on small- and medium-scale scene datasets and lacks analysis of wide-field-of-view, large-scene data. Such data provides rich spatial information and is closer to real-world scenes. Monocular multi-person reconstruction in a large scene, especially one containing hundreds of people, benefits scene understanding and crowd analysis. Besides a person's three-dimensional pose and shape, their precise three-dimensional spatial position is also critical for analyzing interpersonal relationships and crowd behavior.
Although wide-field-of-view large-scene datasets are closer to real-world scenes, existing multi-person reconstruction methods are based on small- and medium-scale scene datasets. Choi et al. (Choi H, Moon G, Park J, et al. Learning to estimate robust 3D human mesh from in-the-wild crowded scenes [C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022) propose the two-stage method 3DCrowdNet, which uses two-dimensional poses to distinguish different people and a joint-based regressor to estimate body model parameters; it focuses on the accuracy of pose and shape but ignores a person's three-dimensional spatial position. To obtain consistent multi-person reconstruction results, Jiang et al. (Jiang W, Kolotouros N, Pavlakos G, et al. Coherent Reconstruction of Multiple Humans from a Single Image [C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020) propose CRMH, built on Faster R-CNN (Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [J]. Advances in Neural Information Processing Systems (NeurIPS), 2015): it first detects people in the image and then regresses their SMPL parameters, obtaining relatively accurate positional relationships during training through interpenetration and depth-ordering losses. However, this method computes a person's depth under the assumption that all people have the same height, which overestimates the depth of short people. To resolve the inherent ambiguity between height and depth, Ungrinovic et al. (Ungrinovic N, Ruiz A, Agudo A, et al. Body Size and Depth Disambiguation in Multi-Person Reconstruction from Single Images [C]. International Conference on 3D Vision (3DV), 2021) propose a multi-stage optimization-based approach that optimizes the scale and 3D translation of the body meshes estimated by CRMH. Since multi-stage approaches introduce computational redundancy, Zhang et al. (Zhang J, Yu D, Liew J H, et al. Body meshes as points [C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021) propose the single-stage method BMP, which associates human depth with features at different scales. Sun et al. (Sun Y, Bao Q, Liu W, et al. Monocular, one-stage, regression of multiple 3D people [C]. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021) propose ROMP, which extracts camera information and SMPL information from a body-center feature map; it relies on a weak-perspective projection assumption and can only infer a person's two-dimensional position on the image. To further address localization, Sun et al. (Sun Y, Liu W, Bao Q, et al. Putting people in their place: Monocular regression of 3D people in depth [C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022) propose BEV, which uses a bird's-eye-view representation to simultaneously infer body centers and depths in the image. However, all of the above methods recover only relative depth rather than absolute position, and cannot be applied directly to large scenes.
To address these problems, the invention proposes an end-to-end single-image multi-person reconstruction framework for large scenes. For gigapixel-level large-scene images, a human-centered, scale-adaptive hierarchical representation scheme is designed and a joint global-local representation model is constructed, resolving the depth ambiguity of monocular color-camera capture and achieving globally spatially consistent multi-person pose and shape reconstruction. A ground-guided progressive localization method is proposed: by estimating scene-level camera parameters and a common ground plane, scene-level global 3D localization is converted into local 2D localization plus a 3D offset, achieving accurate global spatial localization of multiple people in the scene. Scene-level fine-tuning at test time effectively improves the accuracy of position prediction for people in new scenes.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the purpose of the invention is: aiming at the problem that the existing method can not obtain a multi-person reconstruction result with depth ordering consistency on a wide-field large-scene data set, an end-to-end large-scene single-image multi-person reconstruction frame is provided, aiming at billion-pixel-level large-scene images, a scale self-adaptive hierarchical representation scheme with a man-made center is designed, a global and local combined representation model is constructed, the problem of depth ambiguity under single-color camera acquisition is solved, and multi-person posture and shape reconstruction with consistent global space is realized; a progressive positioning method of ground guidance is provided, global 3D positioning of scene level is converted into local 2D positioning and 3D offset by estimating scene level camera parameters and public ground, and accurate global space positioning of multiple persons in a scene is realized; and scene-level fine adjustment is performed in the testing stage, so that the position prediction precision of people in a new scene is effectively improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-person three-dimensional reconstruction method under a wide-field-of-view large scene comprises the following steps:
s1, preprocessing the large scene image, obtaining the cutting image with different resolution by the self-adaptive hierarchical representation with human as the center, so that the human occupies a proper proportion in the cutting image, and scaling the cutting image to a uniform size for training the network on the basis of keeping the original length-width ratio of the image;
s2, estimating 2D joint points of the large scene image by using the existing 2D joint point estimation method, correcting the 2D joint points which are estimated wrongly or lacked by using a manual correction method, and estimating a ground equation and camera internal parameters by using the 2D joint points;
s3, training a network by using the cut image obtained by preprocessing in S1, wherein the network realizes feature extraction through a backbone network, and then three different branch networks are used for respectively carrying out human body detection, 2D position estimation and 3D offset and human body parameter model estimation;
s4, obtaining a rough 3D position of the human body by using the camera internal reference and the ground equation obtained in S2 and based on the 2D position obtained in S3 through a ground-guided progressive positioning method, and obtaining an accurate 3D position of the human body by combining the 3D offset obtained in S3;
s5, performing scene-level fine adjustment on the model in the testing stage, and performing multi-person reconstruction on a new scene image to obtain a better 2D projection result;
and S6, combining the multi-person reconstruction results of all the cut images, and removing the persons repeatedly estimated to obtain a multi-person reconstruction result with consistent global space under a wide field of view and a large scene.
Preferably, the human-centered adaptive hierarchical representation in the preprocessing of S1 mainly includes the following steps:
S101, define the heights of the smallest and largest person in the large-scene image as h_min and h_max, respectively, and the upper and lower boundaries of the cropping region as s and e. The large-scene image is cropped with square sliding windows, the i-th sliding window along the y direction having side length c_i. So that a person occupies half the height of a cropped image, the first window has c_1 = 2 × h_min, and the last (n-th) window along the y direction has c_n = c_1 × q^(n-1), where q is a proportionality coefficient. To ensure that every person appears complete in some cropped image, an overlapping sliding window is inserted between each pair of adjacent windows along the y direction; its side length is half the sum of the side lengths of the adjacent windows;
S102, while keeping the original aspect ratio of the cropped images of different resolutions, unify them to (512, 512) by bicubic interpolation and pad the remaining area with 0.
Preferably, the estimation of the ground equation and the camera intrinsics in S2 mainly includes the following steps:
S201, estimate the 2D joint points of each cropped image with the RMPE method, manually correct joint points that are wrongly estimated or missing, and merge the results to obtain the 2D joint-point information of the large-scene image; filter the poses according to prior information, keeping only standing poses;
S202, use a pinhole camera model with focal length f (f = f_x = f_y) and principal point at the image center; the ground equation is N^T·P_G + D = 0, where N ∈ R^3 is the ground normal with ‖N‖_2 = 1, D is a constant term reflecting the position of the ground, and P_G ∈ R^3 is a point on the ground;
S203, define the midpoint of the left and right ankle points as X_b, with projection x_b = (u_b, v_b) on the image, and the midpoint of the left and right shoulders as X_t, with projection x_t = (u_t, v_t); assume that X_b is a point on the ground, that the person standing there has a fixed height h, and that the line through X_b and X_t is parallel to the ground normal;
S204, by the pinhole imaging model, Z_b·x̃_b = K·X_b, where x̃_b is the homogeneous coordinate of x_b, K is the camera intrinsic matrix, and Z_b is the depth of X_b; since X_b lies on the ground and satisfies N^T·X_b + D = 0, we obtain Z_b = -D / (N^T·K^(-1)·x̃_b); the projected midpoint x̃_t of the left and right shoulders then follows from X_t = X_b + h·N as Z_t·x̃_t = K·(X_b + h·N), where Z_t is the depth of X_t;
S205, solve for the camera intrinsics and the ground equation with an optimization-based method; the loss for the i-th person combines a cosine-distance term between the predicted and observed 2D directions from ankle midpoint to shoulder midpoint with a length term on the same segment, where L_cosine denotes the cosine distance and λ_angle, λ_length are the weights of the respective loss terms;
and S206, translate the obtained ground plane by 0.1 meter along its normal, so as to obtain the true ground rather than the plane of the ankles.
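The depth formula of S204 can be verified numerically: projecting a known ground point, then recovering its depth from the plane equation alone. The camera and plane values below are illustrative assumptions, not parameters from the patent.

```python
import numpy as np

def intrinsics(f, cx, cy):
    """Pinhole intrinsic matrix with f = f_x = f_y and principal point (cx, cy)."""
    return np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])

def depth_on_ground(x_b, K, N, D):
    """Depth Z_b of a ground point from its pixel x_b (S204).

    X_b = Z_b * K^{-1} * x̃_b and N^T X_b + D = 0 together give
    Z_b = -D / (N^T K^{-1} x̃_b).
    """
    x_h = np.array([x_b[0], x_b[1], 1.0])
    return -D / (N @ np.linalg.solve(K, x_h))

# Synthetic check: camera 1.6 m above a ground plane with normal (0, -1, 0),
# i.e. N^T P + D = -y + 1.6 = 0, and a ground point X_b = (0.5, 1.6, 5.0).
K = intrinsics(f=1000.0, cx=512.0, cy=512.0)
N, D = np.array([0.0, -1.0, 0.0]), 1.6
X_b = np.array([0.5, 1.6, 5.0])
x_b = (K @ X_b)[:2] / X_b[2]          # project to pixel coordinates
print(depth_on_ground(x_b, K, N, D))  # recovers Z_b = 5.0
```

This is the key property the method exploits: once N, D, and K are known, one 2D ground pixel determines a full 3D position with no depth ambiguity.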
Preferably, the specific implementation of S3 is as follows:
S301, extract features from the input image through a backbone network and feed them to three different branch networks, each consisting of two ResNet blocks with batch normalization;
S302, the first branch network produces a human-body-center feature map, using a Gaussian kernel whose scale follows the body size to represent the likelihood of a person's center position in the feature map;
S303, the second branch network produces a 2D-position feature map, estimating the 2D coordinates of the left and right ankle points and a 2D offset; the sum of the ankle midpoint and the 2D offset is the required 2D position;
S304, the third branch network produces an SMPL-and-offset feature map, estimating the SMPL pose and shape parameters and the 3D offset;
S305, according to the positions obtained from the human-body-center feature map, extract the corresponding entries from the 2D-position feature map and the SMPL-and-offset feature map, yielding the 2D position, SMPL parameters, and 3D offset required to estimate the person's position and pose;
and S306, first train the human-body-center feature map and the 2D-position feature map so that subsequently learned human meshes have suitable initial positions; after 20 iterations, train the entire network for 70 iterations.
Preferably, the ground-guided progressive localization method described in S4 mainly includes the following steps:
S401, define the projection of the human-body center onto the ground as the foot point P, whose projections onto the large-scene image and the cropped image are p and p_local, respectively. p_local is the 2D position obtained from the 2D-position feature map, and p = p_local + t_p, where t_p is the pixel position of the top-left corner of the cropped image within the large-scene image. Since P is a point on the ground, the rough 3D position P can be computed from the predicted camera parameters and ground equation as P = Z_p·K^(-1)·p̃ with Z_p = -D / (N^T·K^(-1)·p̃), where p̃ is the homogeneous coordinate of p;
S402, obtain the accurate 3D position through the 3D offset Δ_3d: the translation is T_3D = -mean(J_ankle) + Δ_3d + P, where J_ankle denotes the 3D coordinates of the left and right ankles; the mesh in camera coordinates is then computed as
M_camera = M + T_3D,
where M is the human-body mesh in the SMPL canonical space;
and S403, apply an L1 constraint to the mesh vertices that penetrate the ground most severely in camera coordinates, preventing people from sinking through the ground in the large scene.
Preferably, the penetration loss corresponding to S403 penalizes, with an L1 norm, the signed ground-plane distance N^T·v + D of the most severely penetrating mesh vertices v.
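The progressive localization of S401 to S402 and the penetration penalty of S403 can be sketched together. The numeric values are illustrative, and the choice of how many worst-penetrating vertices to penalize (k) is an assumption, since the exact formula is not recoverable from the source.

```python
import numpy as np

def rough_position(p, t_p, K, N, D):
    """Rough 3D foot point P from the local 2D position (S401)."""
    p_h = np.array([p[0] + t_p[0], p[1] + t_p[1], 1.0])  # crop -> scene pixels
    ray = np.linalg.solve(K, p_h)                        # K^{-1} * homogeneous pixel
    Z = -D / (N @ ray)                                   # ground-plane depth
    return Z * ray

def penetration_loss(M_cam, N, D, k=5):
    """L1 penalty on the k vertices that penetrate the ground most (S403)."""
    signed = M_cam @ N + D          # > 0 above the ground, < 0 below it
    worst = np.sort(signed)[:k]     # most negative signed distances
    return np.abs(np.minimum(worst, 0.0)).mean()

K = np.array([[1000.0, 0, 512], [0, 1000.0, 512], [0, 0, 1.0]])
N, D = np.array([0.0, -1.0, 0.0]), 1.6
P = rough_position(p=(100.0, 200.0), t_p=(476.0, 632.0), K=K, N=N, D=D)
print(P)  # a point satisfying N.P + D = 0

# A toy "mesh" of two vertices, one of them 5 cm below the ground.
M_cam = np.array([[0.0, 1.65, 5.0], [0.0, 1.0, 5.0]])
print(penetration_loss(M_cam, N, D, k=1))  # 0.05
```

The 2D-to-3D lift and the plane-distance penalty share the same ground equation, which is why the method estimates N and D once per scene and reuses them for every person.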
Preferably, the scene-level fine-tuning described in S5 is implemented as follows:
S501, obtain the cropped images of the new scene image through the preprocessing of S1, and the corresponding ground equation and camera parameters through S2;
and S502, freeze most of the network described in S3 and optimize only the two branch networks that produce the 2D-position feature map and the SMPL-and-offset feature map; after 5 training iterations, the scene-level fine-tuned model is obtained.
Preferably, the merging process described in S6 is implemented as follows:
S601, scan from left to right and top to bottom over the human-body-center feature maps of all cropped images obtained in S302, using the positions of the cropped images within the large-scene image;
S602, set the threshold as a proportion of the cropped-image width; if the distance between two center points is below the threshold, they represent the same person. Compute each center point's distance to the boundary of its cropped image: the farther a center point is from the boundary, the lower the probability that the person it represents is truncated, and that center point is kept.
The beneficial effects of the invention include the following points:
(1) The invention provides a multi-person three-dimensional reconstruction method under a wide-field-of-view large scene that achieves globally spatially consistent multi-person pose and shape reconstruction with an end-to-end single-image multi-person reconstruction framework for large scenes; it contributes a human-centered, scale-adaptive hierarchical representation scheme and a ground-guided progressive localization method that overcomes the depth ambiguity of monocular color-camera capture.
(2) The method solves the problem that the prior art cannot obtain multi-person three-dimensional reconstruction results with accurate spatial positions from large-scene images.
(3) The human-centered, scale-adaptive hierarchical representation scheme uses sliding windows of different sizes so that human bodies occupy a suitable proportion of each cropped image, making them better suited to network learning and improving the robustness of predictions. To obtain accurate 3D positions of people in a large scene, the ground-guided progressive localization method converts scene-level global 3D localization into local 2D localization plus a 3D offset by estimating scene-level camera parameters and a common ground plane, achieving accurate global spatial localization of multiple people in the scene.
(4) To resolve the misalignment of 2D projections in new scenes at test time, the model is fine-tuned during testing; this fine-tuning involves no new operations, incurs little time cost, and effectively improves the accuracy of position prediction for people in new scenes.
(5) The method achieves the best results on the Panoptic and Crowd-Location datasets; Crowd-Location is a newly proposed large-scene test dataset comprising 20 images of 2 scenes, with rich annotations including bounding boxes, 2D poses, projections of foot points onto the image, and a homography matrix expressing the perspective transformation between the real-world ground and the corresponding image.
Drawings
FIG. 1 is a block diagram of the end-to-end single-image multi-person reconstruction framework for large scenes in the proposed multi-person three-dimensional reconstruction method under a wide-field-of-view large scene;
FIG. 2 shows multi-person reconstruction results on the Crowd-Location test set obtained with the proposed method;
FIG. 3 is a qualitative comparison between the proposed method and mainstream prior-art multi-person reconstruction methods.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.
Example 1:
a multi-person three-dimensional reconstruction method under a wide-field-of-view large scene comprises the following steps:
S1, preprocess the large-scene image: obtain cropped images of different resolutions through the human-centered adaptive hierarchical representation so that people occupy a suitable proportion of each cropped image, and scale the cropped images to a uniform size for network training while preserving the original aspect ratio. The preprocessing mainly includes the following steps:
S101, define the heights of the smallest and largest person in the large-scene image as h_min and h_max, respectively, and the upper and lower boundaries of the cropping region as s and e. The large-scene image is cropped with square sliding windows, the i-th window along the y direction having side length c_i; so that a person occupies half the height of a cropped image, c_1 = 2 × h_min, and the last (n-th) window along the y direction has c_n = c_1 × q^(n-1), where q is a proportionality coefficient. To ensure that every person appears complete in some cropped image, an overlapping sliding window is inserted between each pair of adjacent windows along the y direction, with side length half the sum of the adjacent windows' side lengths;
S102, keeping the original aspect ratio of the cropped images of different resolutions, unify them to (512, 512) by bicubic interpolation and pad the remaining area with 0;
S2, estimate the 2D joint points of the large-scene image with an existing 2D joint-point estimation method, manually correct joint points that are wrongly estimated or missing, and estimate the ground equation and camera parameters from the 2D joint points. The ground and camera-parameter estimation mainly includes the following steps:
S201, estimate the 2D joint points of each cropped image with the RMPE method, manually correct wrongly estimated or missing joint points, and merge the results to obtain the 2D joint-point information of the large-scene image; filter the poses according to prior information, keeping only standing poses;
S202, use a pinhole camera model with focal length f (f = f_x = f_y) and principal point at the image center; the ground equation is N^T·P_G + D = 0, where N ∈ R^3 is the ground normal with ‖N‖_2 = 1, D is a constant term reflecting the position of the ground, and P_G ∈ R^3 is a point on the ground;
S203, define the midpoint of the left and right ankle points as X_b, with projection x_b = (u_b, v_b) on the image, and the midpoint of the left and right shoulders as X_t, with projection x_t = (u_t, v_t); assume that X_b is a point on the ground, that the person standing there has a fixed height h, and that the line through X_b and X_t is parallel to the ground normal;
S204, by the pinhole imaging model, Z_b·x̃_b = K·X_b, where x̃_b is the homogeneous coordinate of x_b, K is the camera intrinsic matrix, and Z_b is the depth of X_b; since X_b lies on the ground and satisfies N^T·X_b + D = 0, we obtain Z_b = -D / (N^T·K^(-1)·x̃_b); the projected midpoint x̃_t of the left and right shoulders then follows from X_t = X_b + h·N as Z_t·x̃_t = K·(X_b + h·N), where Z_t is the depth of X_t;
S205, solve for the camera parameters and the ground equation with an optimization-based method; the loss for the i-th person combines a cosine-distance term between the predicted and observed 2D directions from ankle midpoint to shoulder midpoint with a length term on the same segment, where L_cosine denotes the cosine distance and λ_angle, λ_length are the weights of the respective loss terms;
S206, translate the obtained ground plane by 0.1 meter along its normal, so as to obtain the true ground rather than the plane of the ankles;
S3, train a network on the images preprocessed in S1; the network extracts features through a backbone and then uses three different branch networks to perform human detection, 2D position estimation, and joint 3D-offset and body-parameter-model estimation. The specific implementation of S3 is as follows:
S301, extract features from the input image through a backbone network and feed them to three different branch networks, each consisting of two ResNet blocks with batch normalization;
S302, the first branch network produces a human-body-center feature map, using a Gaussian kernel whose scale follows the body size to represent the likelihood of a person's center position in the feature map;
S303, the second branch network produces a 2D-position feature map, estimating the 2D coordinates of the left and right ankle points and a 2D offset; the sum of the ankle midpoint and the 2D offset is the required 2D position;
S304, the third branch network produces an SMPL-and-offset feature map, estimating the SMPL pose and shape parameters and the 3D offset;
S305, according to the positions obtained from the human-body-center feature map, extract the corresponding entries from the 2D-position feature map and the SMPL-and-offset feature map, yielding the 2D position, SMPL parameters, and 3D offset required to estimate the person's position and pose;
S306, first train the human-body-center feature map and the 2D-position feature map so that subsequently learned human meshes have suitable initial positions; after 20 iterations, train the entire network for 70 iterations;
S4, obtain a rough 3D position of each person through the ground-guided progressive localization method, using the camera parameters and ground equation from S2 and the 2D position from S3, and refine it to an accurate 3D position with the 3D offset from S3. The ground-guided progressive localization mainly includes the following steps:
S401, define the projection of the human-body center onto the ground as the foot point P, whose projections onto the large-scene image and the cropped image are p and p_local, respectively; p_local is the 2D position obtained from the 2D-position feature map, and p = p_local + t_p, where t_p is the pixel position of the top-left corner of the cropped image within the large-scene image; since P is a point on the ground, the rough 3D position P can be computed from the predicted camera parameters and ground equation as P = Z_p·K^(-1)·p̃ with Z_p = -D / (N^T·K^(-1)·p̃), where p̃ is the homogeneous coordinate of p;
S402, obtain the accurate 3D position through the 3D offset Δ_3d: the translation is T_3D = -mean(J_ankle) + Δ_3d + P, where J_ankle denotes the 3D coordinates of the left and right ankles, and the mesh in camera coordinates is M_camera = M + T_3D, where M is the human-body mesh in the SMPL canonical space;
S403, apply an L1 constraint to the mesh vertices that penetrate the ground most severely in camera coordinates, preventing people from sinking through the ground in the large scene; the corresponding penetration loss penalizes, with an L1 norm, the signed ground-plane distance N^T·v + D of the most severely penetrating vertices v;
S5, perform scene-level fine-tuning of the model at test time to obtain better 2D projection results; the fine-tuning involves no new operations and incurs little time cost. The scene-level fine-tuning is implemented as follows:
S501, obtain the cropped images of the new scene image through the preprocessing of S1, and the corresponding ground equation and camera parameters through S2;
S502, freeze most of the network of S3 and optimize only the two branch networks producing the 2D-position feature map and the SMPL-and-offset feature map; after 5 training iterations, the scene-level fine-tuned model is obtained;
s6, combining the multi-person reconstruction results of all the cut images, removing the persons repeatedly estimated, and obtaining the multi-person reconstruction results with consistent depth ordering under the wide view field and the large scene; the merging process described in S6 is specifically implemented as follows:
s601, scanning from left to right and from top to bottom according to the human body center feature maps of all the cut images obtained in the S302 and the positions of the cut images in the large scene image;
S602, the threshold is set to a fraction of the cropped-image width; if the distance between two center points is smaller than the threshold, they are taken to be the center of the same person; the distance from each center point to the boundary of its cropped image is computed, and since a center point farther from the boundary is less likely to represent a truncated person, the farther center point is kept.
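The merging rule of S601–S602 can be sketched as follows; the greedy ordering (visiting centers farthest from their crop boundary first) and the tuple layout are illustrative assumptions, and the threshold is passed in explicitly since its exact fraction of the crop width is not reproduced above:

```python
def merge_centers(centers, threshold):
    """Deduplicate person centers detected across overlapping crops.

    centers: list of (x, y, margin) in large-image pixels, where
             margin is the distance to the crop boundary.
    Two centers closer than `threshold` are treated as the same person;
    the one farther from its crop boundary (less likely truncated) wins.
    """
    kept = []
    # visit the most trustworthy (largest-margin) centers first
    for cx, cy, margin in sorted(centers, key=lambda c: -c[2]):
        if all((cx - kx) ** 2 + (cy - ky) ** 2 >= threshold ** 2
               for kx, ky, _ in kept):
            kept.append((cx, cy, margin))
    return kept
```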
Example 2:
referring to fig. 1 to 3, a specific implementation process of the multi-person three-dimensional reconstruction method in a wide-field large scene based on embodiment 1 is as follows:
data preprocessing:
the public Human3.6M, MuCo-3DHP, Agora and PANDA data sets, which cover crowd activity under a variety of conditions, are used in the present invention; to train with the PANDA data set, 2D joint points are annotated for its four scenes OCT Habour, Basketball Cort, University Campus and Huaqiangbei: 2D joint-point detection is first performed with RMPE, and wrongly estimated or missing joint points are manually corrected, giving the data set PANDA-Pose; cropped images are obtained from the large-scene data set with the human-centered adaptive hierarchical representation; for unified training, the images are fixed to (512, 512) by bicubic interpolation while keeping their original aspect ratio, and the insufficient area is zero-padded;
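The unified-resolution step above can be sketched as the following bookkeeping (pure Python; the bicubic resampling itself would be done by any image library, and the pad-to-the-right-and-bottom layout is an assumption):

```python
def letterbox_params(w, h, target=512):
    """Scale (w, h) to fit a target x target square while keeping the
    aspect ratio; the remainder is zero-padded (per the preprocessing
    step). Returns the resized size and the padding amounts."""
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_right, pad_bottom = target - new_w, target - new_h
    return new_w, new_h, pad_right, pad_bottom
```

For example, a 1024x512 crop is resized to 512x256 and padded with 256 rows of zeros.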
(II) estimating ground equations and camera parameters:
the data sets Human3.6M, MuCo-3DHP and Agora provide camera intrinsics and extrinsics; in the unified coordinate system a person is assumed to stand on the XOZ plane, which is the ground, with the Y direction the standing direction; the ground equation in the camera coordinate system is obtained by rotating the unified-coordinate-system ground with the camera extrinsics;
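The rotation of the unified-coordinate-system ground into the camera frame described above can be sketched as follows, assuming extrinsics of the form P_cam = R·P_world + t:

```python
import numpy as np

def ground_in_camera(R, t):
    """World ground is the XOZ plane (normal +Y, passing through the
    origin). With extrinsics P_cam = R @ P_world + t, return (N, D) of
    the camera-frame ground equation N^T P + D = 0."""
    n_world = np.array([0.0, 1.0, 0.0])
    n_cam = R @ n_world          # rotate the normal into the camera frame
    p0_cam = t                   # the world origin lies on the ground
    d = -n_cam @ p0_cam          # plane offset so that N^T p0 + D = 0
    return n_cam, d
```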
for the PANDA-Pose data set, estimating a ground equation and camera parameters by using the 2D joint points; firstly, filtering the 2D posture by using prior information, and only keeping the standing posture; assuming that a person has a fixed height, solving a ground equation and camera parameters by using an optimization-based method;
(III) human body shape and position information estimation network:
in the training process, after the input image passes through a backbone network for feature extraction, the features are fed into three branch networks: the first predicts the human-center feature map, the second the 2D position feature map, and the third the SMPL-and-offset feature map; the backbone and the first and second branch networks are trained first so that the human mesh has an initial position, and the whole network is then trained to obtain the SMPL parameters, 2D positions and 3D offsets required for human shape and position estimation;
specifically, the backbone network adopts HRNet-32 to extract features, and each branch network consists of two ResNet blocks and batch normalization;
(IV) the ground guided progressive positioning method comprises the following steps:
on the basis of the above steps, the precise 3D position of each person is obtained with the ground-guided progressive positioning method; the 2D position estimated by the network is the projection of the footpoint on the picture, and according to the pinhole imaging principle it is converted, using the ground equation and camera parameters, into the 3D footpoint, giving the rough 3D position of the person; the human mesh is obtained from the SMPL parameters estimated by the network, its coordinate origin is moved to the midpoint of the left and right ankle points, and the 3D offset estimated by the network is added to the rough 3D position to obtain the precise 3D position of the person;
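The positioning described above amounts to intersecting the camera ray through the 2D footpoint with the ground plane, then applying the learned 3D offset; a NumPy sketch under the document's notation (K intrinsics, ground N^T P + D = 0) is:

```python
import numpy as np

def footpoint_3d(p, K, n, d):
    """Rough 3D position of the footpoint: intersect the camera ray
    through pixel p = (u, v) with the ground plane n^T P + d = 0."""
    ray = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    z = -d / (n @ ray)           # depth from n^T (z * ray) + d = 0
    return z * ray

def precise_position(P, delta_3d, ankles):
    """T_3D = -mean(J_ankle) + delta_3d + P (S402): shift the mesh
    origin to the ankle midpoint, then apply footpoint plus offset."""
    return -ankles.mean(axis=0) + delta_3d + P
```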
(V) scene-level fine tuning and merging:
when multi-person reconstruction is performed on a new scene image in the testing stage, scene-level fine-tuning is applied to the model to obtain a better 2D projection result; the fine-tuning involves no new operations and has a low time cost: most model parameters are fixed, only the two branch networks producing the 2D position feature map and the SMPL (Skinned Multi-Person Linear model)-and-offset feature map are optimized, and the fine-tuned model is obtained after 5 epochs; the multi-person reconstruction results of all images are then merged and repeatedly estimated persons removed, yielding a multi-person reconstruction with a consistent global spatial distribution.
As shown in fig. 1, the end-to-end large-scene single-image multi-person reconstruction framework proposed by the present invention proceeds as follows: the large-scene image is cropped according to the human-centered hierarchical representation; 2D joint points of the large-scene image are obtained with an existing 2D joint-point estimator plus manual correction; the ground equation and camera parameters are estimated; human shape and position information is estimated by the network; and the precise 3D position of each person is obtained by the ground-guided progressive positioning method;
as shown in fig. 2, the multi-person three-dimensional reconstruction results on the Crowd-Location test set demonstrate that the large-scene multi-person reconstruction method provided by the present invention accurately recovers 3D positions of people, yields multi-person reconstructions with a consistent global spatial distribution, and obtains reasonable postures;
as shown in fig. 3, which compares qualitative results of the present invention with current mainstream multi-person reconstruction methods, the method better predicts the 3D positions of people and, for the reconstructed human models, obtains reasonable shape and posture estimates;
table 1 and Table 2 show the comparison of the quantitative results of the present invention and the current mainstream multi-person reconstruction method in the case of a crown-Location data set S1 and S2, respectively; SMAP was proposed In 2020 by Zhen et al (Zhen J, Fan Q, Sun J, et al. SMAP: Single-Shot Multi-Person Absolute 3D Point Estimation [ C ]. In Proceedings of the IEEE/CVF European Conference on Computer Vision (ECCV), CRMH was proposed In 2020 by Jiang et al (Jiang W, Koutours N, Pavlakoks G, et al. Copherent Reconnation of Multi-projects from A Single-Shot [ C ]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Reginition (CVPR), whereas ROMP was proposed In year by Sun et al (Sun Y, bag Q, monkey W.S. simulation of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), and was proposed In ROMP was proposed In 202n et al (balance Q, monkey W3D. C, monkey W1. In 202F 12, see FIG. 12 C.C. (IEEE/12, monkey W.S.: In 202F 12 C. IEEE/CVF reference on Computer Vision and Pattern Registration (CVPR),2022) proposed in 2022, these multi-person reconstruction methods could not be directly applied to large scene images, and their multi-person reconstruction results for large scene images were obtained in a corresponding manner to these methods, and the cropped images were obtained as input to the multi-person reconstruction method by an adaptive hierarchical representation method centered on people, and the camera parameters used were the camera parameters obtained by the present invention;
for CRMH, the transformation from the bounding-box coordinate system to the global coordinate system is used to obtain the global position; from the CRMH output, the weak-perspective camera parameters of the i-th person are π_i = {s_i, x_i, y_i}, the bounding box is B_i = [x_min, y_min, x_max, y_max], its center is c_i = [(x_min + x_max)/2, (y_min + y_max)/2] and its scale is α_i = max(x_max − x_min, y_max − y_min); from these parameters the depth of the i-th person is recovered, and from that depth the global spatial position of the i-th person is calculated by the following equation:
w and h are the width and height of the large scene image respectively;
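The depth formula for the CRMH conversion is not reproduced in the text above; the sketch below therefore uses the standard weak-perspective relation d = 2f / (s·α), which is an assumption, together with the box center and scale defined above and a principal point assumed at the large-image center:

```python
import numpy as np

def crmh_global_position(s, box, f, w, h):
    """Global position from CRMH-style weak-perspective camera
    pi = {s, x, y} and box B = [xmin, ymin, xmax, ymax].

    The depth formula d = 2 f / (s * alpha) is the standard
    weak-perspective relation (assumed, not quoted from the patent);
    w, h are the large-scene image width and height."""
    xmin, ymin, xmax, ymax = box
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2   # box center c_i
    alpha = max(xmax - xmin, ymax - ymin)           # scale alpha_i
    depth = 2 * f / (s * alpha)
    # back-project the box center through the large-scene pinhole camera
    X = (cx - w / 2) * depth / f
    Y = (cy - h / 2) * depth / f
    return np.array([X, Y, depth])
```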
for ROMP, the global spatial position of each person is solved by the PnP algorithm from the ROMP output and the camera parameters estimated by the present method;
for SMAP and BEV, both methods directly give the spatial position of a person in their input image, but each cropped image has independent camera parameters: the focal length of each crop is its width w_c and the principal point is the image center; in SMAP, the spatial position of the j-th joint of the i-th person is T_ij = {X_ij, Y_ij, Z_ij} with projection {x_ij, y_ij} on the large scene; to unify it into the large-scene camera coordinate system, the joint depth becomes Z_ij_global = Z_ij × f / w_c, and by the perspective projection principle the 3D position of the j-th joint of the i-th person in the large-scene camera coordinate system is:
for BEV, only the offsets in the crop's independent camera coordinate system are converted in this way, rather than all the joint points.
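The SMAP depth unification and back-projection above can be sketched as follows; the helper names are illustrative, and the large-scene principal point (c_x, c_y) is assumed to be the image center:

```python
import numpy as np

def smap_depth_to_global(z_crop, f, w_c):
    """Unify a crop-camera depth into the large-scene camera: the crop
    camera's focal length is the crop width w_c, so the depth scales by
    f / w_c (Z_global = Z_crop * f / w_c)."""
    return z_crop * f / w_c

def backproject(x, y, z, f, cx, cy):
    """Perspective back-projection of large-scene pixel (x, y) at depth
    z through focal length f and principal point (cx, cy)."""
    return np.array([(x - cx) * z / f, (y - cy) * z / f, z])
```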
The evaluation indices of the quantitative results are: Matched, measuring the match rate between predictions and ground-truth annotations; PCOD, measuring the accuracy of inter-person depth ordering; PPDError, measuring the accuracy of inter-person distance prediction; and OKS, measuring the accuracy of the predicted pose against the true pose; the data sets are the S1 and S2 scenes of Crowd-Location. The reconstruction results of the invention achieve the best spatial-position performance on the multi-person three-dimensional reconstruction problem under a wide field of view and large scene, while also obtaining reasonable postures;
TABLE 1
TABLE 2
Table 3 compares the quantitative results of the present invention with current mainstream multi-person reconstruction methods on the Panoptic data set. SMAP was proposed by Zhen et al. in 2020 (Zhen J, Fang Q, Sun J, et al. SMAP: Single-Shot Multi-Person Absolute 3D Pose Estimation [C]. In Proceedings of the European Conference on Computer Vision (ECCV), 2020); CRMH was proposed by Jiang et al. in 2020 (Jiang W, Kolotouros N, Pavlakos G, et al. Coherent Reconstruction of Multiple Humans from a Single Image [C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020); ROMP was proposed by Sun et al. in 2021 (Sun Y, Bao Q, Liu W, et al. Monocular, One-Stage, Regression of Multiple 3D People [C]. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021); BEV was proposed by Sun et al. in 2022 (Sun Y, Liu W, Bao Q, et al. Putting People in their Place: Monocular Regression of 3D People in Depth [C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022); and 3DCrowdNet was proposed by Choi et al. in 2022 (Choi H, Moon G, Park J, et al. Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes [C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022).
The evaluation indices of the quantitative results are: MPJPE, measuring the Euclidean distance between the predicted and real three-dimensional human models; RtError, measuring the Euclidean distance between the predicted and real root joints; PCOD, measuring the accuracy of inter-person depth ordering; and PPDError, measuring the accuracy of inter-person distance prediction; the data set comprises the four commonly used Panoptic sub-datasets Haggling, Mafia, Ultimatum and Pizza; the reconstruction results of the invention also achieve good performance on multi-person three-dimensional reconstruction in small and medium scenes;
TABLE 3
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any equivalent replacement or modification conceived by a person skilled in the art according to the technical solution and inventive concept of the present invention, within the technical scope disclosed herein, shall fall within the scope of the present invention.
Claims (8)
1. A multi-person three-dimensional reconstruction method under a wide-field-of-view large scene is characterized by comprising the following steps: the method comprises the following steps:
S1, preprocessing the large-scene image: cropped images of different resolutions are obtained through the human-centered adaptive hierarchical representation so that people occupy a suitable proportion of each cropped image, and the cropped images are scaled to a uniform size, keeping the original aspect ratio, for training the network;
s2, estimating 2D joint points of the large scene image by using the existing 2D joint point estimation method, correcting the 2D joint points which are estimated wrongly or lacked by using a manual correction method, and estimating a ground equation and camera internal parameters by using the 2D joint points;
s3, training a network by using the cut image obtained by preprocessing in S1, wherein the network realizes feature extraction through a backbone network, and then three different branch networks are used for respectively carrying out human body detection, 2D position estimation and 3D offset and human body parameter model estimation;
s4, obtaining a rough 3D position of the human body by utilizing the camera internal reference and the ground equation obtained in S2 and based on the 2D position obtained in S3 through a ground-guided progressive positioning method, and obtaining an accurate 3D position of the human body by combining the 3D offset obtained in S3;
s5, performing scene-level fine adjustment on the model in a testing stage, and performing multi-person reconstruction on a new scene image to obtain a better 2D projection result;
and S6, combining the multi-person reconstruction results of all the cut images, and removing the persons repeatedly estimated to obtain a multi-person reconstruction result with consistent global space under a wide field of view and a large scene.
2. The multi-person three-dimensional reconstruction method under the wide-field-of-view large-scene according to claim 1, characterized in that: the preprocessing process described in S1 mainly includes the steps of:
S101, the heights of the smallest and largest persons in the large-scene image are defined as h_min and h_max, and the upper and lower boundaries of the cropping region as s and e; the large-scene image is cropped with square sliding windows, the side length of the i-th sliding window in the y direction being c_i; so that a person occupies half the height of the cropped image, c_1 = 2 × h_min, and the last (n-th) sliding window in the y direction has side length c_n = c_1 × q^(n-1), where q is a proportionality coefficient;
the human-centered adaptive hierarchy described in S1 is represented as follows:
in order to ensure that each person can completely appear in the cut image, an overlapped sliding window is added between two adjacent sliding windows in the y direction, and the length of the overlapped sliding window is half of the sum of the lengths of the adjacent sliding windows;
S102, keeping the original aspect ratio of the cropped images of different resolutions, the cropped images are unified to (512, 512) by bicubic interpolation, and the insufficient parts are zero-padded.
3. The multi-person three-dimensional reconstruction method under the wide-field-of-view large-scene according to claim 1, characterized in that: the estimation of the ground equation and the camera parameters in S2 mainly includes the following steps:
S201, 2D joint points of each cropped image are estimated with the RMPE method and manually corrected where wrongly estimated or missing; the results are merged to obtain the 2D joint-point information of the large-scene image; the poses are then filtered according to prior information, keeping only standing poses;
S202, a pinhole camera model is used with focal length f (f = f_x = f_y) and principal point at the image center; the ground equation is N^T P_G + D = 0, where N is the ground normal with ||N||_2 = 1, D is a constant term reflecting the position of the ground, and P_G is a point on the ground;
S203, the midpoint of the left and right ankle points is defined as X_b, with projection x_b = (u_b, v_b) on the image; the midpoint of the left and right shoulders is defined as X_t, with projection x_t = (u_t, v_t); X_b is assumed to be a point on the ground on which a person of fixed height h stands, and the line through X_b and X_t is parallel to the ground normal;
S204, by the pinhole imaging principle, Z_b K^(-1) x̃_b = X_b, where x̃_b is the homogeneous coordinate of x_b, K is the camera intrinsic matrix and Z_b is the depth of X_b; since X_b is a point on the ground, it satisfies N^T X_b + D = 0, giving Z_b = -D / (N^T K^(-1) x̃_b); further, from X_t = X_b + hN and Z_t K^(-1) x̃_t = X_t, where Z_t is the depth of X_t, the projection of X_t predicted by the camera parameters can be obtained;
S205, the camera parameters and the ground equation are solved by an optimization-based method; the loss function of the i-th person is specified as follows, where L_cosine denotes the cosine distance and λ_angle, λ_norm are the weights of the respective loss terms:
and S206, the obtained ground is translated by 0.1 m along its normal direction, so as to obtain the true ground rather than the plane in which the ankles lie.
4. The multi-person three-dimensional reconstruction method under the wide-field-of-view large-scene according to claim 1, characterized in that: the specific implementation process of S3 is as follows:
s301, performing feature extraction on an input picture through a backbone network, and further inputting the obtained features into three different branch networks, wherein each branch network consists of two ResNet blocks and batch normalization;
s302, obtaining a human body central feature map by a first branch network, and representing the possibility of the central position of a human in the feature map by using a Gaussian kernel combined with a body scale;
S303, the second branch network produces the 2D position feature map, estimating the 2D coordinates of the left and right ankle points together with a 2D offset; the sum of the ankle midpoint and the 2D offset is the required 2D position;
s304, obtaining an SMPL and an offset characteristic diagram by a third branch network, and estimating posture and shape parameters and 3D offset of the SMPL;
s305, extracting corresponding parameters from the 2D position feature map, the SMPL and the offset feature map according to the position obtained by the human body center feature map, and obtaining a 2D position, an SMPL parameter and a 3D offset required by estimating the position and the posture of the human body;
S306, the human-center feature map and the 2D position feature map are trained first so that the subsequently learned human mesh has a proper initial position; after 20 epochs the whole network is trained jointly, for 70 epochs in total.
5. The method for multi-person three-dimensional reconstruction in a wide-field large-scene according to claim 1, characterized in that: the ground-guided progressive positioning method described in S4 mainly includes the steps of:
S401, the projection of the human body center onto the ground is defined as the footpoint P; its projections onto the large-scene image and onto the cropped image are p and p_local respectively, where p_local is the 2D position obtained from the 2D position feature map and p = p_local + t_p, t_p being the pixel position of the upper-left corner of the cropped image within the large-scene image; since P is a point on the ground, the rough 3D position P can be calculated from the predicted camera parameters and the ground equation: with x̃_p the homogeneous coordinate of p, P = Z_p · K^(-1) x̃_p, where the depth Z_p = -D / (N^T K^(-1) x̃_p) follows from N^T P + D = 0;
S402, the precise 3D position is obtained from the 3D position offset Δ_3d as T_3D = -mean(J_ankle) + Δ_3d + P, where J_ankle denotes the 3D coordinates of the left and right ankles; the human-body mesh in the camera coordinate system is then obtained by the following equation:
M_camera = M + T_3D
wherein M is the human-body mesh in the SMPL canonical space;
and S403, an L1 constraint is imposed on the most severely ground-penetrating point of the human-body mesh in the camera coordinate system, to prevent people from sinking through the ground in the large scene.
7. The multi-person three-dimensional reconstruction method under the wide-field-of-view large-scene according to claim 1, characterized in that: the scene-level fine adjustment described in S5 is implemented as follows:
s501, obtaining a cut image of a new scene image through the preprocessing process of S1, and obtaining a corresponding ground equation and camera parameters through S2;
and S502, most parameters of the network in S3 are fixed, only the two branch networks producing the 2D position feature map and the SMPL-and-offset feature map are optimized, and the scene-level fine-tuned model is obtained after 5 epochs of iteration.
8. The multi-person three-dimensional reconstruction method under the wide-field-of-view large-scene according to claim 1, characterized in that: the merging process described in S6 is specifically implemented as follows:
s601, scanning from left to right and from top to bottom according to the human body center feature maps of all the cut images obtained in S302 and the positions of the cut images in the large scene image;
S602, the threshold is set to a fraction of the cropped-image width; if the distance between two center points is smaller than the threshold, they are taken to represent the same person; the distance from each center point to the boundary of its own cropped image is computed, and since a center point farther from the boundary is less likely to represent a truncated person, the farther center point is kept.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210778162.2A CN115131504A (en) | 2022-06-29 | 2022-06-29 | Multi-person three-dimensional reconstruction method under wide-field-of-view large scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115131504A true CN115131504A (en) | 2022-09-30 |
Family
ID=83382779
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210778162.2A Pending CN115131504A (en) | 2022-06-29 | 2022-06-29 | Multi-person three-dimensional reconstruction method under wide-field-of-view large scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115131504A (en) |
-
2022
- 2022-06-29 CN CN202210778162.2A patent/CN115131504A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||