CN114663917A - Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device - Google Patents

Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device

Info

Publication number
CN114663917A
CN114663917A (application CN202210247791.2A)
Authority
CN
China
Prior art keywords
human body
person
dimensional
dimensional human
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210247791.2A
Other languages
Chinese (zh)
Inventor
季向阳
余杭
连晓聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210247791.2A priority Critical patent/CN114663917A/en
Publication of CN114663917A publication Critical patent/CN114663917A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application discloses a multi-person three-dimensional human body pose estimation method and device based on multiple visual angles, wherein the method comprises the following steps: estimating the two-dimensional human body pose of each person of each human body picture based on a transformer manner; carrying out person registration identification on each person of each human body picture by using the two-dimensional human body pose of each person and a pre-constructed pedestrian re-identification model to determine the two-dimensional joint points of each person; and performing a triangulation operation on the two-dimensional joint points of each person in each picture in a multi-view triangulation manner, fitting them to three-dimensional key points in three-dimensional space, and generating a three-dimensional human body posture. This solves the technical problems in the related art of high computing resource consumption, slow running time, difficulty in wide application to scenes with poor computing hardware conditions, and low applicability.

Description

Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device
Technical Field
The application relates to the technical field of computer vision, in particular to a multi-person three-dimensional human body pose estimation method and device based on multiple visual angles.
Background
The human body posture estimation task is an important research branch in the field of computer vision and a current research hotspot driven by application and industrial requirements. Human body posture estimation tasks are commonly divided in three ways: by the number of input viewing angles, into single-view and multi-view estimation tasks; by the number of detected persons, into single-person and multi-person scene tasks; and by the target information, into two-dimensional and three-dimensional estimation tasks. Across these basic categories and application scenes, human body posture estimation has wide application in fields such as holographic reality, human body simulation, video monitoring, and unmanned aerial vehicle swarms, and has huge development potential. Human body pose estimation is also the research basis of many computer vision tasks, and its estimation precision has an important influence on the effect of downstream tasks. Therefore, the study of the human posture estimation problem is of increasing importance.
The related art follows three steps: single-image two-dimensional human body posture estimation, multi-image person registration identification, and three-dimensional human body posture fitting. It adopts a multi-path matching algorithm that first establishes a cross-matching matrix by analyzing appearance information and searches for cycle-consistent correspondences among the two-dimensional poses detected in multiple views, so as to match different people across the multi-view images. The matching algorithm can prune false detections and handle partial overlap between views without knowing the true number of people in the scene, and achieves a good effect on the cross-view two-dimensional person matching problem. Meanwhile, the related art improves the conventional 3DPS (3D Pictorial Structures) method and, by matching the detected two-dimensional poses among multiple views, generates two-dimensional pose clusters that each contain the two-dimensional poses of the same person under different views, thereby solving the correspondence problem at the body level.
However, in the related art, there are still many drawbacks:
First, the two-dimensional human body pose estimation method performs poorly. The two-dimensional human body pose estimator used in the related art is a CPN (Cascaded Pyramid Network) trained on the MSCOCO two-dimensional human body pose data set. The CPN system has two sub-networks: a global network for global human positioning and rough key point estimation, and a local fine-tuning network for estimating the fine position of each joint point. This method suffers from poor precision, slow running time, and high consumption of computing resources.
Second, the operation speed of the whole system is low. The system basically follows the same processing steps as other similar systems, namely the pipeline of single-frame two-dimensional pose estimation, person matching, and three-dimensional fitting. In the related art, each system designs too many stages to ensure accuracy: for example, the person matching stage combines several techniques such as similarity estimation, pedestrian re-identification, and cycle consistency to ensure fitting accuracy, and the three-dimensional fitting stage uses the computation-heavy 3DPS method to ensure fitting accuracy. This occupies a large amount of computing resources and greatly prolongs the computation time.
In summary, because the related art occupies more computing resources and runs more slowly, it is difficult to apply widely in scenes with poorer computing hardware conditions; its applicability is low and improvement is urgently needed.
Disclosure of Invention
The application provides a multi-person three-dimensional human body pose estimation method and device based on multiple visual angles, aiming to solve the technical problems in the related art of high computing resource consumption, slow running time, difficulty in wide application to scenes with poor computing hardware conditions, low applicability, poor human body pose estimation accuracy, and the like.
An embodiment of a first aspect of the application provides a multi-person three-dimensional human body pose estimation method based on multiple visual angles, which includes the following steps: estimating the two-dimensional human body pose of each person of each human body picture based on a transformer manner; carrying out person registration identification on each person of each human body picture by using the two-dimensional human body pose of each person and a pre-constructed pedestrian re-identification model to determine the two-dimensional joint points of each person; and carrying out a triangulation operation on the two-dimensional joint points of each person in each picture in a multi-view triangulation manner, fitting the two-dimensional joint points to three-dimensional key points in three-dimensional space, and generating a three-dimensional human body posture.
Optionally, in an embodiment of the present application, the estimating, based on a transformer manner, a two-dimensional human body pose of each person of each human body picture includes: detecting the two-dimensional human body pose of each person in each human body picture by using a swin-transformer backbone; or estimating the joint point position of each person by using the swin-transformer backbone, and determining the two-dimensional human body pose.
Optionally, in an embodiment of the present application, before performing person registration identification on each person in each human body picture by using the pre-constructed pedestrian re-identification model, the method further includes: obtaining a public data set for training the model; and training a pedestrian re-identification model constructed based on deep learning by using the public data set to generate the pre-constructed pedestrian re-identification model.
Optionally, in an embodiment of the present application, the estimating, based on a transformer manner, a two-dimensional human body pose of each person of each human body picture includes: acquiring a two-dimensional human body posture estimation of each picture based on a ViT-variant framework.
An embodiment of a second aspect of the present application provides a multi-person three-dimensional human body pose estimation apparatus based on multiple viewing angles, including: a pose estimation module, configured to estimate the two-dimensional human body pose of each person of each human body picture based on a transformer manner; an identification module, configured to carry out person registration identification on each person of each human body picture by using the two-dimensional human body pose of each person and a pre-constructed pedestrian re-identification model, to determine the two-dimensional joint points of each person; and a generating module, configured to carry out a triangulation operation on the two-dimensional joint points of each person in each picture in a multi-view triangulation manner, fit the two-dimensional joint points to three-dimensional key points in three-dimensional space, and generate a three-dimensional human body posture.
Optionally, in an embodiment of the present application, the pose estimation module is further configured to detect a two-dimensional human body pose of each person of each human body picture by using a swin-transformer backbone; or estimate the joint point position of each person by using the swin-transformer backbone, and determine the two-dimensional human body pose.
Optionally, in an embodiment of the present application, the identification module includes: an acquisition unit for acquiring a public data set for training a model; and a generating unit for training a pedestrian re-identification model constructed based on deep learning by using the public data set to generate the pre-constructed pedestrian re-identification model.
Optionally, in an embodiment of the present application, the pose estimation module includes: a posture estimation unit for acquiring the two-dimensional human body posture estimation of each picture based on the ViT-variant framework.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the multi-view-based multi-person three-dimensional human body pose estimation method described above.
A fourth aspect embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, where the program is executed by a processor to implement the multi-perspective-based multi-person three-dimensional human body pose estimation method according to any one of claims 1 to 4.
According to the embodiment of the application, the two-dimensional human body pose of each person of each human body picture can be estimated based on a transformer manner, person registration identification is carried out by using a pedestrian re-identification model, the two-dimensional joint points of each person are extracted, and a triangulation operation is carried out on the two-dimensional key points in a multi-view triangulation manner to generate the three-dimensional human body posture. Harsh hardware operating conditions are not needed, multi-person three-dimensional human body pose estimation can be completed while a certain precision is guaranteed, and the applicability is stronger. This solves the technical problems that, in order to realize multi-person three-dimensional human body pose estimation in the related art, more computing resources are occupied at run time, the running time is slow, the method is difficult to apply widely in scenes with poor computing hardware conditions, and the applicability is low.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a multi-person three-dimensional human body pose estimation method based on multiple viewing angles according to an embodiment of the present application;
FIG. 2 is a flowchart of a multi-person three-dimensional human body pose estimation method based on multiple viewing angles according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a multi-person three-dimensional human body pose estimation apparatus based on multiple viewing angles according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present application and should not be construed as limiting the present application.
The following describes a multi-person three-dimensional human body pose estimation method and device based on multiple viewing angles according to embodiments of the present application with reference to the accompanying drawings. Aiming at the technical problems mentioned in the background that, in order to realize multi-person three-dimensional human body pose estimation, the related art occupies more computing resources and runs more slowly, is difficult to apply widely in scenes with poorer computing hardware conditions, and has lower applicability, the application provides a multi-person three-dimensional human body pose estimation method based on multiple visual angles. In this method, the two-dimensional human body pose of each person of each human body picture can be estimated based on a transformer manner, a pedestrian re-identification model is used to carry out person registration identification and extract the two-dimensional joint points of each person, and a triangulation operation is carried out on the two-dimensional key points in a multi-view triangulation manner to generate the three-dimensional human body posture. Harsh hardware operating conditions are not needed, and on the premise of ensuring a certain precision, multi-person three-dimensional human body pose estimation can be completed with stronger applicability. This solves the technical problems that, in order to realize multi-person three-dimensional human body pose estimation in the related art, more computing resources are occupied at run time, the running time is slow, the method is difficult to apply widely in scenes with poor computing hardware conditions, and the applicability is low.
Specifically, fig. 1 is a schematic flow chart of a multi-person three-dimensional human body pose estimation method based on multiple viewing angles according to an embodiment of the present application.
As shown in fig. 1, the multi-person three-dimensional human body pose estimation method based on multiple visual angles includes the following steps:
in step S101, a two-dimensional human body pose of each person of each human body picture is estimated based on the transform method.
It can be understood that two-dimensional human body posture estimation can be divided into single-person and multi-person posture estimation by the number of people in the image. Single-person posture estimation can be regarded as locating the joint points of a given single person in the image, the number of people being known in advance. Multi-person posture estimation, however, generally cannot determine in advance the number of people or the position of each person in the input (picture or video), and the probability of occlusion between bodies and self-occlusion is higher, so the multi-person posture estimation task is more difficult than the single-person task.
In the actual execution process, the transformer framework can be used to estimate the two-dimensional human body pose of each person of each human body picture, thereby realizing single-person or multi-person human body pose estimation and laying a foundation for the subsequent conversion to three-dimensional human body poses.
Optionally, in an embodiment of the present application, estimating the two-dimensional human body pose of each person of each human body picture based on a transformer manner includes: obtaining a two-dimensional human body posture estimation of each picture based on the ViT-variant framework.
Under some conditions, the two-dimensional human body posture estimation of each picture can be obtained based on the ViT-variant framework, laying a foundation for the subsequent determination of the two-dimensional joint points of each person. This reduces computation and running time while maintaining relatively high precision, which helps the method to be widely applied in scenes with limited hardware.
Optionally, in an embodiment of the present application, estimating a two-dimensional human body pose of each person of each human body picture based on a transformer manner includes: detecting the two-dimensional human body pose of each person in each human body picture by using the swin-transformer backbone; or estimating the joint point position of each person by using the swin-transformer backbone, and determining the two-dimensional human body pose.
In other cases, the embodiment of the present application may use a variant of the ViT model, namely an improved swin-transformer backbone, to replace the human body detection stage in the top-down CPN, or use the improved swin-transformer backbone to replace the whole CPN pipeline (human body detection plus key point estimation).
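As a rough, hedged illustration of this kind of transformer-based two-dimensional pose branch (not the patent's exact network), the sketch below pairs a Swin backbone with a simple deconvolution head that regresses one heatmap per joint; torchvision's swin_t stands in for the "improved swin-transformer backbone", and the joint count, head widths, and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t


class SwinKeypointEstimator(nn.Module):
    """Swin backbone + deconvolution head that regresses one heatmap per joint."""

    def __init__(self, num_joints: int = 17):
        super().__init__()
        # torchvision's Swin-T stands in for the improved swin-transformer backbone;
        # weights=None keeps the sketch self-contained (pretrained weights would
        # normally be loaded here).
        self.features = swin_t(weights=None).features  # (B, H/32, W/32, 768), channels-last
        self.head = nn.Sequential(                     # upsample the 1/32 feature map to 1/4 resolution
            nn.ConvTranspose2d(768, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_joints, kernel_size=1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feats = self.features(img).permute(0, 3, 1, 2)  # to channels-first for the conv head
        return self.head(feats)                         # (B, num_joints, H/4, W/4) heatmaps


if __name__ == "__main__":
    model = SwinKeypointEstimator()
    heatmaps = model(torch.randn(1, 3, 256, 192))       # one cropped person image
    print(heatmaps.shape)                               # torch.Size([1, 17, 64, 48])
```

In a top-down setting, each cropped person image would pass through such a network and the peak of each heatmap would give the two-dimensional position of the corresponding joint.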
In step S102, person registration identification is performed on each person in each human body picture by using the two-dimensional human body pose of each person and a pre-constructed pedestrian re-identification model, and the two-dimensional joint points of each person are determined.
As can be understood by those skilled in the art, pedestrian re-identification is an artificial intelligence technology that uses computer vision to retrieve whether a specific pedestrian exists in an image or a video sequence, and it can be applied to scenes such as city monitoring. The flow for constructing the pedestrian re-identification model in advance can be divided into three parts: feature learning, metric learning, and ranking optimization.
Furthermore, the embodiment of the application can perform person registration identification on each person in each human body picture based on the pre-constructed pedestrian re-identification model, so that the two-dimensional joint points of each person are extracted, providing a data basis for the subsequent triangulation technology and facilitating the multi-view-based pose estimation of the multi-person three-dimensional human body.
Optionally, in an embodiment of the present application, before performing person registration identification on each person in each human body picture by using a pre-constructed pedestrian re-identification model, the method further includes: obtaining a public data set for training the model; and training the pedestrian re-identification model constructed based on deep learning by using the public data set to generate a pre-constructed pedestrian re-identification model.
Specifically, the construction step of the pedestrian re-identification model comprises the following steps:
1. and (4) data acquisition. Typical sources of data include: and acquiring a data set for constructing a pedestrian re-identification model from original data intercepted from the video or data of a monitoring camera in the same scene.
2. And generating a pedestrian frame. The construction of the pedestrian re-identification model can cut the pedestrian out from the way through a manual mode, a pedestrian detection mode, a tracking mode and the like from the video data, so as to generate a pedestrian frame.
3. And marking training data. The marked data comprises information such as a camera label and a pedestrian label.
4. And a training module.
5. And carrying out model test in a test environment.
In the embodiment of the application, a public data set can be used to train the pedestrian re-identification model constructed by deep learning, so the three steps of data acquisition, pedestrian frame generation, and training data annotation can be omitted. This simplifies the process of constructing the pedestrian re-identification model and relaxes the requirements on hardware equipment, so the embodiment of the application can be widely applied in scenes with limited hardware while maintaining relatively high accuracy.
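As a hedged illustration of how such a re-identification model can be used for cross-view person registration (the patent does not prescribe this exact procedure), the sketch below embeds person crops from two camera views with a generic CNN trunk standing in for the trained re-identification model and matches them with a Hungarian assignment on cosine distance; the network, crop sizes, and matching rule are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from scipy.optimize import linear_sum_assignment

# A ResNet-50 trunk (classification head removed) stands in for the trained
# pedestrian re-identification model; a real system would load ReID weights.
reid = nn.Sequential(*list(resnet50(weights=None).children())[:-1])
reid.eval()


@torch.no_grad()
def embed(crops: torch.Tensor) -> torch.Tensor:
    """crops: (N, 3, 256, 128) person crops from one camera view -> unit-norm embeddings."""
    return F.normalize(reid(crops).flatten(1), dim=1)   # (N, 2048)


def register_across_views(crops_a: torch.Tensor, crops_b: torch.Tensor):
    """Return (index_in_view_a, index_in_view_b) pairs judged to be the same person."""
    similarity = embed(crops_a) @ embed(crops_b).T                 # cosine similarity matrix
    rows, cols = linear_sum_assignment(1.0 - similarity.numpy())   # Hungarian matching on distance
    return list(zip(rows.tolist(), cols.tolist()))


if __name__ == "__main__":
    pairs = register_across_views(torch.randn(3, 3, 256, 128), torch.randn(4, 3, 256, 128))
    print(pairs)  # e.g. [(0, 2), (1, 0), (2, 3)]
```

Once persons are registered across views, the two-dimensional joint points detected for the same identity in each view can be grouped together for the triangulation step.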
In step S103, a triangulation operation is performed on the two-dimensional joint points of each person in each picture in a multi-view triangulation manner, fitting the two-dimensional joint points to three-dimensional key points in three-dimensional space and generating a three-dimensional human body posture.
As a possible implementation manner, the embodiment of the application may adopt a multi-view triangulation technology and, through a triangulation operation, directly fit the two-dimensional joint points determined for each person in each picture to three-dimensional key points in three-dimensional space, thereby generating the three-dimensional human body posture. The triangulation technology adopted in the embodiment of the application has a small calculation scale, can effectively improve the operation rate, and can be widely applied in scenes with low precision requirements and poor hardware conditions.
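One common way to realize this triangulation step, shown below as a minimal sketch rather than the patent's specific algorithm, is a linear direct linear transform (DLT) solved by SVD; it assumes the calibrated 3x4 projection matrix of each camera view and the pixel coordinates of the same joint in each view are given.

```python
import numpy as np


def triangulate_joint(proj_mats, points_2d):
    """Triangulate one joint seen in several calibrated views.

    proj_mats: list of 3x4 camera projection matrices, one per view.
    points_2d: list of (u, v) pixel coordinates of the same joint in each view.
    Returns the 3-D joint position in world coordinates.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                 # (2 * num_views, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # right singular vector of the smallest singular value
    return X[:3] / X[3]                # dehomogenise to (x, y, z)


if __name__ == "__main__":
    # Two toy cameras looking down the z-axis, offset along x (illustrative numbers).
    K = np.array([[500.0, 0, 320], [0, 500, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.array([[0.0], [0.0], [0.0]])])
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
    X_true = np.array([0.2, 0.1, 3.0, 1.0])
    uv1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
    uv2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
    print(triangulate_joint([P1, P2], [uv1, uv2]))  # ~ [0.2, 0.1, 3.0]
```

Repeating this per joint and per registered person yields the full set of three-dimensional key points that make up each three-dimensional human body posture.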
The multi-view-based multi-person three-dimensional human body pose estimation method according to the embodiment of the present application is described in detail with reference to fig. 2.
As shown in fig. 2, the embodiment of the present application includes the following steps:
step S201: two-dimensional human body pose estimation is performed using a transform framework. In the actual execution process, the transform frame can be used for estimating the two-dimensional human body pose of each person of each human body picture based on the data transfer and transformation modes, so that the estimation of the single-person human body pose or the multi-person human body pose is realized, and a foundation is laid for the subsequent three-dimensional human body pose conversion.
Specifically, the two-dimensional human body posture estimation can be performed in the following two ways:
1. The embodiment of the application can use a variant of the ViT model, namely an improved swin-transformer backbone, to replace the human body detection stage in the top-down CPN;
2. The improved swin-transformer backbone is used to completely replace the whole CPN pipeline (human body detection plus key point estimation).
It is understood that the ViT model is a major breakthrough in applying the transformer technology to the computer vision field, and it marks the beginning of representation models that can be unified across natural language processing and computer vision. The ViT model takes image blocks divided at a fixed size as input and feeds them into a transformer encoder to finally obtain the results of image tasks such as image segmentation and image detection, and it can basically obtain good effects.
However, the ViT model has certain defects. First, because visual entities vary greatly and the effective information in an image is not uniformly distributed, the ViT model takes in an excessive amount of input information. Second, with image-form input, the computational complexity of the ViT model is quadratic in the image scale, which makes the computation too large. Together, these two drawbacks mean that ViT is not well suited to most practical environments and places high requirements on computing devices.
To solve this problem, the swin-transformer model adopts a sliding (shifted) window operation so that information of a suitable scale can be fed into the model; the window scheme computes the multi-head self-attention inside non-overlapping local windows. Meanwhile, the model also allows cross-window and cross-scale linkage of information, so effective information from intermediate stages of different scales can be aggregated, the receptive field is enlarged, and higher efficiency is obtained. This structure reduces the computational complexity of the transformer-based image backbone from quadratic in the image scale to linear, thereby expanding the practical application range of the method.
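The linear-complexity claim can be made concrete with a small, illustrative window-partition sketch (an assumption-laden simplification of the swin-transformer, not the patent's implementation): self-attention is scored only within fixed-size, non-overlapping windows, so the attention cost grows with the number of windows, i.e. linearly in image area, instead of quadratically in the total token count.

```python
import torch


def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """x: (B, H, W, C) feature map -> (B * num_windows, window*window, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)


feat = torch.randn(1, 56, 56, 96)            # stage-1 feature map of a Swin-like model
tokens = window_partition(feat, window=7)    # (64, 49, 96): 64 windows of 49 tokens each
# Global attention over all 56*56 = 3136 tokens would score 3136 * 3136 token pairs;
# windowed attention scores 64 * 49 * 49 pairs, a 64x reduction here, and the number of
# windows (hence the cost) grows only linearly with image area.
print(tokens.shape)
```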
Step S202: person registration is performed using the pedestrian re-identification technology. Furthermore, the embodiment of the application can perform person registration identification on each person in each human body picture based on the pre-constructed pedestrian re-identification model, so that the two-dimensional joint points of each person are extracted, providing a data basis for the subsequent triangulation technology and facilitating the multi-view-based pose estimation of the multi-person three-dimensional human body.
The construction method of the pedestrian re-identification model comprises the following steps:
1. and (6) data acquisition. Typical sources of data include: and acquiring a data set for constructing a pedestrian re-identification model from original data intercepted from the video or data of a monitoring camera in the same scene.
2. And generating a pedestrian frame. The construction of the pedestrian re-identification model can cut out pedestrians from the way in a manual mode, a pedestrian detection mode, a tracking mode and the like from the video data, and therefore a pedestrian frame is generated.
3. And marking training data. The marked data comprises information such as a camera label and a pedestrian label.
4. And a training module.
5. And carrying out model test in a test environment.
In the embodiment of the application, a public data set can be used to train the pedestrian re-identification model constructed by deep learning, so the three steps of data acquisition, pedestrian frame generation, and training data annotation can be omitted. This simplifies the process of constructing the pedestrian re-identification model and relaxes the requirements on hardware equipment, so the embodiment of the application can be widely applied in scenes with limited hardware while maintaining relatively high accuracy.
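As a hedged sketch of what training on an already-labelled public data set might look like (the data set, tiny network, identity count, and loss below are illustrative assumptions, not the patent's recipe), the re-identification embedding can be learned with a simple identity-classification objective; after training, the backbone output serves as the feature used for person registration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset


class PublicReidSet(Dataset):
    """Placeholder for a public ReID data set yielding (person crop, identity label)."""

    def __init__(self, num_ids: int = 751, length: int = 1024):
        self.num_ids, self.length = num_ids, length

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        # Random tensors stand in for already-cropped, already-labelled pedestrian images.
        return torch.randn(3, 256, 128), torch.randint(self.num_ids, (1,)).item()


embedder = nn.Sequential(                      # tiny trunk; a real model would be a deep CNN or transformer
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
)
classifier = nn.Linear(128, 751)               # identity-classification head used only during training
optimizer = torch.optim.Adam(list(embedder.parameters()) + list(classifier.parameters()), lr=3e-4)

for crops, ids in DataLoader(PublicReidSet(), batch_size=32):
    loss = nn.functional.cross_entropy(classifier(embedder(crops)), ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# After training, embedder(x) is used as the re-identification feature for person registration.
```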
Step S203: three-dimensional information fitting is completed using the triangulation technology. As a possible implementation manner, the embodiment of the application may adopt a multi-view triangulation technology and, through a triangulation operation, directly fit the two-dimensional joint points determined for each person in each picture to three-dimensional key points in three-dimensional space, thereby generating the three-dimensional human body posture. The triangulation technology adopted in the embodiment of the application has a small calculation scale, can effectively improve the operation rate, and can be widely applied in scenes with low precision requirements and poor hardware conditions.
According to the multi-view-based multi-person three-dimensional human body pose estimation method of the embodiment of the application, the two-dimensional human body pose of each person of each human body picture can be estimated based on a transformer manner, person registration identification is carried out by using a pedestrian re-identification model, the two-dimensional joint points of each person are extracted, and a triangulation operation is carried out on the two-dimensional key points in a multi-view triangulation manner to generate the three-dimensional human body posture. Harsh hardware operating conditions are not needed, multi-person three-dimensional human body pose estimation can be completed while a certain precision is guaranteed, and the applicability is stronger. This solves the technical problems that, in order to realize multi-person three-dimensional human body pose estimation in the related art, more computing resources are occupied at run time, the running time is slow, the method is difficult to apply widely in scenes with poor computing hardware conditions, and the applicability is low.
Next, a multi-person three-dimensional human body pose estimation apparatus based on multiple viewing angles according to an embodiment of the present application will be described with reference to the drawings.
Fig. 3 is a block diagram of a multi-person three-dimensional human body pose estimation apparatus based on multiple viewing angles according to an embodiment of the present application.
As shown in fig. 3, the multi-perspective-based multi-person three-dimensional human body pose estimation apparatus 10 includes: a pose estimation module 100, an identification module 200, and a generation module 300.
Specifically, the pose estimation module 100 is configured to estimate a two-dimensional human body pose of each person of each human body picture based on a transformer manner.
And the identification module 200 is configured to perform person registration identification on each person in each human body picture by using the two-dimensional human body pose of each person and using a pre-established pedestrian re-identification model, so as to determine a two-dimensional joint point of each person.
And the generating module 300 is configured to perform triangulation on the two-dimensional joint points of each person in each picture by using a multi-view triangulation manner, and fit the two-dimensional joint points to three-dimensional key points in a three-dimensional space to generate a three-dimensional human body posture.
Optionally, in an embodiment of the present application, the pose estimation module is further configured to detect a two-dimensional human body pose of each person of each human body picture by using the swin-transformer backbone; or estimate the joint point position of each person by using the swin-transformer backbone, and determine the two-dimensional human body pose.
Optionally, in an embodiment of the present application, the identifying module 200 includes: the device comprises an acquisition unit and a generation unit.
The acquisition unit is used for acquiring a public data set used for training the model.
And the generating unit is used for training the pedestrian re-identification model constructed based on deep learning by using the public data set to generate a pre-constructed pedestrian re-identification model.
Optionally, in an embodiment of the present application, the pose estimation module 100 includes: and an attitude estimation unit.
And the posture estimation unit is used for acquiring the two-dimensional human body posture estimation of each picture based on the ViT-variant framework.
It should be noted that the explanation of the embodiment of the multi-person three-dimensional human body pose estimation method based on multiple viewing angles is also applicable to the multi-person three-dimensional human body pose estimation device based on multiple viewing angles of this embodiment, and details are not repeated here.
According to the multi-view-based multi-person three-dimensional human body pose estimation device of the embodiment of the application, the two-dimensional human body pose of each person of each human body picture can be estimated based on a transformer manner, person registration identification is carried out by using a pedestrian re-identification model, the two-dimensional joint points of each person are extracted, and a triangulation operation is carried out on the two-dimensional key points in a multi-view triangulation manner to generate the three-dimensional human body posture. Harsh hardware operating conditions are not needed, multi-person three-dimensional human body pose estimation can be completed while a certain precision is guaranteed, and the applicability is stronger. This solves the technical problems that, in order to realize multi-person three-dimensional human body pose estimation in the related art, more computing resources are occupied at run time, the running time is slow, the method is difficult to apply widely in scenes with poor computing hardware conditions, and the applicability is low.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
memory 401, processor 402, and computer programs stored on memory 401 and executable on processor 402.
The processor 402 executes the program to implement the multi-view-based multi-person three-dimensional human body pose estimation method provided in the above embodiments.
Further, the electronic device further includes:
a communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing computer programs executable on the processor 402.
Memory 401 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 401, the processor 402 and the communication interface 403 are implemented independently, the communication interface 403, the memory 401 and the processor 402 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may complete mutual communication through an internal interface.
The processor 402 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the multi-perspective-based multi-person three-dimensional human body pose estimation method as above.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of implementing the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A multi-person three-dimensional human body pose estimation method based on multiple visual angles is characterized by comprising the following steps:
estimating the two-dimensional human body pose of each person of each human body picture based on a transformer manner;
carrying out person registration identification on each person of each human body picture by using the two-dimensional human body pose of each person and a pre-constructed pedestrian re-identification model to determine the two-dimensional joint points of each person; and
carrying out a triangulation operation on the two-dimensional joint points of each person in each picture in a multi-view triangulation manner, fitting the two-dimensional joint points to three-dimensional key points in three-dimensional space, and generating a three-dimensional human body posture.
2. The method of claim 1, wherein estimating the two-dimensional human body pose of each person of each human body picture based on a transformer manner comprises:
detecting the two-dimensional human body pose of each person of each human body picture by utilizing the swin-transformer backbone;
or estimating the joint point position of each person by utilizing the swin-transformer backbone, and determining the two-dimensional human body pose.
3. The method according to claim 1, wherein before performing the person registration identification on each person in each human body picture by using the pre-constructed pedestrian re-identification model, the method further comprises:
obtaining a public data set for training the model;
and training a pedestrian re-identification model constructed based on deep learning by using the public data set to generate the pre-constructed pedestrian re-identification model.
4. The method of claim 1, wherein estimating the two-dimensional human body pose of each person of each human body picture based on a transformer manner comprises:
acquiring the two-dimensional human body posture estimation of each picture based on the ViT-variant framework.
5. A multi-person three-dimensional human body pose estimation device based on multiple visual angles is characterized by comprising:
the pose estimation module is used for estimating the two-dimensional human body pose of each person of each human body picture based on a transformer manner;
the identification module is used for carrying out person registration identification on each person of each human body picture by using the two-dimensional human body pose of each person and a pre-constructed pedestrian re-identification model to determine the two-dimensional joint points of each person; and
the generating module is used for carrying out a triangulation operation on the two-dimensional joint points of each person in each picture in a multi-view triangulation manner, fitting the two-dimensional joint points to three-dimensional key points in three-dimensional space, and generating a three-dimensional human body posture.
6. The apparatus of claim 5, wherein the pose estimation module is further configured to detect a two-dimensional human body pose of each person of the each human body picture using a swin-transformer backbone; or estimate the joint point position of each person by using the swin-transformer backbone, and determine the two-dimensional human body pose.
7. The apparatus of claim 5, wherein the identification module comprises:
an acquisition unit for acquiring a public data set for training a model;
and a generating unit for training a pedestrian re-identification model constructed based on deep learning by using the public data set to generate the pre-constructed pedestrian re-identification model.
8. The apparatus of claim 5, wherein the pose estimation module comprises:
a posture estimation unit for acquiring the two-dimensional human body posture estimation of each picture based on the ViT-variant framework.
9. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the multi-perspective based multi-person three-dimensional human body pose estimation method according to any one of claims 1 to 4.
10. A computer-readable storage medium on which a computer program is stored, the program being executed by a processor for implementing the multi-perspective-based multi-person three-dimensional human body pose estimation method according to any one of claims 1 to 4.
CN202210247791.2A 2022-03-14 2022-03-14 Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device Pending CN114663917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210247791.2A CN114663917A (en) 2022-03-14 2022-03-14 Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210247791.2A CN114663917A (en) 2022-03-14 2022-03-14 Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device

Publications (1)

Publication Number Publication Date
CN114663917A true CN114663917A (en) 2022-06-24

Family

ID=82028958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210247791.2A Pending CN114663917A (en) 2022-03-14 2022-03-14 Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device

Country Status (1)

Country Link
CN (1) CN114663917A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842085A (en) * 2022-07-05 2022-08-02 松立控股集团股份有限公司 Full-scene vehicle attitude estimation method
CN114842085B (en) * 2022-07-05 2022-09-16 松立控股集团股份有限公司 Full-scene vehicle attitude estimation method

Similar Documents

Publication Publication Date Title
CN110516620B (en) Target tracking method and device, storage medium and electronic equipment
CN110866953B (en) Map construction method and device, and positioning method and device
US11030525B2 (en) Systems and methods for deep localization and segmentation with a 3D semantic map
CN111795704B (en) Method and device for constructing visual point cloud map
CN103839277B (en) A kind of mobile augmented reality register method of outdoor largescale natural scene
Liang et al. Model-based hand pose estimation via spatial-temporal hand parsing and 3D fingertip localization
Zeng et al. View-invariant gait recognition via deterministic learning
CN111968165B (en) Dynamic human body three-dimensional model complement method, device, equipment and medium
Yang et al. A multi-task Faster R-CNN method for 3D vehicle detection based on a single image
CN112862874B (en) Point cloud data matching method and device, electronic equipment and computer storage medium
CN103854283A (en) Mobile augmented reality tracking registration method based on online study
WO2021098802A1 (en) Object detection device, method, and systerm
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN114612612A (en) Human body posture estimation method and device, computer readable medium and electronic equipment
CN111368733A (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN112861808B (en) Dynamic gesture recognition method, device, computer equipment and readable storage medium
Feng Mask RCNN-based single shot multibox detector for gesture recognition in physical education
CN112669452B (en) Object positioning method based on convolutional neural network multi-branch structure
CN114663917A (en) Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device
Afif et al. Vision-based tracking technology for augmented reality: a survey
CN111531546B (en) Robot pose estimation method, device, equipment and storage medium
CN117132649A (en) Ship video positioning method and device for artificial intelligent Beidou satellite navigation fusion
CN116597336A (en) Video processing method, electronic device, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination