CN109785322B - Monocular human body posture estimation network training method, image processing method and device - Google Patents
- Publication number
- CN109785322B (granted publication); application CN201910099220.7A (CN201910099220A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- picture
- human body
- body posture
- posture estimation
- Prior art date
- Legal status (an assumption, not a legal conclusion): Active
Landscapes
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
An embodiment of the invention discloses a monocular human body posture estimation network training method, an image processing method, and an image processing device. The training method comprises the following steps: obtaining a first sample picture and a second sample picture, where the first sample picture represents a two-dimensional skeleton picture from a first viewing angle and the second sample picture represents a two-dimensional skeleton picture from a second viewing angle; and training a human body posture estimation network according to the first sample picture, the second sample picture, and a preset constraint condition, so that the first three-dimensional feature data obtained from the first sample picture through the network, after being rotated according to a preset rotation relation, and the second three-dimensional feature data obtained from the second sample picture through the network jointly satisfy the preset constraint condition.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a monocular human body posture estimation network training method, an image processing method and an image processing device.
Background
Monocular three-dimensional human body posture estimation is an important class of human-body-related computer vision problems. Its goal is, given a picture containing a human body, to calculate the three-dimensional spatial positions of a number of predefined feature points on the body.
The monocular three-dimensional human body posture estimation problem can be solved with deep learning, i.e. convolutional neural networks. However, existing deep-neural-network algorithms depend on large amounts of manually labeled data produced by motion capture systems. Such systems are complex to deploy, and subjects usually must wear special equipment in a strictly controlled capture environment, which limits wide application. In addition, a three-dimensional body posture estimated from a monocular view (a single picture) loses the three-dimensional structural information of the human body: many three-dimensional postures may correspond to the same two-dimensional posture, and most of them violate anthropometric constraints, for example through implausible limb lengths and joint angles. As a result, the accuracy of three-dimensional human body posture estimation degrades severely under complex conditions such as large-scale changes in human body posture, background environment, and camera viewing angle.
Disclosure of Invention
In order to solve the existing technical problems, the embodiment of the invention provides a monocular human body posture estimation network training method, an image processing method and an image processing device.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
the embodiment of the invention provides a monocular human body posture estimation network training method, which comprises the following steps:
obtaining a first sample picture and a second sample picture; the first sample picture represents a two-dimensional skeleton picture under a first visual angle; the second sample picture represents a two-dimensional skeleton picture under a second visual angle;
and training a human body posture estimation network according to the first sample picture, the second sample picture and a preset constraint condition, so that first three-dimensional feature data obtained by the first sample picture through the human body posture estimation network and second three-dimensional feature data obtained by the second sample picture through the human body posture estimation network meet the preset constraint condition after rotating according to a preset rotation relation.
In the above scheme, the training of the human body posture estimation network according to the first sample picture, the second sample picture and the preset constraint condition includes:
inputting the first sample picture into a human body posture estimation network to obtain first three-dimensional characteristic data corresponding to the first visual angle;
performing data processing on the first three-dimensional characteristic data according to a preset rotation parameter to obtain third three-dimensional characteristic data corresponding to the second visual angle;
inputting the second sample picture into a reference human body posture estimation network to obtain second three-dimensional feature data corresponding to the second visual angle; the reference human body posture estimation network and the human body posture estimation network have the same network structure;
and training the human body posture estimation network according to the third three-dimensional characteristic data and the second three-dimensional characteristic data so as to adjust network parameters of the human body posture estimation network.
In the foregoing solution, the training the human body posture estimation network according to the third three-dimensional feature data and the second three-dimensional feature data includes:
and calculating a loss function according to the third three-dimensional feature data and the second three-dimensional feature data; when the loss function does not meet the preset constraint condition, adjusting the network parameters of the human body posture estimation network to train it, and terminating the training once the loss function meets the preset constraint condition.
In the above solution, the preset rotation parameter is determined based on a difference degree between the second view and the first view.
In the foregoing solution, the obtaining the first sample picture and the second sample picture includes:
respectively obtaining a first picture corresponding to a first visual angle and a second picture corresponding to a second visual angle; the first picture and the second picture correspond to a same sample target object;
respectively obtaining first two-dimensional key point information of the first picture and second two-dimensional key point information of the second picture based on a key point detection network;
and generating a first sample picture based on the first two-dimensional key point information, and generating a second sample picture based on the second two-dimensional key point information.
The embodiment of the invention also provides an image processing method, which comprises the following steps: obtaining a picture to be processed; the picture to be processed comprises a target object;
acquiring two-dimensional key point information of the picture to be processed based on a key point detection network, and generating a two-dimensional skeleton picture corresponding to the target object based on the two-dimensional key point information;
and obtaining target three-dimensional characteristic data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network.
In the foregoing solution, the obtaining target three-dimensional feature data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network includes:
inputting the two-dimensional skeleton picture into the human body posture estimation network to obtain initial three-dimensional characteristic data;
and adjusting the initial three-dimensional characteristic data to obtain target three-dimensional characteristic data.
In the above scheme, the human body posture estimation network is obtained by training based on the monocular human body posture estimation network training method according to the embodiment of the present invention.
The embodiment of the invention also provides a monocular human body posture estimation network training device, which comprises a first processing unit and a network training unit; wherein,
the first processing unit is used for obtaining a first sample picture and a second sample picture; the first sample picture represents a two-dimensional skeleton picture under a first visual angle; the second sample picture represents a two-dimensional skeleton picture under a second visual angle;
the network training unit is configured to train a human body posture estimation network according to the first sample picture, the second sample picture and a preset constraint condition, which are obtained by the first processing unit, so that after the first three-dimensional feature data of the first sample picture obtained through the human body posture estimation network is rotated according to a preset rotation relationship, the first three-dimensional feature data and the second three-dimensional feature data of the second sample picture obtained through the human body posture estimation network meet the preset constraint condition.
In the above scheme, the network training unit is configured to input the first sample picture into a human body posture estimation network, and obtain first three-dimensional feature data corresponding to the first view; performing data processing on the first three-dimensional characteristic data according to a preset rotation parameter to obtain third three-dimensional characteristic data corresponding to the second visual angle; inputting the second sample picture into a reference human body posture estimation network to obtain second three-dimensional feature data corresponding to the second visual angle; the reference human body posture estimation network and the human body posture estimation network have the same network structure; and training the human body posture estimation network according to the third three-dimensional characteristic data and the second three-dimensional characteristic data so as to adjust network parameters of the human body posture estimation network.
In the foregoing scheme, the network training unit is configured to calculate a loss function according to the third three-dimensional feature data and the second three-dimensional feature data; when the loss function does not satisfy the preset constraint condition, it adjusts the network parameters of the human body posture estimation network to train it, and terminates the training once the loss function satisfies the preset constraint condition.
In the foregoing solution, the network training unit is configured to determine a preset rotation parameter based on a difference degree between the second perspective and the first perspective.
In the foregoing solution, the first processing unit is configured to obtain a first picture corresponding to a first view and a second picture corresponding to a second view, respectively; the first picture and the second picture correspond to a same sample target object; respectively obtaining first two-dimensional key point information of the first picture and second two-dimensional key point information of the second picture based on a key point detection network; and generating a first sample picture based on the first two-dimensional key point information, and generating a second sample picture based on the second two-dimensional key point information.
The embodiment of the invention also provides an image processing device, which comprises an acquisition unit and an image processing unit; wherein,
the acquisition unit is used for acquiring a picture to be processed; the picture to be processed comprises a target object;
the image processing unit is used for obtaining two-dimensional key point information of the picture to be processed based on a key point detection network and generating a two-dimensional skeleton picture corresponding to the target object based on the two-dimensional key point information; and obtaining target three-dimensional characteristic data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network.
In the above scheme, the image processing unit is configured to input the two-dimensional skeleton picture into the human body posture estimation network to obtain initial three-dimensional feature data; and adjusting the initial three-dimensional characteristic data to obtain target three-dimensional characteristic data.
In the above scheme, the human body posture estimation network is obtained based on training of the monocular human body posture estimation network training device according to the embodiment of the present invention.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the steps of the monocular human body posture estimation network training method in the embodiment of the invention are realized; alternatively, the program implements the steps of the image processing method according to the embodiment of the present invention when executed by a processor.
The embodiment of the invention also provides a monocular human body posture estimation network training device which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the monocular human body posture estimation network training method provided by the embodiment of the invention.
The embodiment of the invention also provides an image processing device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the image processing method of the embodiment of the invention.
An embodiment of the invention provides a monocular human body posture estimation network training method, an image processing method, and a device. The network training method comprises: obtaining a first sample picture and a second sample picture, where the first sample picture represents a two-dimensional skeleton picture from a first viewing angle and the second sample picture represents a two-dimensional skeleton picture from a second viewing angle; and training a human body posture estimation network according to the first sample picture, the second sample picture, and a preset constraint condition, so that the first three-dimensional feature data obtained from the first sample picture through the network, after being rotated according to a preset rotation relation, and the second three-dimensional feature data obtained from the second sample picture through the network satisfy the preset constraint condition. With this technical scheme, on the one hand, using two-dimensional skeleton pictures from different viewing angles as training data strips away two-dimensional texture information while retaining the common features of the three-dimensional structural information related to human body posture; on the other hand, three-dimensional human body structure features (namely the first and second three-dimensional feature data) are obtained from these common features through weakly supervised training, and are fused into the human body posture estimation network according to the preset constraint condition, so that more accurate three-dimensional human body structure information is obtained in the fusion process. Accordingly, the dependence of the network model on labeled data is reduced and its accuracy is greatly improved; in particular, very high accuracy is still obtained in complex scenes with large-scale changes in human body posture, background environment, camera viewing angle, and the like.
Drawings
FIG. 1 is a schematic flow chart of a monocular human body pose estimation network training method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application of the monocular human body pose estimation network training method according to the embodiment of the present invention;
FIG. 3 is a flowchart illustrating an image processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an application of the image processing method according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of a structure of a monocular human body posture estimation network training device according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a structure of an image processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the hardware structure of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
An embodiment of the invention provides a monocular human body posture estimation network training method. FIG. 1 is a schematic flow chart of the monocular human body posture estimation network training method according to an embodiment of the present invention; as shown in fig. 1, the method includes:
step 101: obtaining a first sample picture and a second sample picture; the first sample picture represents a two-dimensional skeleton picture under a first visual angle; the second sample picture represents a two-dimensional skeleton picture under a second visual angle;
step 102: and training a human body posture estimation network according to the first sample picture, the second sample picture and a preset constraint condition, so that first three-dimensional feature data obtained by the first sample picture through the human body posture estimation network and second three-dimensional feature data obtained by the second sample picture through the human body posture estimation network meet the preset constraint condition after rotating according to a preset rotation relation.
In an optional embodiment of the present invention, the obtaining the first sample picture and the second sample picture includes: respectively obtaining a first picture corresponding to a first visual angle and a second picture corresponding to a second visual angle; the first picture and the second picture correspond to a same sample target object; respectively obtaining first two-dimensional key point information of the first picture and second two-dimensional key point information of the second picture based on a key point detection network; and generating a first sample picture based on the first two-dimensional key point information, and generating a second sample picture based on the second two-dimensional key point information.
In this embodiment, the first sample picture and the second sample picture are derived from a first picture and a second picture acquired by an image acquisition device at a first viewing angle and a second viewing angle, respectively. The image acquisition device may be a camera, or a mobile terminal with an image acquisition module, such as a mobile phone.
The first viewing angle and the second viewing angle each represent a relative positional relationship between the image acquisition device and the target object: the first viewing angle represents that relationship when the first picture is acquired, and the second viewing angle represents it when the second picture is acquired. As an example, taking the straight line directly in front of the target object as a reference, the first viewing angle may be the angle between this reference line and the line connecting the image acquisition device and the target object when the first picture is acquired; the second viewing angle may be the corresponding angle when the second picture is acquired.
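The angle defined above can be computed directly. The following sketch is not part of the patent; the planar layout, the function name, and the use of the dot product are illustrative assumptions:

```python
import numpy as np

def viewing_angle(camera_pos, target_pos, front_dir):
    """Angle in degrees between the camera-to-target line and the target's
    front-facing direction (a hypothetical helper, not the patent's API)."""
    to_camera = np.asarray(camera_pos, dtype=float) - np.asarray(target_pos, dtype=float)
    front = np.asarray(front_dir, dtype=float)
    cos_a = np.dot(to_camera, front) / (np.linalg.norm(to_camera) * np.linalg.norm(front))
    # Clip to guard against floating-point drift outside arccos's domain.
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))
```

A camera directly in front of the target sees it at 0 degrees; one placed to the side sees it at 90 degrees.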
As an example, if a plurality of image capturing devices are provided in a circular area centered on a target object and having a distance R from the target object, a first picture may be captured by a first image capturing device of the plurality of image capturing devices, and a second picture may be captured by a second image capturing device of the plurality of image capturing devices (the second image capturing device is different from the first image capturing device).
In this embodiment, a first sample picture corresponding to the first picture and a second sample picture corresponding to the second picture are obtained through a key point detection network, where the first picture and the second picture are two-dimensional pictures of the same target object from different viewing angles. The first picture and the second picture are each input into the key point detection network to obtain first two-dimensional key point information corresponding to the first picture and second two-dimensional key point information corresponding to the second picture; it can be understood that the network yields the two-dimensional coordinates of the key points of the target object in each picture. As an example, the key points of the target object are bone key points, such as joint points; of course, other key points that can calibrate the limbs of the target object may also serve as key points in this embodiment. Interpolation is then performed between adjacent two-dimensional key points in the first two-dimensional key point information to obtain a two-dimensional skeleton picture (namely the first sample picture) corresponding to the first picture; correspondingly, interpolation between adjacent two-dimensional key points in the second two-dimensional key point information yields a two-dimensional skeleton picture (namely the second sample picture) corresponding to the second picture.
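The interpolation step described above can be sketched as a minimal rasterizer that draws each bone by linearly interpolating pixels between its two key points. The bone connectivity list, image size, and step count are illustrative assumptions not specified by the patent:

```python
import numpy as np

def rasterize_skeleton(keypoints, bones, size=64, steps=32):
    """Render a binary 2D skeleton image: for each (a, b) index pair in
    `bones`, place pixels at linearly interpolated positions between
    keypoints[a] and keypoints[b]. Keypoints are (x, y) coordinates."""
    img = np.zeros((size, size), dtype=np.uint8)
    for a, b in bones:
        p0 = np.asarray(keypoints[a], dtype=float)
        p1 = np.asarray(keypoints[b], dtype=float)
        for t in np.linspace(0.0, 1.0, steps):
            x, y = np.round(p0 + t * (p1 - p0)).astype(int)
            if 0 <= x < size and 0 <= y < size:
                img[y, x] = 1  # row = y, column = x
    return img
```

For example, two key points joined by one bone produce a diagonal line of set pixels between them.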
In practical applications, the key point detection network may be a regression network or a classification network. As an example, the key point detection network at least includes a convolution layer and a pooling layer, and two-dimensional key point information corresponding to the picture is obtained through the key point detection network.
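As an illustration of how two-dimensional key point coordinates might be read out of such a detection network, the sketch below takes the argmax of per-joint heatmaps, one common readout for convolutional key point detectors. The patent does not fix a particular readout, so this is an assumption:

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """Recover an (x, y) coordinate per joint as the argmax location of
    that joint's heatmap (hypothetical readout, not the patent's method)."""
    coords = []
    for hm in heatmaps:
        # unravel_index converts the flat argmax back to (row, col)
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords
```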
In an optional embodiment of the present invention, the training of the human body posture estimation network according to the first sample picture, the second sample picture and a preset constraint condition includes: inputting the first sample picture into a human body posture estimation network to obtain first three-dimensional characteristic data corresponding to the first visual angle; rotating the first three-dimensional characteristic data according to a preset rotation relation to obtain third three-dimensional characteristic data corresponding to the second visual angle; inputting the second sample picture into a reference human body posture estimation network to obtain second three-dimensional feature data corresponding to the second visual angle; the reference human body posture estimation network and the human body posture estimation network have the same network structure; and training the human body posture estimation network according to the third three-dimensional characteristic data and the second three-dimensional characteristic data so as to adjust network parameters of the human body posture estimation network.
In an optional embodiment of the present invention, the training the human body posture estimation network according to the third three-dimensional feature data and the second three-dimensional feature data includes: and calculating a loss function according to the third three-dimensional characteristic data and the second three-dimensional characteristic data, and when the loss function does not meet the preset constraint condition, adjusting network parameters of the human body posture estimation network to train the human body posture estimation network, and terminating the training of the human body posture estimation network until the loss function meets the preset constraint condition.
The human body posture estimation network of this embodiment is trained under mutual constraint with the intermediate data of another human body posture estimation network of the same structure (referred to in this embodiment as the reference human body posture estimation network).
Specifically, FIG. 2 is a schematic diagram of an application of the monocular human body posture estimation network training method according to an embodiment of the present invention. As shown in fig. 2, the complete network model may include a key point detection network and a human body posture estimation network; during training, the human body posture estimation network is constrained against a reference human body posture estimation network. Assume that the human body posture estimation network (Φ) in the first row of fig. 2 is the network to be trained, and the human body posture estimation network (μ) in the second row is the reference network.
The first picture (I_i) from the first viewing angle and the second picture (I_j) from the second viewing angle are each input into the key point detection network. The two-dimensional key points obtained for the first picture are interpolated to obtain a two-dimensional skeleton picture (V_i), and the two-dimensional key points obtained for the second picture are interpolated to obtain a two-dimensional skeleton picture (V_j).
The two-dimensional skeleton picture (V_i) is input into the human body posture estimation network to obtain the first three-dimensional feature data (G_i); at this point, G_i is the three-dimensional feature, under the first viewing angle, corresponding to the first picture (I_i). The two-dimensional skeleton picture (V_j) is input into the reference human body posture estimation network to obtain the second three-dimensional feature data (G_j); at this point, G_j is the three-dimensional feature, under the second viewing angle, corresponding to the second picture (I_j).
The first three-dimensional feature data (G_i) is processed according to the rotation parameter (R_i-j in the figure) to obtain the third three-dimensional feature data (G_ij) corresponding to the second viewing angle. The preset rotation parameter (R_i-j) is determined based on the degree of difference between the second viewing angle and the first viewing angle. As an example, if the first picture and the second picture are acquired by a plurality of image acquisition devices arranged uniformly on a circle centered on the target object at distance R from it, the position and corresponding number of each device can be determined, the viewing-angle difference between any two devices and the target object can be determined from these, and a rotation parameter can be determined and configured in advance for each such difference. It can be understood that, according to the numbers corresponding to the first picture and the second picture, a matching rotation parameter may be selected from the pre-configured rotation parameters, and the first three-dimensional feature data (G_i) rotated according to it. In practical applications, the rotation parameter may be implemented as a matrix.
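A rotation parameter implemented as a matrix can be sketched as follows. For cameras placed on a horizontal circle around the subject, the rotation reduces to a turn about the vertical axis; the axis choice and function names are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def rotation_about_y(angle_deg):
    """Rotation matrix about the vertical (y) axis — one plausible
    realization of a rotation parameter R_i-j between two camera numbers."""
    a = np.radians(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def rotate_features(G_i, R):
    """Rotate every 3D point (each row of G_i) into the other view:
    G_ij = G_i @ R.T, i.e. R applied to each point as a column vector."""
    return np.asarray(G_i, dtype=float) @ R.T
```

For instance, a 90-degree turn about y sends the point (1, 0, 0) to (0, 0, -1).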
Ideally, since the two-dimensional skeleton pictures strip away two-dimensional texture information while keeping the common features of the three-dimensional structural information related to human body posture, the third three-dimensional feature data (G_ij) should be consistent with the second three-dimensional feature data (G_j). However, a two-dimensional picture carries no three-dimensional structural information, so this embodiment uses weakly supervised training: the constraint condition forces G_ij to be as close as possible to G_j, and the network parameters of the human body posture estimation network are adjusted accordingly to train it. In practical applications, the constraint condition sets the convergence condition of a loss function: the degree of difference between G_ij and G_j is calculated in a preset loss-function manner, and it is judged whether this difference satisfies the convergence condition. When the convergence condition is not met, the network parameters of the human body posture estimation network are adjusted to continue training; when the convergence condition is met, the training of the human body posture estimation network terminates.
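The convergence-driven loop just described can be sketched with a toy linear stand-in for the pose network. The real network is learned and non-linear; the linear model G_i = V_i @ W, the mean-squared loss, the learning rate, and the threshold are all illustrative assumptions:

```python
import numpy as np

def mse_loss(G_ij, G_j):
    """Mean squared difference between the rotated view-i prediction and
    the reference view-j prediction."""
    return float(np.mean((np.asarray(G_ij) - np.asarray(G_j)) ** 2))

def train_until_converged(V_i, G_j, R, lr=0.5, eps=1e-6, max_steps=500):
    """Weakly supervised loop on a linear 'pose network' W: predict
    G_i = V_i @ W, rotate into view j via R, descend the MSE gradient,
    and stop once the loss meets the convergence condition."""
    d, k = V_i.shape[1], G_j.shape[1]
    W = np.zeros((d, k))
    loss = mse_loss((V_i @ W) @ R.T, G_j)
    for _ in range(max_steps):
        G_ij = (V_i @ W) @ R.T
        loss = mse_loss(G_ij, G_j)
        if loss < eps:          # convergence condition satisfied: terminate
            break
        # Analytic gradient of the MSE w.r.t. W for this linear model.
        grad = 2.0 / G_ij.size * V_i.T @ ((G_ij - G_j) @ R)
        W -= lr * grad
    return W, loss
```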
In one embodiment, the initial network model for training the human body posture estimation network is used to obtain two-dimensional skeleton pictures of different view angles and is called a multi-view two-dimensional skeleton converter. As shown in fig. 2, a two-dimensional skeleton picture of one view (Vi) is input to the multi-view two-dimensional skeleton converter to obtain a two-dimensional skeleton picture of another view (Vj'); correspondingly, a two-dimensional skeleton picture of the other view (Vj) is input to the multi-view two-dimensional skeleton converter to obtain a two-dimensional skeleton picture of the first view (Vi'). The multi-view two-dimensional skeleton converter comprises an encoder structure for extracting three-dimensional features; this encoder structure serves as the human body posture estimation network of the present application, so the human body posture estimation network can be understood as a partial network structure of the multi-view two-dimensional skeleton converter, and the three-dimensional feature data it outputs serves as process data (also called intermediate data) of the converter. The present embodiment aims to perform weakly supervised training on the human body posture estimation network that outputs three-dimensional feature data within the multi-view two-dimensional skeleton converter.
In practical application, the two-dimensional skeleton picture of the first visual angle (Vi) is input into the human body posture estimation network to obtain the first three-dimensional feature data (Gi); the first three-dimensional feature data (Gi) is processed according to the rotation parameter (Ri-j shown in the figure) to obtain the third three-dimensional feature data (Gij) corresponding to the second viewing angle; the third three-dimensional feature data (Gij) is input into the decoder structure (ψ) of the multi-view two-dimensional skeleton converter to obtain the two-dimensional skeleton picture corresponding to the second view (Vj'). Correspondingly, obtaining a two-dimensional skeleton picture of the first perspective (Vi') from a two-dimensional skeleton picture of the second perspective (Vj) is processed in the same manner as described above and is not described in detail here.
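Both directions through the converter follow the same three-stage path, which can be expressed generically; the identity stand-ins below merely demonstrate the data flow, since the real encoder and decoder are learned networks.

```python
def convert_view(skeleton_2d, encoder, rotate, decoder):
    """One pass through the multi-view two-dimensional skeleton converter:
    Vi -> Gi (encoder) -> Gij (rotation) -> Vj' (decoder)."""
    g = encoder(skeleton_2d)   # human body posture estimation network
    g_rotated = rotate(g)      # apply rotation parameter Ri-j
    return decoder(g_rotated)  # decoder structure (psi)

# placeholder components, for data-flow demonstration only
identity = lambda x: x
v_out = convert_view([0.1, 0.2, 0.3], identity, identity, identity)
```

The reverse direction (Vj to Vi') reuses the same function with the inverse rotation parameter.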
As an embodiment, the human body posture estimation network may be implemented by an encoder network. As an example, the encoder network may include at least a convolution layer, a Rectified Linear Unit (ReLU) layer, and a Batch Normalization (BN) layer, through which the three-dimensional feature data is obtained.
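A pure-Python sketch of the two non-convolution layers named above; this inference-style batch normalization omits the learned scale and shift parameters of a full BN layer, so it is a simplified illustration rather than the patent's implementation.

```python
import math

def relu(vector):
    """Rectified Linear Unit: clamp negative activations to zero."""
    return [max(0.0, x) for x in vector]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature channel to zero mean / unit variance across
    the batch (no learned affine parameters, for brevity)."""
    n, dims = len(batch), len(batch[0])
    cols = []
    for d in range(dims):
        col = [row[d] for row in batch]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        cols.append([(x - mean) / math.sqrt(var + eps) for x in col])
    # transpose back to one row per sample
    return [[cols[d][i] for d in range(dims)] for i in range(n)]
```

In the encoder, each convolution would typically be followed by such a BN step and then a ReLU.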
By adopting the technical scheme of the embodiment of the invention, on one hand, two-dimensional texture information is stripped by taking two-dimensional skeleton pictures at different visual angles as training data of the human body posture estimation network, and the common feature of three-dimensional structure information related to the human body posture is retained. On the other hand, three-dimensional human body structure features representing the human body structure (namely the first and second three-dimensional feature data) are obtained from these common features through a weakly supervised training mode and are fused into the human body posture estimation network according to the preset constraint condition, so that more accurate three-dimensional human body structure information is obtained in the fusion process. This reduces the dependence of the network model on labeled data and greatly improves its precision; in particular, very high precision is still obtained in complex scenes such as large-range human body posture changes, varied background environments, and camera view-angle changes.
The embodiment of the invention also provides an image processing method. FIG. 3 is a flowchart illustrating an image processing method according to an embodiment of the present invention; as shown in fig. 3, the method includes:
step 201: obtaining a picture to be processed; the picture to be processed comprises a target object;
step 202: acquiring two-dimensional key point information of the picture to be processed based on a key point detection network, and generating a two-dimensional skeleton picture corresponding to the target object based on the two-dimensional key point information;
step 203: and obtaining target three-dimensional characteristic data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network.
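Steps 201 to 203 amount to a three-stage inference pipeline, sketched below with placeholder callables standing in for the trained key point detection network and human body posture estimation network.

```python
def estimate_pose(picture, keypoint_net, skeletonize, pose_net):
    """Step 201-203: picture -> 2-D key points -> 2-D skeleton picture
    -> target three-dimensional feature data."""
    keypoints = keypoint_net(picture)  # step 202: key point detection
    skeleton = skeletonize(keypoints)  # step 202: build skeleton picture
    return pose_net(skeleton)          # step 203: pose estimation network

# toy placeholders, for demonstration only
result = estimate_pose(
    "picture_to_process",
    lambda pic: [(0.0, 0.0), (1.0, 1.0)],         # fake detector
    lambda kps: kps,                              # fake skeletonizer
    lambda skel: [(x, y, 0.0) for x, y in skel],  # fake 3-D lift
)
```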
In this embodiment, the training process of the human body posture estimation network may refer to the specific description in the above embodiments, and is not repeated here for brevity.
In an optional embodiment of the present invention, the obtaining of target three-dimensional feature data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network includes: inputting the two-dimensional skeleton picture into the human body posture estimation network to obtain initial three-dimensional feature data; and adjusting the initial three-dimensional feature data to obtain the target three-dimensional feature data.
In this embodiment, the obtaining manner of the two-dimensional skeleton picture may refer to the detailed description of the obtaining manner of the first sample picture (or the second sample picture) in the foregoing embodiment, and the obtaining manner of the initial three-dimensional feature data may refer to the detailed description of the obtaining manner of the three-dimensional feature data in the human body posture estimation network, which is not repeated herein for brevity.
In this embodiment, the adjustment of the initial three-dimensional feature data may specifically be an adjustment for reducing data volume. For example, the initial three-dimensional feature data may correspond to several hundred three-dimensional key points, while after the adjustment processing the target three-dimensional feature data may correspond to only a few dozen target three-dimensional key points.
FIG. 4 is a schematic diagram illustrating an application of the image processing method according to the embodiment of the present invention. Specifically, as shown in fig. 4, a two-dimensional picture, which may be of any view angle, is input to the key point detection network; two-dimensional key points corresponding to the input picture are obtained through the key point detection network, and interpolation processing is carried out on the obtained two-dimensional key points to obtain a two-dimensional skeleton picture; the two-dimensional skeleton picture is input into the human body posture estimation network to obtain initial three-dimensional feature data (G); and the initial three-dimensional feature data (G) is then fine-tuned through a Shallow Network to obtain the target three-dimensional feature data.
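The interpolation that turns detected key points into a drawable skeleton can be sketched as linear interpolation along each limb; the step count and the idea of pairing key points into limbs are illustrative assumptions.

```python
def interpolate_limb(p, q, steps=4):
    """Return steps+1 evenly spaced 2-D points from key point p to key
    point q, so the limb can be rasterized into the skeleton picture."""
    return [((1 - t) * p[0] + t * q[0], (1 - t) * p[1] + t * q[1])
            for t in (k / steps for k in range(steps + 1))]

limb = interpolate_limb((0.0, 0.0), (4.0, 8.0))
```

Applying this to every connected pair of key points yields the full two-dimensional skeleton picture.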
By adopting the technical scheme of the embodiment of the invention, on one hand, two-dimensional texture information is stripped by taking two-dimensional skeleton pictures at different visual angles as training data of the human body posture estimation network, and the common feature of three-dimensional structure information related to the human body posture is retained. On the other hand, three-dimensional human body structure features representing the human body structure (namely the first and second three-dimensional feature data) are obtained from these common features through a weakly supervised training mode and are fused into the human body posture estimation network according to the preset constraint condition, so that more accurate three-dimensional human body structure information is obtained in the fusion process. This reduces the dependence of the network model on labeled data and greatly improves its precision; in particular, very high precision is still obtained in complex scenes such as large-range human body posture changes, varied background environments, and camera view-angle changes.
The embodiment of the invention also provides a monocular human body posture estimation network training device; FIG. 5 is a schematic diagram of a structure of a monocular human posture estimation network training device according to an embodiment of the present invention; as shown in fig. 5, the apparatus includes a first processing unit 31 and a network training unit 32; wherein,
the first processing unit 31 is configured to obtain a first sample picture and a second sample picture; the first sample picture represents a two-dimensional skeleton picture under a first visual angle; the second sample picture represents a two-dimensional skeleton picture under a second visual angle;
the network training unit 32 is configured to train the human body posture estimation network according to the first sample picture, the second sample picture and a preset constraint condition, which are obtained by the first processing unit 31, so that after the first three-dimensional feature data obtained by the first sample picture through the human body posture estimation network is rotated according to a preset rotation relationship, the first three-dimensional feature data and the second three-dimensional feature data obtained by the second sample picture through the human body posture estimation network satisfy the preset constraint condition.
In an optional embodiment of the present invention, the network training unit 32 is configured to input the first sample picture into a human body posture estimation network, and obtain first three-dimensional feature data corresponding to the first view; performing data processing on the first three-dimensional characteristic data according to a preset rotation parameter to obtain third three-dimensional characteristic data corresponding to the second visual angle; inputting the second sample picture into a reference human body posture estimation network to obtain second three-dimensional feature data corresponding to the second visual angle; the reference human body posture estimation network and the human body posture estimation network have the same network structure; and training the human body posture estimation network according to the third three-dimensional characteristic data and the second three-dimensional characteristic data so as to adjust network parameters of the human body posture estimation network.
In an optional embodiment of the present invention, the network training unit 32 is configured to calculate a loss function according to the third three-dimensional feature data and the second three-dimensional feature data, and when the loss function does not satisfy the preset constraint condition, adjust a network parameter of the human body posture estimation network to train the human body posture estimation network, until the loss function satisfies the preset constraint condition, terminate training of the human body posture estimation network.
In an optional embodiment of the invention, the network training unit 32 is configured to determine a preset rotation parameter based on a degree of difference between the second perspective and the first perspective.
In an optional embodiment of the present invention, the first processing unit 31 is configured to obtain a first picture corresponding to a first view and a second picture corresponding to a second view, respectively; the first picture and the second picture correspond to a same sample target object; respectively obtaining first two-dimensional key point information of the first picture and second two-dimensional key point information of the second picture based on a key point detection network; and generating a first sample picture based on the first two-dimensional key point information, and generating a second sample picture based on the second two-dimensional key point information.
In the embodiment of the present invention, the first processing unit 31 and the network training unit 32 in the apparatus may, in practical application, be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that: in the monocular human body posture estimation network training device provided in the above embodiment, when performing monocular human body posture estimation network training, only the division of the above program modules is taken as an example, and in practical applications, the above processing may be distributed to different program modules as needed, that is, the internal structure of the device may be divided into different program modules to complete all or part of the above-described processing. In addition, the monocular human body posture estimation network training device provided by the above embodiment and the monocular human body posture estimation network training method embodiment belong to the same concept, and the specific implementation process thereof is detailed in the method embodiment and is not described herein again.
Fig. 6 is a schematic diagram of a structure of the image processing apparatus according to the embodiment of the present invention; as shown in fig. 6, the apparatus includes an acquisition unit 33 and an image processing unit 34; wherein,
the acquiring unit 33 is configured to acquire a picture to be processed; the picture to be processed comprises a target object;
the image processing unit 34 is configured to obtain two-dimensional key point information of the picture to be processed based on a key point detection network, and generate a two-dimensional skeleton picture corresponding to the target object based on the two-dimensional key point information; and obtaining target three-dimensional characteristic data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network.
In this embodiment, the human body posture estimation network may be obtained by training based on the monocular human body posture estimation network training device of the embodiment of the present invention, and details are not repeated here.
In an optional embodiment of the present invention, the image processing unit 34 is configured to input the two-dimensional skeleton picture into the human body posture estimation network, so as to obtain initial three-dimensional feature data; and adjusting the initial three-dimensional characteristic data to obtain target three-dimensional characteristic data.
In the embodiment of the present invention, the obtaining unit 33 and the image processing unit 34 in the apparatus can be implemented by a CPU, a DSP, an MCU, or an FPGA in practical application.
It should be noted that: the image processing apparatus provided in the above embodiment is exemplified by the division of each program module when performing image processing, and in practical applications, the processing may be distributed to different program modules according to needs, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 7 is a schematic diagram of a hardware structure of the processing apparatus according to the embodiment of the present invention, as shown in fig. 7, the apparatus includes a memory 42, a processor 41, and a computer program stored in the memory 42 and capable of running on the processor 41, and when the processor 41 executes the computer program, the steps of the monocular human body posture estimation network training method according to the embodiment of the present invention are implemented; alternatively, the processor 41 implements the steps of the image processing method according to the embodiment of the present invention when executing the program.
It will be appreciated that the various components within the processing device may be coupled together by a bus system 43. It will be appreciated that the bus system 43 is used to enable communications among the components. The bus system 43 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 43 in fig. 7.
It will be appreciated that the memory 42 can be either volatile memory or nonvolatile memory, and can also include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 42 described in the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present invention may be applied to the processor 41, or implemented by the processor 41. The processor 41 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 41. The processor 41 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. Processor 41 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in memory 42, where processor 41 reads the information in memory 42 and in combination with its hardware performs the steps of the method described above.
In an exemplary embodiment, the monocular human body posture estimation network training device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), FPGAs, general purpose processors, controllers, MCUs, microprocessors, or other electronic components for performing the aforementioned methods.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the steps of the monocular human body posture estimation network training method in the embodiment of the invention are realized; alternatively, the program implements the steps of the image processing method according to the embodiment of the present invention when executed by a processor.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (17)
1. A monocular human body posture estimation network training method is characterized by comprising the following steps:
obtaining a first sample picture and a second sample picture; the first sample picture represents a two-dimensional skeleton picture under a first visual angle; the second sample picture represents a two-dimensional skeleton picture under a second visual angle;
and training a human body posture estimation network according to the first sample picture, the second sample picture and a preset constraint condition, so that first three-dimensional feature data obtained by the first sample picture through the human body posture estimation network and second three-dimensional feature data obtained by the second sample picture through the human body posture estimation network meet the preset constraint condition after rotating according to a preset rotation relation.
2. The method of claim 1, wherein the training of the human pose estimation network according to the first sample picture, the second sample picture and preset constraints comprises:
inputting the first sample picture into a human body posture estimation network to obtain first three-dimensional characteristic data corresponding to the first visual angle;
performing data processing on the first three-dimensional characteristic data according to a preset rotation parameter to obtain third three-dimensional characteristic data corresponding to the second visual angle;
inputting the second sample picture into a reference human body posture estimation network to obtain second three-dimensional feature data corresponding to the second visual angle; the reference human body posture estimation network and the human body posture estimation network have the same network structure;
and training the human body posture estimation network according to the third three-dimensional characteristic data and the second three-dimensional characteristic data so as to adjust network parameters of the human body posture estimation network.
3. The method of claim 2, wherein training the body pose estimation network based on the third three-dimensional feature data and the second three-dimensional feature data comprises:
and calculating a loss function according to the third three-dimensional characteristic data and the second three-dimensional characteristic data, and when the loss function does not meet the preset constraint condition, adjusting network parameters of the human body posture estimation network to train the human body posture estimation network, and terminating the training of the human body posture estimation network until the loss function meets the preset constraint condition.
4. A method according to claim 2 or 3, wherein the preset rotation parameter is determined based on the degree of difference between the second view and the first view.
5. The method according to any one of claims 1 to 4, wherein the obtaining the first sample picture and the second sample picture comprises:
respectively obtaining a first picture corresponding to a first visual angle and a second picture corresponding to a second visual angle; the first picture and the second picture correspond to a same sample target object;
respectively obtaining first two-dimensional key point information of the first picture and second two-dimensional key point information of the second picture based on a key point detection network;
and generating a first sample picture based on the first two-dimensional key point information, and generating a second sample picture based on the second two-dimensional key point information.
6. An image processing method, characterized in that the method adopts the human body posture estimation network trained by the method of any one of claims 1-5 to perform image processing, and the method comprises the following steps:
obtaining a picture to be processed; the picture to be processed comprises a target object;
acquiring two-dimensional key point information of the picture to be processed based on a key point detection network, and performing interpolation processing on two-dimensional key points corresponding to the two-dimensional key point information to generate a two-dimensional skeleton picture corresponding to the target object;
obtaining target three-dimensional characteristic data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network;
obtaining target three-dimensional feature data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network, wherein the obtaining of the target three-dimensional feature data comprises:
and processing the two-dimensional skeleton picture based on the human body posture estimation network to obtain three-dimensional feature data corresponding to one visual angle, and rotating the three-dimensional feature data according to a rotation relation set based on the human body posture to obtain the target three-dimensional feature data corresponding to the other visual angle.
7. The method according to claim 6, wherein the obtaining target three-dimensional feature data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network comprises:
inputting the two-dimensional skeleton picture into the human body posture estimation network to obtain initial three-dimensional characteristic data;
and adjusting the initial three-dimensional characteristic data to obtain target three-dimensional characteristic data.
8. A monocular human body posture estimation network training device is characterized by comprising a first processing unit and a network training unit; wherein,
the first processing unit is used for obtaining a first sample picture and a second sample picture; the first sample picture represents a two-dimensional skeleton picture under a first visual angle; the second sample picture represents a two-dimensional skeleton picture under a second visual angle;
the network training unit is configured to train a human body posture estimation network according to the first sample picture, the second sample picture and a preset constraint condition, which are obtained by the first processing unit, so that after the first three-dimensional feature data of the first sample picture obtained through the human body posture estimation network is rotated according to a preset rotation relationship, the first three-dimensional feature data and the second three-dimensional feature data of the second sample picture obtained through the human body posture estimation network meet the preset constraint condition.
9. The apparatus according to claim 8, wherein the network training unit is configured to input the first sample picture into a human body pose estimation network, and obtain first three-dimensional feature data corresponding to the first view angle; performing data processing on the first three-dimensional characteristic data according to a preset rotation parameter to obtain third three-dimensional characteristic data corresponding to the second visual angle; inputting the second sample picture into a reference human body posture estimation network to obtain second three-dimensional feature data corresponding to the second visual angle; the reference human body posture estimation network and the human body posture estimation network have the same network structure; and training the human body posture estimation network according to the third three-dimensional characteristic data and the second three-dimensional characteristic data so as to adjust network parameters of the human body posture estimation network.
10. The apparatus according to claim 9, wherein the network training unit is configured to calculate a loss function according to the third three-dimensional feature data and the second three-dimensional feature data, and when the loss function does not satisfy the preset constraint condition, adjust a network parameter of the body posture estimation network to train the body posture estimation network until the loss function satisfies the preset constraint condition, and terminate training of the body posture estimation network.
11. The apparatus according to claim 9 or 10, wherein the network training unit is configured to determine the preset rotation parameter based on the degree of difference between the second visual angle and the first visual angle.
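One plausible reading of claim 11 is sketched below: derive the rotation parameter directly from the angular difference between the two camera visual angles. Representing that difference as a single yaw about the vertical axis is an assumption for illustration; a real rig could require a full extrinsic rotation.

```python
import numpy as np

def preset_rotation_from_views(first_view_deg, second_view_deg):
    """Derive a preset rotation parameter from the degree of difference
    between the second and first visual angles, returned as a 3x3
    rotation matrix about the vertical y axis."""
    delta = np.deg2rad(second_view_deg - first_view_deg)
    c, s = np.cos(delta), np.sin(delta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])
```

The result is orthonormal with determinant 1, so it preserves limb lengths when applied to 3D joint coordinates, which is what makes cross-view supervision geometrically sound.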
12. The apparatus according to any one of claims 8 to 11, wherein the first processing unit is configured to obtain a first picture corresponding to the first visual angle and a second picture corresponding to the second visual angle, the first picture and the second picture corresponding to the same sample target object; obtain first two-dimensional key point information of the first picture and second two-dimensional key point information of the second picture, respectively, based on a key point detection network; and generate the first sample picture based on the first two-dimensional key point information and the second sample picture based on the second two-dimensional key point information.
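Claims 12 and 13 both turn detected 2D key points into a skeleton picture by interpolating points along each limb. A minimal rasterization sketch, assuming a hypothetical two-joint skeleton and an 8×8 canvas (a real model would use the full joint set and limb connectivity of the key point detection network):

```python
import numpy as np

# Hypothetical connectivity: one limb joining key points 0 and 1.
LIMBS = [(0, 1)]

def rasterize_skeleton(keypoints, size=8):
    """Build a 2D skeleton picture by linearly interpolating points
    along each limb between its two detected key points."""
    canvas = np.zeros((size, size), dtype=np.uint8)
    for a, b in LIMBS:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Enough samples that adjacent pixels along the limb connect.
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for t in np.linspace(0.0, 1.0, n):
            x = int(round(x0 + t * (x1 - x0)))
            y = int(round(y0 + t * (y1 - y0)))
            canvas[y, x] = 255
    return canvas
```

The resulting picture encodes only skeletal geometry, which is why the same network can be trained on sample pictures from different visual angles without appearance differences interfering.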
13. An image processing apparatus characterized by comprising an acquisition unit and an image processing unit; wherein,
the acquisition unit is used for acquiring a picture to be processed; the picture to be processed comprises a target object;
the image processing unit is used for obtaining two-dimensional key point information of the picture to be processed based on a key point detection network, performing interpolation processing on two-dimensional key points corresponding to the two-dimensional key point information, and generating a two-dimensional skeleton picture corresponding to the target object; obtaining target three-dimensional characteristic data corresponding to the target object based on the two-dimensional skeleton picture and the human body posture estimation network;
the image processing unit is used for processing the two-dimensional skeleton picture based on the human body posture estimation network to obtain three-dimensional feature data corresponding to one visual angle, and rotating the three-dimensional feature data according to a preset rotation relationship to obtain target three-dimensional feature data corresponding to another visual angle; the human body posture estimation network is trained by the monocular human body posture estimation network training apparatus of any one of claims 8 to 12.
14. The apparatus according to claim 13, wherein the image processing unit is configured to input the two-dimensional skeleton picture into the human body posture estimation network to obtain initial three-dimensional feature data, and adjust the initial three-dimensional feature data to obtain the target three-dimensional feature data.
15. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5, or the steps of the method according to any one of claims 6 to 7.
16. A monocular body pose estimation network training device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the method of any one of claims 1 to 5.
17. An image processing apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any one of claims 6 to 7 are implemented when the program is executed by the processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910099220.7A CN109785322B (en) | 2019-01-31 | 2019-01-31 | Monocular human body posture estimation network training method, image processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109785322A CN109785322A (en) | 2019-05-21 |
CN109785322B true CN109785322B (en) | 2021-07-02 |
Family
ID=66502979
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910099220.7A Active CN109785322B (en) | 2019-01-31 | 2019-01-31 | Monocular human body posture estimation network training method, image processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109785322B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112287730A (en) * | 2019-07-24 | 2021-01-29 | 鲁班嫡系机器人(深圳)有限公司 | Gesture recognition method, device, system, storage medium and equipment |
CN111462209B (en) * | 2020-03-31 | 2022-05-24 | 北京市商汤科技开发有限公司 | Action migration method, device, equipment and storage medium |
CN111582204A (en) * | 2020-05-13 | 2020-08-25 | 北京市商汤科技开发有限公司 | Attitude detection method and apparatus, computer device and storage medium |
CN112418399B (en) * | 2020-11-20 | 2024-03-26 | 清华大学 | Method and device for training gesture estimation model and method and device for gesture estimation |
CN113762015A (en) * | 2021-01-05 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Image processing method and device |
CN113743221B (en) * | 2021-08-04 | 2022-05-20 | 清华大学 | Multi-view pedestrian behavior identification method and system under edge computing architecture |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104103075A (en) * | 2014-07-24 | 2014-10-15 | 北京邮电大学 | Multi-view human body part semantic matching method and device |
CN108320310A (en) * | 2018-02-06 | 2018-07-24 | 哈尔滨工业大学 | Extraterrestrial target 3 d pose method of estimation based on image sequence |
CN108960036A (en) * | 2018-04-27 | 2018-12-07 | 北京市商汤科技开发有限公司 | 3 D human body attitude prediction method, apparatus, medium and equipment |
CN109242950A (en) * | 2018-07-11 | 2019-01-18 | 天津大学 | Multi-angle of view human body dynamic three-dimensional reconstruction method under more close interaction scenarios of people |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102043966B (en) * | 2010-12-07 | 2012-11-28 | 浙江大学 | Face recognition method based on combination of partial principal component analysis (PCA) and attitude estimation |
CN108447090B (en) * | 2016-12-09 | 2021-12-21 | 株式会社理光 | Object posture estimation method and device and electronic equipment |
Non-Patent Citations (5)
Title |
---|
"Muti-view 3D Models from Single Images with a Convolutional Network";Maxim Tatarchenko et al.;《arXiv》;20160802;全文 * |
"Ordinal Depth Supervision for 3D Human Pose Estimation";Georigios Pavlakos et al.;《arXiv》;20180510;第1-10页,图1、3 * |
"Single-view to Multi-view:Reconstructing Unseen VIews with a Convolutional Network";Maxxim Tatarchenko et al.;《arXiv》;20151120;全文 * |
"Unsupoervised Geometry-Aware Representation for 3D Human Pose Estimation";Helge Rhodin et al;《arXiv》;20180403;参见第3-6页,图3 * |
"基于2D-3D泛轮廓点对应的三维刚体目标的迭代姿态估计";冷大炜 等;《中国科学院研究生学报》;20121130;第29卷(第6期);全文 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109785322B (en) | Monocular human body posture estimation network training method, image processing method and device | |
WO2020156143A1 (en) | Three-dimensional human pose information detection method and apparatus, electronic device and storage medium | |
WO2021169839A1 (en) | Action restoration method and device based on skeleton key points | |
CN108182384B (en) | Face feature point positioning method and device | |
EP1952354B1 (en) | Determining camera motion | |
CN113706699B (en) | Data processing method and device, electronic equipment and computer readable storage medium | |
CN112308932B (en) | Gaze detection method, device, equipment and storage medium | |
Kong et al. | Head pose estimation from a 2D face image using 3D face morphing with depth parameters | |
EP3186787A1 (en) | Method and device for registering an image to a model | |
US20180225882A1 (en) | Method and device for editing a facial image | |
CN113160418A (en) | Three-dimensional reconstruction method, device and system, medium and computer equipment | |
CN111080776A (en) | Processing method and system for human body action three-dimensional data acquisition and reproduction | |
CN113902852A (en) | Face three-dimensional reconstruction method and device, electronic equipment and storage medium | |
CN113902851A (en) | Face three-dimensional reconstruction method and device, electronic equipment and storage medium | |
CN116580169B (en) | Digital man driving method and device, electronic equipment and storage medium | |
CN115205737B (en) | Motion real-time counting method and system based on transducer model | |
CN115100745B (en) | Swin transducer model-based motion real-time counting method and system | |
CN115205750B (en) | Motion real-time counting method and system based on deep learning model | |
WO2022011621A1 (en) | Face illumination image generation apparatus and method | |
CN108426566B (en) | Mobile robot positioning method based on multiple cameras | |
CN112790786A (en) | Point cloud data registration method and device, ultrasonic equipment and storage medium | |
CN117275089A (en) | Character recognition method, device and equipment for monocular camera and storage medium | |
CN117649454B (en) | Binocular camera external parameter automatic correction method and device, electronic equipment and storage medium | |
EP4336455A1 (en) | Reconstruction of a 3d mesh of a person | |
WO2021131103A1 (en) | Distance image processing device and distance image processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||