WO2022131390A1 - Self-supervised learning-based three-dimensional human posture estimation method using multi-view images - Google Patents

Self-supervised learning-based three-dimensional human posture estimation method using multi-view images Download PDF

Info

Publication number
WO2022131390A1
Authority
WO
WIPO (PCT)
Prior art keywords
posture
dimensional
estimation
view
network
Prior art date
Application number
PCT/KR2020/018365
Other languages
French (fr)
Korean (ko)
Inventor
윤주홍 (Yoon Ju-hong)
박민규 (Park Min-gyu)
장인호 (Jang In-ho)
김제우 (Kim Je-woo)
Original Assignee
Korea Electronics Technology Institute (한국전자기술연구원)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Korea Electronics Technology Institute (한국전자기술연구원)
Publication of WO2022131390A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/60 Rotation of whole images or parts thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/204 Image signal generators using stereoscopic image cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Provided is a self-supervised learning-based three-dimensional human posture estimation method using multi-view images. A three-dimensional human posture estimation method according to an embodiment of the present invention uses self-supervised learning, among deep learning techniques, to restore a three-dimensional posture from only an image, two-dimensional posture information, and camera parameters. The network can therefore be optimized using only its own input data and two-dimensional posture labels, without three-dimensional label data.

Description

Self-supervised learning-based three-dimensional human posture estimation method using multi-view images
The present invention relates to artificial intelligence technology for image processing and, more particularly, to a method of restoring a three-dimensional (3D) posture from an image, two-dimensional (2D) posture information, and camera parameters using a self-supervised learning method.
Existing 3D human posture restoration technology has mainly recovered the 3D positions of the joints by attaching a large number of expensive sensors directly to the person; its dependence on additional equipment and the resulting inconvenience make it unsuitable for real-life use.
Recently, deep learning performance has improved dramatically with faster computers, larger datasets, and better algorithms, and deep learning applied to 3D human posture restoration has shown respectable results in specific data environments.
However, deep learning techniques optimize model performance based on the error obtained by comparing a large number of input samples with their ground-truth (label) data, and applying them to 3D posture estimation raises the following problems.
First, a supervised deep learning model cannot be trained without a large amount of label data.
Second, unlike 2D human posture data, producing label data for 3D human postures requires a great deal of time, human resources, and equipment.
Third, because so many resources are required, few datasets have been published, and only limited data exist for training deep learning models.
Because of these problems, supervised deep learning cannot deliver its performance when label data are unavailable.
The present invention has been devised to solve the above problems. Its object is to improve the supervised 3D human posture estimation approach, which depends on 3D label data, into a self-supervised learning approach that is relatively free of label data, thereby providing a method capable of general-purpose posture estimation.
To achieve the above object, a network learning method for 3D posture estimation according to an embodiment of the present invention includes: a first estimation step of estimating a 3D posture of a first viewpoint from an input image of the first viewpoint, using a network for 3D posture estimation; a second estimation step of estimating a 3D posture of a second viewpoint from an input image of the second viewpoint, using the network; a first rotation step of rotating the 3D posture of the first viewpoint to the second viewpoint; and a step of training the network for 3D posture estimation using a loss function between the 3D posture obtained in the second estimation step and the 3D posture obtained in the first rotation step.
The network learning method according to an embodiment of the present invention may further include: a second rotation step of rotating the 3D posture of the second viewpoint to the first viewpoint; and a step of training the network using a loss function between the 3D posture obtained in the first estimation step and the 3D posture obtained in the second rotation step.
In the first rotation step, a camera rotation matrix converting the camera's first viewpoint to the second viewpoint may be applied to rotate the 3D posture of the first viewpoint to the second viewpoint; in the second rotation step, a camera rotation matrix converting the camera's second viewpoint to the first viewpoint may be applied to rotate the 3D posture of the second viewpoint to the first viewpoint.
The method may further include: a first conversion step of converting the 3D posture obtained in the first rotation step into a 2D posture of the second viewpoint; and a step of training the network using a loss function between the 2D posture of the second viewpoint obtained in the first conversion step and the 2D posture label of the input image of the second viewpoint.
The method may further include: a second conversion step of converting the 3D posture obtained in the second rotation step into a 2D posture of the first viewpoint; and a step of training the network using a loss function between the 2D posture of the first viewpoint obtained in the second conversion step and the 2D posture label of the input image of the first viewpoint.
In the first conversion step, camera parameters may be used to convert the 3D posture obtained in the first rotation step into the 2D posture of the second viewpoint; in the second conversion step, camera parameters may be used to convert the 3D posture obtained in the second rotation step into the 2D posture of the first viewpoint.
The input image may be a two-dimensional input image.
The method may further include estimating a 3D posture by inputting an input image to the trained network for 3D posture estimation.
The method may further include outputting the 3D posture estimated in the estimation step together with the input image.
Meanwhile, a 3D posture estimation system according to another embodiment of the present invention includes: an estimator that, using a network for 3D posture estimation, estimates a 3D posture of a first viewpoint from an input image of the first viewpoint and a 3D posture of a second viewpoint from an input image of the second viewpoint; a rotation unit that rotates the 3D posture of the first viewpoint to the second viewpoint; and a learning unit that trains the network using a loss function between the 3D posture obtained by the estimator and the 3D posture obtained by the rotation unit.
Meanwhile, a 3D posture estimation method according to another embodiment of the present invention includes: estimating a 3D posture by inputting an input image to a network for 3D posture estimation; and outputting information on the 3D posture estimated in the estimation step, wherein the network has been trained by estimating a 3D posture of a first viewpoint from an input image of the first viewpoint, estimating a 3D posture of a second viewpoint from an input image of the second viewpoint, rotating the 3D posture of the first viewpoint to the second viewpoint, and using a loss function between the 3D posture of the second viewpoint obtained through estimation and the 3D posture of the second viewpoint obtained through rotation.
Meanwhile, a 3D posture estimation system according to another embodiment of the present invention includes: an estimator that estimates a 3D posture by inputting an input image to a network for 3D posture estimation; and an output unit that outputs information on the 3D posture estimated by the estimator, wherein the network has been trained by estimating a 3D posture of a first viewpoint from an input image of the first viewpoint, estimating a 3D posture of a second viewpoint from an input image of the second viewpoint, rotating the 3D posture of the first viewpoint to the second viewpoint, and using a loss function between the 3D posture of the second viewpoint obtained through estimation and the 3D posture of the second viewpoint obtained through rotation.
As described above, according to embodiments of the present invention, the 3D human posture estimation model that used to depend on label data is improved with a self-supervised learning approach, so that a 3D human posture can be estimated using only images, 2D postures, and camera parameters.
Furthermore, according to embodiments of the present invention, providing 3D postures of other viewpoints by means of camera rotation matrices removes the need for 3D human posture label data.
In addition, according to embodiments of the present invention, posture estimation can be applied in any situation where camera settings and 2D label data are available.
FIG. 1 is a block diagram of supervised deep learning for posture estimation;
FIG. 2 is a block diagram of self-supervised deep learning;
FIG. 3 illustrates a 3D human posture estimation system according to an embodiment of the present invention;
FIG. 4 is a flowchart of the 3D posture information acquisition process for self-supervised learning;
FIG. 5 schematically illustrates the process of generating the 3D posture of viewpoint A and the 3D posture of viewpoint B from a 2D input image of viewpoint A;
FIG. 6 schematically illustrates the process of generating the 3D posture of viewpoint B and the 3D posture of viewpoint A from a 2D input image of viewpoint B;
FIG. 7 is a flowchart of the 2D posture information acquisition process for self-supervised learning;
FIG. 8 is a diagram provided to explain a method of training the network for 3D posture estimation.
Hereinafter, the present invention will be described in more detail with reference to the drawings.
An embodiment of the present invention presents a self-supervised learning-based 3D human posture estimation method using multi-view images.
In AR/VR-based services, technology by which the device recognizes the user's current posture and shape is essential for interaction between the device and the user. In the embodiment of the present invention, a self-supervised deep learning approach is used to restore the 3D posture from only the image, 2D posture information, and camera parameters.
The embodiment of the present invention presents a self-supervised learning model that uses 2D human posture labels and camera parameters, without the hard-to-obtain 3D human posture label data.
FIG. 1 is a block diagram of a supervised deep learning model for posture estimation. Because supervised learning optimizes the model by backpropagating the error between the model output and the label data, the model can be trained only when inputs and labels exist in pairs.
Because the self-supervised learning model used in the embodiment of the present invention optimizes, given the camera parameters and the 2D human posture labels, the error between the model output and the 2D projection of the rotated output, the network can be optimized with only its own input data and 2D posture labels, without 3D label data. FIG. 2 is a block diagram illustrating the concept of the self-supervised deep learning model.
Furthermore, in the embodiment of the present invention, the deep learning model is trained by rotating the 3D posture estimates with different camera rotation matrices to convert them into 3D postures of the respective directions and then computing a self-loss function.
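Read concretely, this training signal can be formalized as follows (an editorial formalization; the notation and the choice of squared norms are assumptions, not taken from the patent). Let $\hat{X}_A$ and $\hat{X}_B$ be the 3D postures estimated from the viewpoint-A and viewpoint-B images, $R_{A \to B}$ the camera rotation matrix from viewpoint A to viewpoint B, $\Pi_K$ perspective projection with camera parameters $K$, and $y_B$ the 2D posture label of viewpoint B:

$$\mathcal{L}_{3D} = \left\| \hat{X}_B - R_{A \to B}\,\hat{X}_A \right\|^2, \qquad \mathcal{L}_{2D} = \left\| \Pi_K\!\left(R_{A \to B}\,\hat{X}_A\right) - y_B \right\|^2$$

Two symmetric terms cover the B-to-A direction, and the network is trained on the sum of all four, as detailed with reference to FIG. 8 below.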
FIG. 3 illustrates a 3D human posture estimation system according to an embodiment of the present invention. As shown, the 3D human posture estimation system according to an embodiment of the present invention comprises an input unit 110, an output unit 120, an estimator 130, a learning unit 140, a rotation unit 150, and a conversion unit 160.
The input unit 110 is the means for receiving training images and inference images. Two-dimensional multi-view images are input as training and inference images, and in training mode the 2D posture information of the 2D images is also input as labels.
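For concreteness, a single training sample in this setup might bundle the following fields (a minimal sketch; the field names, tensor shapes, and the 17-joint skeleton are illustrative assumptions, not taken from the patent):

```python
import torch

J = 17  # illustrative joint count; the patent does not fix a skeleton

sample = {
    "img_A": torch.zeros(3, 256, 256),         # RGB image from viewpoint A
    "img_B": torch.zeros(3, 256, 256),         # RGB image from viewpoint B
    "pose2d_A": torch.zeros(J, 2),             # 2D posture label, viewpoint A
    "pose2d_B": torch.zeros(J, 2),             # 2D posture label, viewpoint B
    "R_A": torch.eye(3), "R_B": torch.eye(3),  # world-to-camera rotations
    "K_A": torch.eye(3), "K_B": torch.eye(3),  # intrinsic camera matrices
}
```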
The estimator 130 is a deep learning network that estimates a 3D posture from the 2D image input through the input unit 110. The network extracts useful features using ResNet-50, which performs well at image feature detection, and estimates the 3D posture from these features through a fully connected layer.
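A minimal sketch of such an estimator in PyTorch, assuming a torchvision ResNet-50 backbone with its classifier replaced by a small fully connected head (the head sizes and joint count are illustrative; the patent does not specify them):

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseNet3D(nn.Module):
    """ResNet-50 feature extractor followed by a fully connected head that
    regresses a 3D posture as J joints x 3 coordinates."""

    def __init__(self, num_joints: int = 17):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled features
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, num_joints * 3),
        )
        self.num_joints = num_joints

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(img)            # (B, 2048)
        return self.head(feat).view(-1, self.num_joints, 3)
```

The same network weights serve both viewpoints, which is what allows the cross-view losses described below to constrain a single model.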
The output unit 120 displays the 3D posture information estimated by the estimator 130 on the input image and outputs it.
The rotation unit 150 rotates the 3D posture estimated by the estimator 130 into the posture of another viewpoint. To this end, the rotation unit 150 applies a camera rotation matrix to the estimated 3D posture, converting it into 3D postures of different viewpoints.
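One standard way to realize this rotation is sketched below; the patent only states that a camera rotation matrix converting one viewpoint to the other is applied, so the decomposition into world-to-camera rotations and the root-relative assumption (camera translation ignored) are editorial:

```python
import torch

def rotate_pose(pose: torch.Tensor, R_src: torch.Tensor, R_dst: torch.Tensor) -> torch.Tensor:
    """Rotate a (B, J, 3) batch of root-relative postures from the source
    camera's coordinates to the destination camera's coordinates.

    R_src, R_dst: (B, 3, 3) world-to-camera rotation matrices. The relative
    rotation is R = R_dst @ R_src^T, since R_src^T maps source-camera
    coordinates back to world coordinates."""
    R = R_dst @ R_src.transpose(-1, -2)
    return pose @ R.transpose(-1, -2)   # row-vector convention: (R x)^T = x^T R^T
```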
The conversion unit 160 converts the 3D posture rotated by the rotation unit 150 into a 2D posture using the intrinsic camera parameters.
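This conversion admits a standard pinhole-camera reading; the sketch below, continuing the illustrative helpers above, assumes the rotated posture is expressed in the target camera's coordinates with positive depth:

```python
def project_to_2d(pose: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of a (B, J, 3) posture with (B, 3, 3) intrinsics K;
    returns (B, J, 2) pixel coordinates. Assumes positive depth Z."""
    homog = pose @ K.transpose(-1, -2)   # rows: (fx*X + cx*Z, fy*Y + cy*Z, Z)
    return homog[..., :2] / homog[..., 2:3]
```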
The specific method by which the estimator 130, the rotation unit 150, and the conversion unit 160 obtain posture information is described in detail below with reference to FIGS. 4 to 7.
The learning unit 140 trains the deep learning network for 3D posture estimation using the 3D posture estimated by the estimator 130, the 3D posture rotated by the rotation unit 150, the 2D posture converted by the conversion unit 160, and the 2D posture labels input through the input unit 110.
To this end, the learning unit 140 computes a self-loss function between the 3D postures and a loss function for the 2D postures. The specific training method of the learning unit 140 is described in detail below with reference to FIG. 8.
FIG. 4 is a flowchart of the 3D posture information acquisition process for self-supervised learning.
As shown, the estimator 130 first estimates the 3D posture of viewpoint A from the 2D input image of viewpoint A received through the input unit 110, using the network for 3D posture estimation.
The rotation unit 150 then rotates the 3D posture of viewpoint A estimated by the estimator 130 to viewpoint B, converting it into a 3D posture of viewpoint B.
FIG. 5 schematically shows the process of generating the 3D posture of viewpoint A and the 3D posture of viewpoint B from the 2D input image of viewpoint A.
Likewise, as shown in FIG. 4, the estimator 130 estimates the 3D posture of viewpoint B from the 2D input image of viewpoint B received through the input unit 110, using the network for 3D posture estimation.
The rotation unit 150 then rotates the 3D posture of viewpoint B estimated by the estimator 130 to viewpoint A, converting it into a 3D posture of viewpoint A.
FIG. 6 schematically shows the process of generating the 3D posture of viewpoint B and the 3D posture of viewpoint A from the 2D input image of viewpoint B.
FIG. 7 is a flowchart of the 2D posture information acquisition process for self-supervised learning.
As shown, the conversion unit 160 first uses the camera parameters to convert the 3D posture of viewpoint B, which the rotation unit 150 rotated from viewpoint A to viewpoint B, into a 2D posture of viewpoint B.
Likewise, the conversion unit 160 uses the camera parameters to convert the 3D posture of viewpoint A, which the rotation unit 150 rotated from viewpoint B to viewpoint A, into a 2D posture of viewpoint A.
FIG. 8 is provided to explain how the learning unit of FIG. 3 trains the network for 3D posture estimation using the posture information generated through the processes of FIGS. 4 and 7.
As shown, the learning unit 140 first trains the network for 3D posture estimation using a self-loss function between the following two 3D postures:
1) the 3D posture 312 of viewpoint B, obtained by rotating the 3D posture 310 of viewpoint A estimated from the 2D posture 210 of viewpoint A; and
2) the 3D posture 320 of viewpoint B estimated from the 2D posture 220 of viewpoint B.
Likewise, the learning unit 140 trains the network for 3D posture estimation using a self-loss function between the following two 3D postures:
1) the 3D posture 310 of viewpoint A estimated from the 2D posture 210 of viewpoint A; and
2) the 3D posture 321 of viewpoint A, obtained by rotating the 3D posture 320 of viewpoint B estimated from the 2D posture 220 of viewpoint B.
In addition, the learning unit 140 further trains the network for 3D posture estimation using a loss function between the following two 2D postures:
1) the 2D posture 212 of viewpoint B, obtained by converting the 3D posture 312 of viewpoint B, which was itself obtained by rotating the 3D posture 310 of viewpoint A estimated from the 2D posture 210 of viewpoint A; and
2) the label 220' of the 2D posture 220 of viewpoint B.
The learning unit 140 also further trains the network for 3D posture estimation using a loss function between the following two 2D postures (a combined sketch of all four terms follows this list):
1) the 2D posture 221 of viewpoint A, obtained by converting the 3D posture 321 of viewpoint A, which was itself obtained by rotating the 3D posture 320 of viewpoint B estimated from the 2D posture 220 of viewpoint B; and
2) the label 210' of the 2D posture 210 of viewpoint A.
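Gathering the four loss terms, one training step of the learning unit could look like the sketch below. It reuses the illustrative PoseNet3D, rotate_pose, and project_to_2d helpers defined earlier, assumes the sample fields gain a leading batch dimension after DataLoader collation, and weights all terms equally, which the patent leaves open:

```python
import torch.nn.functional as F

def train_step(model, batch, optimizer):
    """One self-supervised update from a pair of synchronized views A and B."""
    pose_a = model(batch["img_A"])    # (B, J, 3) posture estimated for viewpoint A
    pose_b = model(batch["img_B"])    # (B, J, 3) posture estimated for viewpoint B

    pose_a_in_b = rotate_pose(pose_a, batch["R_A"], batch["R_B"])  # posture 312
    pose_b_in_a = rotate_pose(pose_b, batch["R_B"], batch["R_A"])  # posture 321

    # Self-loss between 3D postures: estimated vs. rotated-in from the other view
    loss_3d = F.mse_loss(pose_a_in_b, pose_b) + F.mse_loss(pose_b_in_a, pose_a)

    # 2D loss: projected, rotated postures vs. the 2D labels of the target view
    loss_2d = (F.mse_loss(project_to_2d(pose_a_in_b, batch["K_B"]), batch["pose2d_B"])
               + F.mse_loss(project_to_2d(pose_b_in_a, batch["K_A"]), batch["pose2d_A"]))

    loss = loss_3d + loss_2d          # equal weighting is an editorial assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```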
A preferred embodiment of the self-supervised learning-based 3D human posture estimation method using multi-view images has been described in detail above.
The embodiment above presents a general-purpose posture estimation model that replaces the supervised 3D human posture estimation approach, which depends on 3D label data, with a self-supervised learning approach that is relatively free of label data.
With the self-supervised learning approach, the 3D human posture estimation model that used to depend on label data can be trained from images, 2D postures, and camera parameters, and providing 3D postures of other viewpoints through camera rotation matrices removes the need for 3D human posture label data.
Meanwhile, the technical idea of the present invention can of course also be applied to a computer-readable recording medium containing a computer program for performing the functions of the apparatus and method according to the present embodiment. The technical ideas according to various embodiments of the present invention may also be implemented in the form of computer-readable code recorded on a computer-readable recording medium. The computer-readable recording medium may be any data storage device that can be read by a computer and can store data, for example, a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, or hard disk drive. Computer-readable code or programs stored on the computer-readable recording medium may also be transmitted over a network connecting computers.
While preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described; various modifications may be made by those of ordinary skill in the art without departing from the gist of the invention as claimed, and such modifications should not be understood separately from the technical spirit or outlook of the present invention.

Claims (12)

  1. A network learning method for three-dimensional posture estimation, comprising:
    a first estimation step of estimating a 3D posture of a first viewpoint from an input image of the first viewpoint, using a network for 3D posture estimation;
    a second estimation step of estimating a 3D posture of a second viewpoint from an input image of the second viewpoint, using the network for 3D posture estimation;
    a first rotation step of rotating the 3D posture of the first viewpoint to the second viewpoint; and
    training the network for 3D posture estimation using a loss function between the 3D posture obtained in the second estimation step and the 3D posture obtained in the first rotation step.
  2. The method according to claim 1, further comprising:
    a second rotation step of rotating the 3D posture of the second viewpoint to the first viewpoint; and
    training the network for 3D posture estimation using a loss function between the 3D posture obtained in the first estimation step and the 3D posture obtained in the second rotation step.
  3. The method according to claim 2, wherein:
    the first rotation step applies a camera rotation matrix that converts the camera's first viewpoint to the second viewpoint, rotating the 3D posture of the first viewpoint to the second viewpoint; and
    the second rotation step applies a camera rotation matrix that converts the camera's second viewpoint to the first viewpoint, rotating the 3D posture of the second viewpoint to the first viewpoint.
  4. The method according to claim 1, further comprising:
    a first conversion step of converting the 3D posture obtained in the first rotation step into a 2D posture of the second viewpoint; and
    training the network for 3D posture estimation using a loss function between the 2D posture of the second viewpoint obtained in the first conversion step and the 2D posture label of the input image of the second viewpoint.
  5. The method according to claim 4, further comprising:
    a second conversion step of converting the 3D posture obtained in the second rotation step into a 2D posture of the first viewpoint; and
    training the network for 3D posture estimation using a loss function between the 2D posture of the first viewpoint obtained in the second conversion step and the 2D posture label of the input image of the first viewpoint.
  6. The method according to claim 5, wherein:
    the first conversion step uses camera parameters to convert the 3D posture obtained in the first rotation step into the 2D posture of the second viewpoint; and
    the second conversion step uses camera parameters to convert the 3D posture obtained in the second rotation step into the 2D posture of the first viewpoint.
  7. The method according to claim 1, wherein the input image is a two-dimensional input image.
  8. The method according to claim 1, further comprising estimating a 3D posture by inputting an input image to the trained network for 3D posture estimation.
  9. The method according to claim 8, further comprising outputting the 3D posture estimated in the estimating step together with the input image.
  10. A three-dimensional posture estimation system, comprising:
    an estimator that, using a network for 3D posture estimation, estimates a 3D posture of a first viewpoint from an input image of the first viewpoint and estimates a 3D posture of a second viewpoint from an input image of the second viewpoint;
    a rotation unit that rotates the 3D posture of the first viewpoint to the second viewpoint; and
    a learning unit that trains the network for 3D posture estimation using a loss function between the 3D posture obtained by the estimator and the 3D posture obtained by the rotation unit.
  11. A three-dimensional posture estimation method, comprising:
    estimating a 3D posture by inputting an input image to a network for 3D posture estimation; and
    outputting information on the 3D posture estimated in the estimating step,
    wherein the network for 3D posture estimation has been trained by:
    estimating a 3D posture of a first viewpoint from an input image of the first viewpoint using the network,
    estimating a 3D posture of a second viewpoint from an input image of the second viewpoint using the network,
    rotating the 3D posture of the first viewpoint to the second viewpoint, and
    using a loss function between the 3D posture of the second viewpoint obtained through estimation and the 3D posture of the second viewpoint obtained through rotation.
  12. A three-dimensional posture estimation system comprising:
    an estimator configured to estimate a three-dimensional posture by inputting an input image into a network for three-dimensional posture estimation; and
    an output unit configured to output information about the three-dimensional posture estimated by the estimator,
    wherein the network for three-dimensional posture estimation is trained by:
    estimating a three-dimensional posture of a first viewpoint from an input image of the first viewpoint using the network,
    estimating a three-dimensional posture of a second viewpoint from an input image of the second viewpoint using the network,
    rotating the three-dimensional posture of the first viewpoint to the second viewpoint, and
    using a loss function between the three-dimensional posture of the second viewpoint obtained through estimation and the three-dimensional posture of the second viewpoint obtained through rotation.
PCT/KR2020/018365 2020-12-15 2020-12-15 Self-supervised learning-based three-dimensional human posture estimation method using multi-view images WO2022131390A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2020-0175602 2020-12-15
KR1020200175602A KR20220085491A (en) 2020-12-15 2020-12-15 Self-supervised learning based 3D human posture estimation method using multi-view images

Publications (1)

Publication Number Publication Date
WO2022131390A1 true WO2022131390A1 (en) 2022-06-23

Family

ID=82057648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/018365 WO2022131390A1 (en) 2020-12-15 2020-12-15 Self-supervised learning-based three-dimensional human posture estimation method using multi-view images

Country Status (2)

Country Link
KR (1) KR20220085491A (en)
WO (1) WO2022131390A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019029021A (en) * 2017-07-30 2019-02-21 国立大学法人 奈良先端科学技術大学院大学 Learning data set preparing method, as well as object recognition and position attitude estimation method
KR20190088379A (en) * 2018-01-18 2019-07-26 삼성전자주식회사 Pose estimating method, method of displaying virtual object using estimated pose and apparatuses performing the same
JP2019133658A (en) * 2018-01-31 2019-08-08 株式会社リコー Positioning method, positioning device and readable storage medium
US20200084427A1 (en) * 2018-09-12 2020-03-12 Nvidia Corporation Scene flow estimation using shared features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHANG, INHO; PARK, MIN-GYU; KIM, JAEWOO; YOON, JU HONG: "Self-Supervised 3D Human Pose Estimation Using Multi-View and Camera Parameter", Proceedings of the Institute of Electronics and Information Engineers (IEIE) Autumn Conference 2020, 1 November 2020, Korea, pages 421-423, XP009537606 *

Also Published As

Publication number Publication date
KR20220085491A (en) 2022-06-22

Similar Documents

Publication Publication Date Title
WO2021206284A1 (en) Depth estimation method and system using cycle gan and segmentation
WO2018190504A1 (en) Face pose correction apparatus and method
JP2021513175A (en) Data processing methods and devices, electronic devices and storage media
WO2023068795A1 (en) Device and method for creating metaverse using image analysis
WO2021261687A1 (en) Device and method for reconstructing three-dimensional human posture and shape model on basis of image
WO2023080266A1 (en) Face converting method and apparatus using deep learning network
WO2019190076A1 (en) Eye tracking method and terminal for performing same
CN107133611A Method and device for recognizing and counting students' nodding rate in a classroom
WO2015008932A1 (en) Digilog space creator for remote co-work in augmented reality and digilog space creation method using same
WO2022131390A1 (en) Self-supervised learning-based three-dimensional human posture estimation method using multi-view images
WO2022154523A1 (en) Method and device for matching three-dimensional oral scan data via deep-learning based 3d feature detection
WO2019117393A1 (en) Learning apparatus and method for depth information generation, depth information generation apparatus and method, and recording medium related thereto
WO2014010820A1 (en) Method and apparatus for estimating image motion using disparity information of a multi-view image
WO2021256640A1 (en) Device and method for reconstructing human posture and shape model on basis of multi-view image by using information on relative distance between joints
WO2023277448A1 (en) Artificial neural network model training method and system for image processing
WO2020130211A1 (en) Joint model registration device and method
WO2019103193A1 (en) System and method for acquiring 360 vr image in game using distributed virtual camera
WO2018199724A1 (en) Virtual reality system enabling bi-directional communication
WO2011040653A1 (en) Photography apparatus and method for providing a 3d object
WO2021118047A1 (en) Method and apparatus for evaluating accident fault in accident image by using deep learning
WO2023224169A1 (en) Three-dimensional skeleton estimation system and three-dimensional skeleton estimation method
WO2023120743A1 (en) Context-specific training method for point cloud-based three-dimensional object recognition model
WO2020116674A1 (en) Intelligent method and device for enhancing visual information for mobility-impaired person
WO2024117356A1 (en) Device and method for 3d reconstruction of human object based on monocular color image in real time
WO2023243785A1 (en) Virtual performance generation method and virtual performance generation system

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20966044; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20966044; Country of ref document: EP; Kind code of ref document: A1)