KR20230081378A

KR20230081378A - Multi-view semi-supervised learning for 3D human pose estimation

Info

Publication number: KR20230081378A
Application number: KR1020210169371A
Authority: KR
Inventors: 장주용; 김도엽
Original assignee: 광운대학교 산학협력단
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2023-06-07

Abstract

A multi-view semi-supervised learning system for a single-view model for 3D human pose estimation is disclosed. The system includes: a P-GT generation unit that receives multi-view images without ground truth (GT) and outputs pseudo ground truth (P-GT) for the multi-view images; a single-view model training unit that receives the P-GT generated by the P-GT generation unit together with the multi-view images and uses them to train a single-view model; and a 3D human pose estimation unit that uses the trained single-view model to estimate a 3D human pose from an input single-view image. The single-view image is fed to the trained single-view model of the 3D human pose estimation unit, which outputs the 3D human pose estimation result.
3D human pose estimation models are classified into multi-view models and single-view models. In general, multi-view models estimate poses more accurately than single-view models, and improving a single-view model requires a large amount of training data. We propose a method that generates pseudo ground truth for multi-view human pose data using a multi-view model and uses it to train a single-view model. We also propose a multi-view consistency loss that accounts for the consistency of the poses estimated from each view, and show that it helps train the single-view model effectively.

Description

Multi-view semi-supervised learning for 3D human pose estimation

The present invention relates to a multi-view semi-supervised learning system for a single-view model for 3D human pose estimation.

This research was supported in 2021 by the Institute of Information & Communications Technology Planning & Evaluation (IITP) with funding from the Korean government (Ministry of Science and ICT) (No. 2021-0-00348, Development of a cloud-based video surveillance system for unmanned store environments using integrated 2D/3D video analysis).

Extracting 3D skeleton information about a user quickly and reliably is one of the most important problems in computer vision. 3D skeleton extraction has a wide range of applications, including human-computer interaction (HCI), motion capture for computer graphics, gesture/action recognition for video surveillance, and health care. In particular, it has recently attracted considerable attention as a key technology for more natural user interfaces for smart devices and interactive digital content.

3D human pose estimation methods can be broadly divided into multi-view models and single-view models.

A multi-view model, which takes images of one pose from several camera viewpoints as input, can estimate the pose more accurately than a single-view model. This is because a multi-view model can learn, from multi-view images, to be robust to depth ambiguity and to occlusion that depends on the viewpoint.

A single-view model estimates the 3D human pose from a single-view image and has improved greatly with recent advances in deep learning. However, it remains more vulnerable to depth ambiguity and occlusion than a multi-view model. Improving a single-view model requires a large amount of curated data covering diverse viewpoints and poses, yet acquiring data with ground truth (GT) for 3D poses generally takes considerable time and cost.

As related prior art 1, Patent Publication No. 10-2015-0061488 discloses a "Method and apparatus for estimating 3D user posture".

The 3D user posture estimation apparatus extracts a user region corresponding to the user's body from a depth image of the user, determines the center point of each skeletal joint region by applying a randomized decision forest to each pixel in the user region, generates skeleton information from those center points, and then estimates the user's posture from the skeleton information.

The 3D user posture estimation method, performed by the 3D user posture estimation apparatus, includes: extracting a user region corresponding to the user's body from a depth image of the user; determining the center point of a skeletal joint region by applying a randomized decision forest to each pixel in the user region; and generating skeleton information using the center point of the skeletal joint region.

3D human pose estimation expresses the pose of a person in an image as joint coordinates in 3D space. 3D human pose estimation models can be classified into single-view models, which take a single-view image as input, and multi-view models, which take multi-view images as input.

FIG. 1 shows a rough example of a single-view model and a multi-view model. The single-view model takes an image from a single viewpoint and outputs a 3D human pose. It is vulnerable to depth ambiguity, which makes the depth of 3D joint coordinates hard to estimate, and it is not robust when the person in the image is occluded by other objects. In contrast, the multi-view model is trained on multi-view images and receives images of one pose from different viewpoints, so it learns to be robust to depth ambiguity and occlusion. The multi-view model can therefore estimate poses more accurately than the single-view model. Improving a single-view model requires a large amount of data covering diverse viewpoints and poses, but acquiring data with 3D pose ground truth (GT) takes considerable time and cost.

Patent Publication No. 10-2015-0061488 (published June 4, 2015), "Method and apparatus for estimating 3D user posture", Electronics and Telecommunications Research Institute (ETRI)

[1] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov, "Learnable triangulation of human pose," ICCV, 2019.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," NIPS, 2012.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CVPR, 2016.
[4] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, "Integral human pose regression," ECCV, 2018.
[5] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, 2013.
[6] S. Park and N. Kwak, "3D human pose estimation with relational networks," BMVC, 2018.
[7] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, "Coarse-to-fine volumetric prediction for single-image 3D human pose," CVPR, 2017.
[8] J. C. Gower, "Generalized procrustes analysis," Psychometrika, vol. 40, no. 2, 1975.
[9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," ICLR, 2015.
[10] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in pytorch," NIPS Workshops, 2017.

To solve the above problems, an object of the present invention is to provide a multi-view semi-supervised learning system for a single-view model for 3D human pose estimation that trains a single-view 3D human pose estimation system using camera-calibrated multi-view image data without GT. The 3D human pose estimation method is evaluated quantitatively and qualitatively, and the evaluation shows that the P-GT obtained from a multi-view model improves the training and performance of the single-view model.

To achieve the object of the present invention, the multi-view semi-supervised learning system for a single-view model for 3D human pose estimation includes: a P-GT generation unit that receives multi-view images without ground truth (GT) and outputs pseudo ground truth (P-GT) for the multi-view images; a single-view model training unit that receives the P-GT generated by the P-GT generation unit together with the multi-view images and uses them to train a single-view model; and a 3D human pose estimation unit that uses the single-view model trained by the single-view model training unit to estimate a 3D human pose from an input single-view image.

The single-view image is fed to the trained single-view model of the 3D human pose estimation unit, which outputs the 3D human pose estimation result.

The multi-view semi-supervised learning system for a single-view model for 3D human pose estimation according to the present invention evaluates the 3D human pose estimation method quantitatively and qualitatively, and the P-GT obtained from the multi-view model is used to train the single-view model and improve its performance.

This work proposes a semi-supervised learning method that uses a calibrated, unlabeled multi-view dataset to improve a single-view model for 3D human pose estimation. The proposed method generates P-GT by applying a multi-view model to the multi-view data and uses it to fine-tune the single-view model. We also propose a multi-view consistency loss that accounts for the consistency of the 3D human pose estimates across the multi-view input images. Experiments confirm that the P-GT generated by an existing pretrained multi-view model and the multi-view consistency loss improve the single-view model both quantitatively and qualitatively.

FIG. 1 shows a multi-view model and a single-view model.
FIG. 2 is an overview of the 3D human pose estimation system according to the present invention.
FIG. 3 is a block diagram of the 3D human pose estimation system according to an embodiment of the present invention.
FIG. 4 is a qualitative comparison of the baseline, the proposed method, and the GT.

Hereinafter, the configuration and operation of preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Detailed descriptions of related known technologies or configurations are omitted where they could unnecessarily obscure the gist of the present invention. The same reference numerals denote the same components across different drawings.

3D human pose estimation models can be classified into multi-view models and single-view models. In general, multi-view models estimate poses more accurately than single-view models. Improving the 3D pose estimation performance of a single-view model requires a large amount of training data, but ground truth for 3D poses is not easy to obtain. To address this problem, we propose a method that generates pseudo ground truth for multi-view human pose data using a multi-view model and uses it to train a single-view model. We also propose a multi-view consistency loss that accounts for the consistency of the poses estimated from each view, and show that it helps train the single-view model effectively.

In this work, we assume a calibrated multi-view dataset for which no 3D pose GT is provided (unlabeled), and propose a method that uses such a dataset to improve a single-view model. The basic idea is to apply a pretrained multi-view model [1] to the unlabeled multi-view dataset and, when training the single-view model, to use its estimates as pseudo ground truth (P-GT; pseudo-GT) for the 3D human poses corresponding to the multi-view images. We also propose a multi-view consistency loss that makes the single-view model's pose estimates for the multi-view images consistent with one another. This improves the single-view model's 3D depth estimation and its pose estimation under occlusion.

We evaluate the proposed 3D human pose estimation method quantitatively and qualitatively, and the results show that the P-GT obtained from the multi-view model can be used to train the single-view model and improve its performance.

The present invention proposes a semi-supervised learning system for a single-view 3D human pose estimation system that uses camera-calibrated multi-view image data without ground truth (GT).

FIG. 2 gives an overview of the semi-supervised learning system for the single-view 3D human pose estimation system using multi-view image data.

The multi-view semi-supervised learning system for a single-view model for 3D human pose estimation of the present invention includes:

a P-GT generation unit (102) that receives multi-view images (101) without ground truth (GT) and outputs pseudo ground truth (P-GT) for the multi-view images;

a single-view model training unit (103) that receives the P-GT generated by the P-GT generation unit (102) together with the multi-view images (101) and uses them to train a single-view model; and

a 3D human pose estimation unit (108) that uses the single-view model trained by the single-view model training unit (103) to estimate a 3D human pose from an input single-view image (107).

The single-view image (107) is fed to the trained single-view model of the 3D human pose estimation unit (108), which outputs the 3D human pose estimation result.

The detailed procedure of the proposed system is shown in the block diagram of FIG. 3. First, the multi-view images (101) without ground truth (GT) are input to the P-GT generation unit (102), which outputs pseudo ground truth (P-GT; pseudo-GT) for the multi-view images (101). Each P-GT consists of $J$ 3D joint coordinates and corresponds to the image captured from viewpoint $c$. The generated P-GT is input, together with the multi-view images, to the single-view model training unit (103) and used to train the single-view model. Finally, the single-view model trained on the multi-view images serves as the 3D human pose estimation unit (108): it receives a single-view image and outputs the system's 3D human pose estimation result.

The P-GT generation unit (102) generates a P-GT for the image of each viewpoint from the multi-view images (101). It consists of a pretrained multi-view model followed by a normalization step that produces the P-GT. The multi-view model receives the multi-view images (101) and is implemented as a 2D joint coordinate network, built from an artificial neural network, followed by algebraic triangulation. From its input, the multi-view model outputs $J$ 3D joint coordinates defined in the world coordinate system.

Next, to turn this output into a P-GT that can be used to train the single-view model, the 3D joint coordinates are normalized into the heatmap space corresponding to the image of each viewpoint $c$. Using the calibrated camera parameters, this consists of a perspective projection of the joints onto the image of viewpoint $c$, a normalization of the pixel coordinates, and a normalization of the depth coordinate. First, let $\mathbf{p}_j$ denote the $j$-th joint of the pose in the world coordinate system. For the perspective projection onto the image of viewpoint $c$, $\mathbf{p}_j$ is transformed into the camera coordinate system as follows:

$$\mathbf{p}^{\mathrm{cam}}_{c,j} = R_c\,\mathbf{p}_j + \mathbf{t}_c$$

In the above equation, $R_c$ and $\mathbf{t}_c$ are the extrinsic camera parameters for viewpoint $c$; they transform 3D coordinates in the world coordinate system into the camera coordinate system. Then, using the intrinsic camera parameters $K_c$, the 2D joint coordinates $(u_{c,j}, v_{c,j})$ obtained by perspective projection of $\mathbf{p}^{\mathrm{cam}}_{c,j}$ onto the image of viewpoint $c$ are computed as follows, where $z_{c,j}$ is the depth of the joint in the camera coordinate system:

$$z_{c,j}\begin{bmatrix} u_{c,j} \\ v_{c,j} \\ 1 \end{bmatrix} = K_c\,\mathbf{p}^{\mathrm{cam}}_{c,j}$$
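The two steps above (world-to-camera transform, then perspective projection) can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: it assumes the common convention in which the extrinsics act as a rotation followed by a translation, and the camera values, the joint position, and the function names `world_to_camera` and `project` are all hypothetical.

```python
def world_to_camera(point_w, R, t):
    """Apply X_cam = R @ X_world + t (assumed extrinsic convention)."""
    return [sum(R[i][k] * point_w[k] for k in range(3)) + t[i] for i in range(3)]


def project(point_cam, K):
    """Pinhole projection with a 3x3 intrinsic matrix K; returns (u, v, depth)."""
    x, y, z = point_cam
    u = (K[0][0] * x + K[0][1] * y + K[0][2] * z) / z
    v = (K[1][1] * y + K[1][2] * z) / z
    return u, v, z


# Hypothetical camera: no rotation, subject 4 m in front, focal length 1000 px.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 4000.0]
K = [[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]]

joint_world = [200.0, -400.0, 0.0]  # illustrative joint position in millimetres
joint_cam = world_to_camera(joint_world, R, t)
u, v, depth = project(joint_cam, K)
print(joint_cam, u, v, depth)  # [200.0, -400.0, 4000.0] 550.0 400.0 4000.0
```

If a dataset stores its extrinsics in a different form (for example as a camera centre rather than a translation), only `world_to_camera` would need to change.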

The resulting 2D pixel coordinates $(u_{c,j}, v_{c,j})$ are normalized by the following two equations:

$$\hat{x}_{c,j} = \frac{u_{c,j}}{4}, \qquad \hat{y}_{c,j} = \frac{v_{c,j}}{4}$$

Each pixel coordinate is divided by 4, the ratio between the input image size and the heatmap size, which gives the $(x, y)$ coordinates in a heatmap space four times smaller than the input image.

To normalize the depth coordinate of each joint, we first assume that the size of the human subject does not exceed a fixed bound. The depth of the pelvis joint is subtracted from the depth of each joint, giving a depth defined relative to the pelvis; under the size assumption, these relative depths lie within a bounded range. They are then rescaled so that the normalized depth lies within the depth-axis range of the heatmap. Finally, combining the normalized $(x, y)$ coordinates with the normalized depth gives the normalized joint coordinates.
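The normalization into heatmap space can be sketched as follows. Only the ratio of 4 comes from the text; the subject-size bound (`BODY_SIZE_MM`), the heatmap depth resolution (`DEPTH_BINS`), and the linear mapping of relative depth onto the depth axis are assumptions made for illustration, since the corresponding values are not available here.

```python
STRIDE = 4              # input-to-heatmap ratio stated in the text
BODY_SIZE_MM = 2000.0   # ASSUMED bound on the subject's extent
DEPTH_BINS = 64         # ASSUMED length of the heatmap depth axis


def normalize_joint(u, v, depth, pelvis_depth):
    """Map pixel coordinates and camera depth into heatmap coordinates."""
    x_hm = u / STRIDE
    y_hm = v / STRIDE
    rel = depth - pelvis_depth  # depth relative to the pelvis joint
    # Assumed linear map: [-BODY_SIZE_MM / 2, BODY_SIZE_MM / 2] -> [0, DEPTH_BINS]
    z_hm = (rel / BODY_SIZE_MM + 0.5) * DEPTH_BINS
    return x_hm, y_hm, z_hm


# A joint 250 mm behind the pelvis, and the pelvis itself.
joint_hm = normalize_joint(550.0, 400.0, 4250.0, 4000.0)
pelvis_hm = normalize_joint(500.0, 500.0, 4000.0, 4000.0)
print(joint_hm)   # (137.5, 100.0, 40.0)
print(pelvis_hm)  # (125.0, 125.0, 32.0)
```

With this mapping the pelvis always lands at the middle of the depth axis, which is one plausible reading of "relative to the pelvis"; other scalings would work equally well as long as training and inference use the same one.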

By applying this normalization to every joint and every viewpoint, the P-GT generation unit outputs a P-GT for each viewpoint of the multi-view images.

The single-view model training unit (103) trains the single-view model using the multi-view images (101) and the generated P-GT. Multi-view image data for which GT is available can be used alongside the P-GT. The single-view model receives each of the images that make up a multi-view sample and outputs a pose estimate for each. The single-view model consists of an encoder and a decoder, both artificial neural networks; the input passes through the encoder and the decoder to produce a 3D heatmap. Applying soft-argmax to the generated heatmap then yields the 3D coordinates of each joint. To train the model, an L1 loss is computed between the output joint coordinates and the P-GT or GT corresponding to the input image. In addition, to ensure that the poses obtained from the images of a multi-view sample all describe the same pose, each estimate is transformed into the world coordinate system and an L1 loss is computed between all the estimates. The transformation into the world coordinate system is the inverse of the normalization process. In the present invention, this loss that enforces consistency among the pose estimates is defined as the 'multi-view consistency loss'. The single-view model is trained by backpropagation on the weighted sum of the two losses.
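The combination of the two losses can be illustrated with a small, framework-free sketch. It assumes that "an L1 loss between all the estimates" means the average L1 distance over every pair of views, and the weight of 0.1 is an arbitrary placeholder; neither detail is specified in the text, and the function names are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute difference over all joints and coordinates."""
    diffs = [abs(p - q) for pj, tj in zip(pred, target) for p, q in zip(pj, tj)]
    return sum(diffs) / len(diffs)


def multiview_consistency_loss(world_poses):
    """Average pairwise L1 distance between per-view poses in world coordinates.

    Requires at least two views (ASSUMED pairwise formulation).
    """
    pairs = [(a, b) for i, a in enumerate(world_poses) for b in world_poses[i + 1:]]
    return sum(l1_loss(a, b) for a, b in pairs) / len(pairs)


def total_loss(preds, targets, world_poses, weight=0.1):
    """Supervised L1 (against P-GT or GT) plus weighted consistency term."""
    supervised = sum(l1_loss(p, t) for p, t in zip(preds, targets)) / len(preds)
    return supervised + weight * multiview_consistency_loss(world_poses)


# Two views, one joint each (toy numbers).
preds = [[[1.0, 2.0, 3.0]], [[1.0, 2.0, 5.0]]]          # heatmap-space estimates
targets = [[[1.0, 2.0, 4.0]], [[1.0, 2.0, 4.0]]]        # P-GT (or GT) per view
world_poses = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 3.0]]]    # estimates mapped to world space
consistency = multiview_consistency_loss(world_poses)
loss = total_loss(preds, targets, world_poses)
```

In a real training loop these quantities would be tensors in an automatic differentiation framework so that the weighted sum can be backpropagated; the plain lists here only show the arithmetic.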

The single-view model trained by the single-view model training unit (103) is used by the 3D human pose estimation unit (108) to estimate a 3D human pose from a single-view image (107). The single-view image (107) is fed to the trained single-view model of the 3D human pose estimation unit (108), which outputs the 3D human pose estimation result.

2. Proposed method

The procedure for using an unlabeled multi-view image dataset to improve the single-view model is roughly as follows. First, P-GT is generated with a pretrained multi-view model. Second, the single-view model is initialized from a model pretrained on a labeled dataset containing GT. Third, the single-view model is further trained on the unlabeled multi-view image dataset with P-GT together with the labeled dataset with GT. During this step, the single-view model is optimized with the proposed multi-view consistency loss.

2.1. Pre-training of the single-view model

The single-view model used for 3D human pose estimation in this work is an integral regression model [4] whose backbone is ResNet-50 [3] pretrained on ImageNet [2]. To build the single-view model, the global average pooling layer and the fully connected layer of ResNet-50 are replaced with three consecutive deconvolution layers and one 1x1 convolution layer, giving a fully convolutional network. The filter size and stride of each deconvolution layer are set to 4 and 2, respectively. The single-view model obtains the joint coordinates that make up the 3D human pose by applying the soft-argmax operation [4] to the output tensor of this network.
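A quick size calculation shows why this architecture produces heatmaps four times smaller than the input. The sketch below assumes a 256-pixel input and a padding of 1 in each deconvolution layer (neither is stated in the text) and uses the fact that ResNet-50 reduces spatial resolution by a factor of 32 before its pooling layer.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


input_size = 256             # ASSUMED input resolution
feat = input_size // 32      # ResNet-50 backbone output: 8
sizes = [feat]
for _ in range(3):           # three deconvolution layers, kernel 4, stride 2
    feat = deconv_out(feat)
    sizes.append(feat)
heatmap_size = feat

print(sizes)                       # [8, 16, 32, 64]
print(input_size // heatmap_size)  # 4
```

Each deconvolution layer doubles the resolution, so three of them recover a factor of 8 from the backbone's factor of 32, leaving the overall ratio of 4 between image and heatmap.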

Specifically, the network takes an input image and outputs 3D heatmaps for $J$ joints. Soft-argmax is then applied to each heatmap to obtain the 3D coordinates of each joint.

The soft-argmax operation obtains spatial coordinates by turning a heatmap into a probability distribution and computing its expectation. The following equation shows how the softmax operation turns the 3D heatmap $H_j$ of the $j$-th joint into a probability distribution $\tilde{H}_j$, where $\mathbf{q}$ ranges over the set $\Omega$ of voxel positions of the heatmap:

$$\tilde{H}_j(\mathbf{q}) = \frac{\exp\left(H_j(\mathbf{q})\right)}{\sum_{\mathbf{q}' \in \Omega} \exp\left(H_j(\mathbf{q}')\right)} \qquad (1)$$

Then the following expectation is applied to $\tilde{H}_j$ to obtain the keypoint coordinates $\hat{\mathbf{p}}_j$:

$$\hat{\mathbf{p}}_j = \sum_{\mathbf{q} \in \Omega} \mathbf{q}\,\tilde{H}_j(\mathbf{q}) \qquad (2)$$

Applying equation (2) to every joint gives the 3D human pose. To pretrain the single-view model, an L1 loss between the pose output by the model and the GT of the labeled dataset is used.
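The soft-argmax operation described in this section (softmax over the heatmap, then an expectation over voxel positions) can be written out in a few lines. This is a minimal pure-Python sketch for a single joint; the nested-list layout `heatmap[z][y][x]` and the function name are illustrative choices, and a practical implementation would use batched tensor operations instead.

```python
import math


def soft_argmax_3d(heatmap):
    """Return the expected (x, y, z) position under softmax(heatmap).

    heatmap is indexed as heatmap[z][y][x].
    """
    flat = [v for plane in heatmap for row in plane for v in row]
    m = max(flat)  # subtract the maximum for numerical stability
    norm = sum(math.exp(v - m) for v in flat)
    ex = ey = ez = 0.0
    for z, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, v in enumerate(row):
                p = math.exp(v - m) / norm  # softmax probability of this voxel
                ex += p * x
                ey += p * y
                ez += p * z
    return ex, ey, ez


# A flat heatmap gives the centre of the volume.
uniform = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
centre = soft_argmax_3d(uniform)  # (0.5, 0.5, 0.5)

# A sharply peaked heatmap gives (almost exactly) the peak location.
peaked = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 50.0], [0.0, 0.0]]]
peak = soft_argmax_3d(peaked)     # close to (1.0, 0.0, 1.0)
```

Unlike a hard argmax, the result varies smoothly with the heatmap values, which is what allows a coordinate loss such as L1 to be backpropagated through the heatmap.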

2.2. P-GT dataset generation

To make use of an unlabeled multi-view image dataset, we estimate the coordinates of the joints with a pretrained multi-view model. As the multi-view model, we pre-train and use the algebraic triangulation model of [1].

(3)

Here, the input of the multi-view model is one sample of the unlabeled multi-view image dataset. The sample consists of the images captured at each view c, and the number of images equals the number of different views that observe one pose. The 3D human pose estimated in Equation (3) is defined in the world coordinate system.

To use the estimated pose for training the single-view model, we normalize it into the heatmap space that corresponds to the image of each view c. The normalization consists of a perspective projection of the pose onto that image, followed by normalization of the pixel coordinates and normalization of the depth coordinates.

The following equations project the joint coordinates onto the image to obtain 2D coordinates and then normalize the resulting pixel coordinates:

(4)

(5)

(6)

Equation (4) uses the extrinsic parameters of camera c, and Equation (5) uses its intrinsic parameters. Through Equation (4), the joint coordinates defined in the world coordinate system are converted into joint coordinates defined in the camera coordinate system. These are then projected to pixel coordinates through Equation (5). Finally, Equation (6) yields the normalized 2D coordinates using 4, the ratio between the image size and the heatmap size. We also normalize the depth coordinate of each joint in the camera coordinate system as follows:

(7)

To normalize the depth coordinate of each joint, we first assume that the size of the human subject does not exceed a fixed bound. We subtract the depth of the pelvis joint from the depth of each joint to obtain a depth defined relative to the pelvis joint. Because this relative depth lies in a range set by that bound, we additionally rescale it so that the normalized depth lies within the depth-axis range of the heatmap. This process is given by Equation (7), which yields the normalized depth. The P-GT joint coordinates defined in the heatmap space corresponding to view c are then:

(8)
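The typeset forms of Equations (4)-(8) are missing from this text, so the block below is only a hedged reconstruction from the surrounding description. The notation is assumed: $\mathbf{P}_j$ is the world-space position of joint $j$, $R_c$ and $\mathbf{t}_c$ are the extrinsic parameters of camera $c$ (the convention $R_c\mathbf{P} + \mathbf{t}_c$ is itself an assumption), $K_c$ is the intrinsic matrix, $S$ is the assumed bound on the subject size, and $D$ is the depth resolution of the heatmap. Equation (7) in particular is one possible form, since the bound and the ranges are not given.

```latex
\mathbf{P}^{c}_{j} = R_c\,\mathbf{P}_j + \mathbf{t}_c,
  \qquad \mathbf{P}^{c}_{j} = \big(X^{c}_{j},\, Y^{c}_{j},\, Z^{c}_{j}\big)^{\top} \tag{4}

Z^{c}_{j}\,\big(u^{c}_{j},\, v^{c}_{j},\, 1\big)^{\top} = K_c\,\mathbf{P}^{c}_{j} \tag{5}

\big(x^{c}_{j},\, y^{c}_{j}\big) = \tfrac{1}{4}\,\big(u^{c}_{j},\, v^{c}_{j}\big) \tag{6}

z^{c}_{j} = \left(\frac{Z^{c}_{j} - Z^{c}_{\mathrm{pelvis}}}{S} + \frac{1}{2}\right) D \tag{7}

\bar{\mathbf{P}}^{c}_{j} = \big(x^{c}_{j},\, y^{c}_{j},\, z^{c}_{j}\big)^{\top} \tag{8}
```

With this form of (7), a pelvis-relative depth in $[-S/2,\, S/2]$ maps linearly onto $[0,\, D]$, which matches the description of a bounded relative depth being rescaled to the depth axis of the heatmap.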

By applying Equations (4)-(8) at every view c, we obtain the P-GT for an unlabeled sample, and we use these P-GTs to build the P-GT dataset.
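The normalization of a triangulated pose into the heatmap space of one camera can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the extrinsic convention `X_cam = R X + t`, zero skew in `K`, the pelvis index, `body_size=2000.0`, and `depth_bins=64` are all hypothetical values chosen for the example. Only the stride of 4 comes from the text.

```python
def world_to_heatmap(joints_world, R, t, K, pelvis_index=0,
                     body_size=2000.0, depth_bins=64, stride=4):
    """Map world-space joints to heatmap-space (x, y, z) for one camera."""
    # World -> camera coordinates (assumed convention: X_cam = R X + t).
    cam = [[sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3)]
           for p in joints_world]
    pelvis_depth = cam[pelvis_index][2]
    out = []
    for X, Y, Z in cam:
        # Perspective projection with the intrinsics (zero skew assumed).
        u = K[0][0] * X / Z + K[0][2]
        v = K[1][1] * Y / Z + K[1][2]
        # Pixel -> heatmap coordinates using the image/heatmap ratio of 4.
        x, y = u / stride, v / stride
        # Pelvis-relative depth rescaled to the heatmap depth axis.
        z = ((Z - pelvis_depth) / body_size + 0.5) * depth_bins
        out.append((x, y, z))
    return out


# Hypothetical camera at the world origin looking along +Z.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
K = [[1000.0, 0.0, 128.0], [0.0, 1000.0, 128.0], [0.0, 0.0, 1.0]]
pose = [(0.0, 0.0, 4000.0), (200.0, -100.0, 4500.0)]  # pelvis first
print(world_to_heatmap(pose, R, t, K))
```

Running the same function once per camera gives one heatmap-space target per view, which is the per-view P-GT described above.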

2.3. Multi-view consistency loss function

The proposed method trains a single-view model using multi-view image data. When the single-view model receives multi-view images that correspond to one pose, the human poses it estimates from the individual views must be consistent with one another. We therefore propose a multi-view consistency loss function that pushes the trained model to satisfy this condition. The loss converts the joint coordinates estimated for each view into the world coordinate system and is defined as the sum of the L1 loss functions between the resulting poses. Joint coordinates normalized to the heatmap space are converted into the world coordinate system by inverting Equations (4)-(8). The multi-view consistency loss function is:

(9)

Here, the two arguments are the normalized joint coordinates estimated by the single-view model at two different views, and each is passed through a function that converts normalized joint coordinates into the world coordinate system.
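A minimal sketch of the consistency term follows, assuming the per-view estimates have already been converted to the world coordinate system. The function name, the use of unordered view pairs (each pair counted once), and the absence of any normalization are assumptions; the text only states that the loss is a sum of L1 losses between the per-view poses.

```python
from itertools import combinations


def multiview_consistency_loss(world_poses):
    """Sum of L1 distances between every pair of per-view poses.

    world_poses[c][j] is the (x, y, z) world-space estimate of joint j from view c.
    """
    loss = 0.0
    # Visit each unordered pair of views once (an assumption about the pairing).
    for pose_a, pose_b in combinations(world_poses, 2):
        for joint_a, joint_b in zip(pose_a, pose_b):
            loss += sum(abs(a - b) for a, b in zip(joint_a, joint_b))
    return loss


# Three views of a one-joint pose; the third view disagrees by 1 unit along x.
views = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
print(multiview_consistency_loss(views))  # prints: 2.0
```

The loss is zero only when all views agree, so minimizing it pulls the per-view estimates toward a single shared pose without requiring any ground truth.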

We define the loss function used to fine-tune the pretrained single-view model as follows:

(10)

Here, the supervised term is defined as the sum of the L1 loss functions, based on the P-GT and the GT, that are applied to the output of the single-view model, and the two weights determine the influence of each loss function. To fine-tune the single-view model, we minimize Equation (10).
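Equations (9) and (10) are not legible in this text. A hedged reconstruction that agrees with the verbal definitions is given below. The notation is assumed: $\bar{\mathbf{P}}^{c}$ is the normalized pose estimated at view $c$, $T_c$ converts it to the world coordinate system, $C$ is the number of views, $\mathcal{L}_{\mathrm{pose}}$ is the L1 term based on P-GT and GT, and $\lambda_1$, $\lambda_2$ are the two weights. Whether the pairs in (9) are ordered or unordered is not stated; unordered pairs are assumed here.

```latex
\mathcal{L}_{\mathrm{con}} =
  \sum_{c=1}^{C} \sum_{c'=c+1}^{C}
  \left\lVert\, T_c\big(\bar{\mathbf{P}}^{c}\big) - T_{c'}\big(\bar{\mathbf{P}}^{c'}\big) \right\rVert_1 \tag{9}

\mathcal{L} = \lambda_1\, \mathcal{L}_{\mathrm{pose}} + \lambda_2\, \mathcal{L}_{\mathrm{con}} \tag{10}
```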

3. Experimental results

3.1. Datasets, evaluation metrics, and implementation details

We train and evaluate the proposed method on Human3.6M [5], a large-scale dataset of 3D human poses. In Human3.6M, each human subject (hereafter S) performs 15 actions, and each performance is recorded by cameras at four different viewpoints. Following the training and evaluation protocol of previous work [6, 7], we use the data of 5 of the 11 subjects (S1, S5, S6, S7, S8) as the training dataset. Of these, the data of three subjects (S1, S5, S6) are treated as the labeled dataset and the data of two subjects (S7, S8) as the unlabeled dataset. The data of two other subjects (S9, S11) are used as the evaluation dataset, which we sub-sample every 64 frames.

To evaluate the single-view model quantitatively, we measure and report MPJPE and PA-MPJPE. MPJPE is the mean Euclidean distance between the joints estimated by the single-view model on the evaluation dataset and their GT. PA-MPJPE is the MPJPE computed after Procrustes alignment [8] between the estimated joints and the GT. Both metrics are reported in millimeters (mm).
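As a concrete reading of the first metric, the sketch below computes MPJPE for a single pose using only the standard library. The function name and the per-pose scope are choices made for this example. PA-MPJPE is not shown, because it additionally requires a Procrustes alignment step before the same distance is computed.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and GT joints of one pose."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


# Two joints: one exact, one off by a 3-4-5 triangle, so the mean error is 2.5.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # prints: 2.5
```

A dataset-level score would average this value over all evaluated poses.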

The multi-view model and the single-view model are both pre-trained on the labeled dataset. For training efficiency, when generating the P-GT dataset we store the pose estimates of the pretrained multi-view model offline and then use them when fine-tuning the single-view model.

The multi-view model is pre-trained on the labeled dataset with 6 epochs and a batch size of 8. The single-view model is pre-trained on the same dataset with 20 epochs and a batch size of 32. The learning-rate values for both models are missing from this text. Both models are pre-trained with the Adam optimizer [9].

To fine-tune the single-view model, we train for 9 epochs on the labeled dataset and the P-GT dataset with a batch size of 6 and the Adam optimizer. The learning rate and the two loss weights of Equation (10) are also set at this stage, but their values are missing from this text. The proposed model is implemented in the PyTorch [10] framework.

3.2 Quantitative results

To show that the proposed method improves the performance of the single-view model, we quantitatively compare it (Ours) with two baseline models. The first baseline (Base) is the single-view model trained on the GT dataset. The second (L1-only) applies only the L1 loss function between the P-GT and the estimate for each view, without the multi-view consistency loss function.

Table 1 compares the quantitative performance of the baselines and the proposed method. To show the improvement on multi-view data, we report results for each camera view of the Human3.6M evaluation dataset. First, the L1-only baseline outperforms Base, which shows that the P-GT generated by the existing multi-view model is accurate enough to help training. In addition, the proposed method outperforms both baselines for every camera view. These results confirm that the P-GT dataset and the multi-view consistency loss function can substantially improve the 3D human pose estimation performance of the single-view model.

3.3 Qualitative results

FIG. 4 shows a qualitative comparison of the baseline, the proposed method, and the GT.

4. Conclusion

Embodiments of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, and data structures, alone or in combination. Computer-readable recording media may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM, RAM, flash memory, and other storage. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that a computer can execute using an interpreter. The hardware device may be configured to operate as one or more software modules to perform the operations of the present invention.

As described above, the method of the present invention may be implemented as a program and stored on a recording medium (CD-ROM, RAM, ROM, memory card, hard disk, magneto-optical disk, storage device, etc.) in a form readable by computer software.

Although the present invention has been described with reference to specific embodiments, it is not limited to the exact configuration and operation of those embodiments, which serve only to illustrate its technical idea. The invention may be practiced with various modifications without departing from its technical spirit and scope, and the scope of the invention should be determined by the claims set out below.

101: multi-view image 102: P-GT generation unit (multi-view model)
103: single-view model learning unit 107: single-view image
108: 3D human pose estimation unit (trained single-view model)
109: 3D human pose estimation result

Claims

A multi-view semi-supervised learning system of a single-view model for 3D human pose estimation, comprising:
a P-GT generation unit that receives multi-view images without ground truth (GT) and outputs a pseudo ground truth (P-GT) for the multi-view images;
a single-view model learning unit that receives the P-GT generated by the P-GT generation unit together with the multi-view images and uses them to train a single-view model; and
a 3D human pose estimation unit that uses the single-view model trained by the single-view model learning unit to estimate a 3D human pose from an input single-view image,
wherein the single-view image is input to the trained single-view model of the 3D human pose estimation unit, which outputs a 3D human pose estimation result.

The system according to claim 1, wherein the P-GT consists of the coordinates of J three-dimensional joints and is the pseudo ground truth corresponding to the image captured at each view, and the generated P-GT is input to the single-view model learning unit together with the multi-view images and used to train the single-view model.

The system according to claim 1, wherein
the P-GT generation unit comprises a pretrained multi-view model and a P-GT generation process based on normalization, and generates a P-GT for the image of each view from the multi-view images,
the multi-view model receives the multi-view images, is implemented with a 2D joint coordinate output network composed of an artificial neural network together with an algebraic triangulation method, and outputs 3D joint coordinates,
to generate from the 3D joint coordinates a P-GT usable for training the single-view model, the 3D joint coordinates are normalized to the heatmap space corresponding to the image of each view using calibrated camera parameters, the normalization consisting of a perspective projection of the 3D joint coordinates onto the image, normalization of the pixel coordinates, and normalization of the depth coordinates,
for the perspective projection, the j-th joint of the 3D joint coordinates is first converted from the world coordinate system to the camera coordinate system using the extrinsic parameters of the camera for view c, and
the converted joint is then perspective-projected to 2D joint coordinates using the intrinsic parameters of the camera.

The system according to claim 3, wherein
the acquired 2D coordinates are normalized using the size of the input image and the size of the heatmap,
the depth coordinates are normalized by assuming that the size of the human subject does not exceed a given bound, subtracting the depth of the pelvis joint from the depth of each joint to obtain a depth defined relative to the pelvis joint, and rescaling that depth, which lies in a range set by the bound, so that the normalized depth lies within the depth-axis range of the heatmap,
the normalized 2D coordinates and the normalized depth are combined to obtain the normalized joint coordinates, and
by performing this normalization for every joint and every view, the P-GT generation unit outputs the P-GT for each view of the multi-view images.

The system according to claim 4, wherein
the single-view model learning unit trains the single-view model using the multi-view images and the generated P-GT, and can use not only the P-GT but also multi-view image data for which GT is available,
the single-view model receives each image constituting the multi-view images and outputs a pose estimation result for it, and
the single-view model comprises an encoder and a decoder each composed of an artificial neural network, generates a 3D heatmap from the input image through the encoder and the decoder, and finally outputs the 3D coordinates of each joint by applying soft-argmax to the generated heatmap.

The system according to claim 1, wherein
the single-view model is trained using multi-view image data,
a labeled dataset is used to pre-train the multi-view model, with the number of epochs set to 6, the batch size set to 8, and the learning rate set to a predetermined value,
the same training dataset is used to pre-train the single-view model, with the number of epochs set to 20, the batch size set to 32, and the learning rate set to a predetermined value, and
Adam is used as the optimizer for pre-training the two models.

The system according to claim 1, wherein
for training the single-view model, an L1 loss function is computed between the output joint coordinates and the P-GT or GT corresponding to the input image,
the loss function for fine-tuning the pretrained single-view model is defined by Equation (10) as the weighted sum of two loss functions, each weight determining the influence of the corresponding loss function,
the first loss function is the sum of the L1 loss functions, based on the P-GT and the GT, applied to the output of the single-view model,
the second loss function, defined as a 'multi-view consistency loss function' that enforces the consistency of the pose estimation results, is obtained by converting each estimation result of the model into the world coordinate system and computing an L1 loss function between all estimation results, so that all human poses obtained from the individual images of the multi-view images represent the same pose, and
the single-view model is trained by a backpropagation algorithm on the weighted sum of the two loss functions.