KR20220024517A

KR20220024517A - 3D scene reconstruction from 2D images

Info

Publication number: KR20220024517A
Application number: KR1020227001123A
Authority: KR
Inventors: Riza Alp Guler; Georgios Papandreou; Iasonas Kokkinos
Original assignee: Ariel AI, Inc.
Priority date: 2019-06-17
Filing date: 2020-06-17
Publication date: 2022-03-03
Also published as: WO2020254448A1; EP3983996A1; US20220358770A1; CN114026599A

Abstract

This specification relates to reconstructing three-dimensional (3D) scenes from two-dimensional (2D) images using a neural network. According to a first aspect of this specification, a method is described for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the method comprising: receiving a single two-dimensional image; identifying all objects in the image to be reconstructed and identifying the type of said objects; estimating a three-dimensional representation of each identified object; estimating a three-dimensional plane that physically supports all three-dimensional objects; and positioning all three-dimensional objects in space relative to the supporting plane.

Description

3D scene reconstruction from 2D images

This specification relates to reconstructing three-dimensional (3D) scenes from two-dimensional (2D) images using a neural network.

Aside from the easier problem of faces, there is little to no work on integrating 3D human bodies (or other objects) into augmented reality applications in real time. Bodies are generally harder than faces because of their articulated deformations, multiple occlusions and self-occlusions, complex interactions with other objects and persons, and wider appearance variability due to clothing.

Human pose estimation algorithms typically aim at localizing certain sparse points in images, such as skeleton joints and facial landmarks, or more recently dense, surface-level coordinates. Despite their increasing accuracy, such representations fall short of serving downstream applications such as augmented reality, motion capture, gaming, or graphics. These applications require access to the underlying human body surface in three dimensions, and currently rely on multi-camera setups or depth sensors.

Work on morphable models has demonstrated that accurate monocular surface reconstruction can be performed by using a low-dimensional parameterization of the face surface and appearance, and casting the reconstruction task as an optimization problem. Extending this to the more complex, articulated structure of the human body, monocular human body reconstruction was studied extensively over the previous decade in conjunction with part-based representations, sampling-based inference, spatio-temporal inference and bottom-up/top-down methods. Monocular 3D reconstruction has witnessed a renaissance in the context of deep learning, both for general categories and for humans specifically. Previous works rely on the efficient parameterization of the human body in terms of skinned linear models, and in particular the Skinned Multi-Person Linear (SMPL) model. Exploiting the fact that the SMPL model provides a low-dimensional, differentiable representation of the human body, these works have trained systems to regress model parameters by minimizing the reprojection error between SMPL-based 3D keypoints and 2D joint annotations, human segmentation masks and 3D volume projections, or even by refining to the level of body parts.

In parallel, 3D human joint estimation has seen a dramatic rise in accuracy by moving from classical structure-from-motion approaches to 3D convolutional neural network (CNN) based architectures that directly localize 3D joints in a volumetric output space through hybrids of classification and regression.

Finally, recent work on dense pose estimation has shown that dense correspondences between RGB images and the SMPL model can be estimated by training a generic bottom-up detection system to associate image pixels with surface-level UV coordinates. It should be noted that, even though dense pose establishes a direct link between images and surfaces, it does not uncover the underlying 3D geometry, but rather gives a strong hint about it. Other recent works rely on parametric human models of shape, such as SMPL, which allow the 3D human body surface to be described in terms of a low-dimensional parameter vector. Most of these works train a CNN to regress the deformable model parameters and update them in an iterative, time-consuming optimization process.

According to a first aspect of this specification, a method is described for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the method comprising: receiving a single two-dimensional image; identifying all objects in the image to be reconstructed and identifying the type of said objects; estimating a three-dimensional representation of each identified object; estimating a three-dimensional plane that physically supports all three-dimensional objects; and positioning all three-dimensional objects in space relative to the supporting plane.

The step of estimating the three-dimensional representation may be performed in a deep machine learning model that comprises an output layer and one or more hidden layers, each of which applies a non-linear transformation to a received input to generate an output. The deep machine learning model predicts three-dimensional landmark positions of multiple objects by concatenating feature data from one or more intermediate layers of a neural network, and the three-dimensional positions are estimated simultaneously for the predicted type of object depicted in each region.

The step of estimating the plane supporting the multiple objects may be performed for a single frame, using the estimated three-dimensional positions of all visible objects to reconstruct a two-dimensional plane that passes through all of them. The step may also be performed for a sequence of frames, using relative camera pose estimation and plane localization based on correspondences between points in successive frames.

The receiving step may further comprise receiving a plurality of images, wherein the steps of estimating the three-dimensional representations of the multiple objects and positioning them are performed for each received image, for example in real time. Processing may take place over multiple consecutive frames by combining the hidden layer responses of consecutive frames, for example by averaging them.

Digital graphics objects may be synthetically added to the three-dimensional scene reconstruction in a given relation to the estimated three-dimensional object positions, and then projected back onto the two-dimensional image.
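The back-projection step can be sketched as follows. This is a minimal illustration only: it assumes a pinhole camera, and the function names, focal length, principal point and coordinates are hypothetical values that do not come from the specification.

```python
def project_point(point_3d, focal_px, principal_point):
    """Pinhole perspective projection of a camera-space point (x, y, z) to pixels."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    cx, cy = principal_point
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def place_relative(anchor_3d, offset_3d):
    """Position a graphics object at a fixed offset from an estimated 3D landmark."""
    return tuple(a + o for a, o in zip(anchor_3d, offset_3d))


# Hypothetical values: an estimated hand position and a ball held 0.1 units
# above it (the image y axis is assumed to point down, so "above" is negative y).
hand = (0.2, -0.1, 2.0)
ball = place_relative(hand, (0.0, -0.1, 0.0))
ball_px = project_point(ball, focal_px=500.0, principal_point=(320.0, 240.0))
```

Keeping the graphics object in 3D until the final projection means that its apparent size and position follow the estimated object automatically as the depth changes.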

According to a further aspect of this specification, a computation unit is described for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the computation unit comprising: a memory; and at least one processor, wherein the at least one processor is configured to perform the method according to the first aspect.

According to a further aspect of this specification, a computer-readable medium is described that stores a set of instructions executable by at least one processor of a computation unit for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the instructions causing the computation unit to perform the method according to the first aspect.

According to a further aspect of this specification, a computer program product is described comprising instructions which, when executed by a computer, cause the computer to perform the method according to the first aspect.

According to an additional aspect of this specification, a method is described for training a deep machine learning model to generate a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the method comprising: receiving a single two-dimensional image; obtaining a training signal for three-dimensional reconstruction by adapting a three-dimensional model of an object to the two-dimensional image; and using the resulting three-dimensional model fitting results as a supervision signal for training the deep machine learning model.

Fitting the three-dimensional model may be performed by: projecting the three-dimensional representation onto the two-dimensional image plane to generate a projected representation; comparing respective positions of the projected representation with the object in the single two-dimensional image; measuring an error value based on the comparison; and adjusting parameters of the fused three-dimensional representation based on the error value, wherein the comparing, measuring and adjusting are repeated iteratively until a threshold condition is satisfied. The threshold condition may be that the measured error value falls below a predetermined threshold value, or that a threshold number of iterations is exceeded. The projecting step may be performed by taking into account the effects of perspective projection and, where multiple views of the same object are available, by exploiting those views.
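The project, compare, measure and adjust loop can be sketched as below. This is an illustrative toy rather than the fitting procedure of the specification: the model has only three translation parameters, the optimizer (numerical gradient descent with step halving) is an assumption chosen for brevity, and all names and values are hypothetical.

```python
def reprojection_error(params, template, observed_2d, focal=1.0):
    """Mean squared distance between projected model points and 2D observations."""
    tx, ty, tz = params
    total = 0.0
    for (x, y, z), (ou, ov) in zip(template, observed_2d):
        zc = z + tz
        if zc <= 0.0:  # point behind the camera: reject this parameter setting
            return float("inf")
        u, v = focal * (x + tx) / zc, focal * (y + ty) / zc  # perspective projection
        total += (u - ou) ** 2 + (v - ov) ** 2
    return total / len(template)


def fit_model(template, observed_2d, init, threshold=1e-8, max_iters=200, eps=1e-6):
    """Repeat compare/measure/adjust until the error or iteration threshold is met."""
    params = list(init)
    error = reprojection_error(params, template, observed_2d)
    iters = 0
    while error >= threshold and iters < max_iters:
        grad = []
        for i in range(len(params)):  # central-difference gradient of the error
            plus, minus = params[:], params[:]
            plus[i] += eps
            minus[i] -= eps
            grad.append((reprojection_error(plus, template, observed_2d)
                         - reprojection_error(minus, template, observed_2d)) / (2 * eps))
        step, improved = 1.0, False
        while step > 1e-12:  # halve the step until the error actually decreases
            candidate = [p - step * g for p, g in zip(params, grad)]
            cand_error = reprojection_error(candidate, template, observed_2d)
            if cand_error < error:
                params, error, improved = candidate, cand_error, True
                break
            step *= 0.5
        iters += 1
        if not improved:
            break
    return params, error, iters


# Hypothetical 3D template and 2D observations generated from a known translation.
template = [(-0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 1.0, 0.2), (0.0, -1.0, -0.2)]
observed = [((x + 0.3) / (z + 4.0), (y - 0.2) / (z + 4.0)) for x, y, z in template]
initial = (0.0, 0.0, 3.0)
initial_error = reprojection_error(initial, template, observed)
params, final_error, iters = fit_model(template, observed, initial)
```

Because a step is accepted only when it lowers the error, the measured error never increases between iterations, and the loop stops on whichever threshold condition is reached first.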

According to a further aspect of this specification, a computation unit is described for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the computation unit comprising: a memory; and at least one processor, wherein the at least one processor is configured to perform the method according to the additional aspect.

According to a further aspect of this specification, a computer-readable medium is described that stores a set of instructions executable by at least one processor of a computation unit for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the instructions causing the computation unit to perform the method according to the additional aspect.

According to a further aspect of this specification, a computer program product is described comprising instructions which, when executed by a computer, cause the computer to perform the method according to the additional aspect.

According to a further aspect of this specification, a system is described for providing a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the system comprising: a first computation unit for performing any of the methods of the first aspect; and a second computation unit configured to perform the method according to the additional aspect, wherein the second unit is configured to train a model with results of the first computation unit.

Embodiments and examples will be described with reference to the accompanying drawings, in which:
FIG. 1 shows a schematic overview of an example method for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image;
FIG. 2 shows a flow diagram of a method for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image;
FIG. 3 shows a schematic overview of an example method for training a neural network to reconstruct a three-dimensional scene with multiple objects from a single two-dimensional image;
FIG. 4 shows a flow diagram of a method for training a neural network for use in reconstructing a three-dimensional scene with multiple objects from a single two-dimensional image;
FIG. 5 shows a further example of a method for training a neural network for use in three-dimensional scene reconstruction and object re-identification; and
FIG. 6 shows a schematic example of a system/apparatus for performing any of the methods described herein.

This specification describes methods and systems for reconstructing a 3D scene containing multiple objects (e.g., humans) from a single 2D image (e.g., an RGB image). The example methods and systems described herein can recover accurate 3D reconstructions of multiple objects on a mobile device (e.g., a mobile phone) at more than 30 frames per second, while also recovering information about the 3D camera position and world coordinates. The methods and systems described herein may be applied, for example, to real-time augmented reality applications that involve the whole human body, allowing users to control objects positioned on them, e.g., graphics assets attached to their hands, while also allowing objects to interact with humans, e.g., balls bouncing back from humans once they come into contact with their bodies.

This specification also describes CNN design choices that allow the 3D shape of multiple (potentially hundreds of) objects (e.g., humans) to be estimated at more than 30 frames per second on a mobile device. Using the methods described herein, results can be obtained in a constant time of about 30 milliseconds per frame on a SnapDragon 855 Neural Processing Unit (NPU), regardless of the number of objects (e.g., persons) in the scene.

Several aspects described herein may contribute to these effects, alone or in combination.

In some embodiments, network distillation is used to construct the supervision signal for monocular image reconstruction. A detailed, time-consuming model fitting procedure is performed offline to recover a 3D interpretation of every image in the training set. The fitting results are then used to train one or more neural networks (e.g., convolutional neural networks) to process incoming test images efficiently with a single feedforward pass through the network. This allows complex constraints imposed by complementary ground-truth signals to be incorporated during model fitting (e.g., region alignment, and sparse and dense reprojection errors based on keypoints and dense pose respectively) without compromising speed at test time.

In some embodiments, an efficient encoder-only neural network architecture is used for monocular three-dimensional human pose estimation. Earlier works relied on parametric, rigged models of the human surface and on decoder-based networks for exhaustive, accurate localization of human joints in 2D. Instead, a standard deep neural network used for image classification may be repurposed, with its last layer configured to output 3D coordinates for each vertex of the object (e.g., human) mesh. This can substantially accelerate test-time inference, and also makes it straightforward to deploy the networks on mobile devices.

In some embodiments, the neural network may have a single-stage, fully convolutional architecture that densely processes the input 2D image and emits 3D pose estimates for a multitude of image positions in a single pass through a (standard) convolutional network. This results in inference whose time is independent of the number of objects (e.g., persons) in the scene, while also dramatically simplifying the smoothing of the 3D reconstructions over time, since network layers can be averaged over time instead of performing post-hoc smoothing.
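One way to picture averaging layer responses over time is a running mean over the most recent frames. The sketch below is an assumption about how such averaging could be arranged, using a sliding window and plain lists in place of feature maps; the class name and window size are hypothetical.

```python
from collections import deque


class TemporalFeatureAverager:
    """Running mean of the hidden-layer responses of the last `window` frames."""

    def __init__(self, window):
        self.frames = deque(maxlen=window)  # the oldest frame is dropped automatically

    def update(self, features):
        """Add the responses of a new frame and return the element-wise mean."""
        self.frames.append(list(features))
        count = len(self.frames)
        return [sum(frame[i] for frame in self.frames) / count
                for i in range(len(features))]


averager = TemporalFeatureAverager(window=3)
stream = ([1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0])
smoothed = [averager.update(frame) for frame in stream]
```

Since the network is fully convolutional, the feature maps of successive frames have the same shape, so the averaging needs no per-object bookkeeping.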

In some embodiments, a person-based self-calibration method is used that estimates the floor position in the three-dimensional scene by fitting a plane to the estimated 3D positions of objects/sub-objects (e.g., the feet of humans) in the 2D image. This allows the world geometry to be recovered in a way that undoes the scaling effects of perspective projection, while also recovering the camera position with respect to the plane. In turn, recovering the world geometry allows the recovered scene to be augmented, for example by inserting objects that fall onto the floor and bounce back while respecting the laws of physics. The method can operate both when the camera is static, in which case simultaneous localization and mapping (SLAM) methods fail, and when the camera moves, for example by aligning the estimated floor position with the planes recovered by SLAM.
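The plane fit can be sketched as an ordinary least-squares problem. The example below is only an illustration under stated assumptions: the floor is parameterized as y = a*x + b*z + c in camera coordinates (so it must not be vertical in that frame), the camera center is taken as the origin, and the foot positions are synthetic values rather than network outputs.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_floor_plane(points):
    """Least-squares plane y = a*x + b*z + c through 3D points (x, y, z)."""
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    szz = sum(z * z for _, _, z in points)
    sxz = sum(x * z for x, _, z in points)
    sxy = sum(x * y for x, y, _ in points)
    szy = sum(z * y for _, y, z in points)
    normal_matrix = [[sxx, sxz, sx], [sxz, szz, sz], [sx, sz, len(points)]]
    rhs = [sxy, szy, sy]
    d = det3(normal_matrix)
    if abs(d) < 1e-12:
        raise ValueError("points are (nearly) collinear; the plane is not determined")
    coeffs = []
    for col in range(3):  # Cramer's rule: swap one column for the right-hand side
        m = [row[:] for row in normal_matrix]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


def camera_height(a, b, c):
    """Distance from the camera center (origin) to the plane a*x - y + b*z + c = 0."""
    return abs(c) / math.sqrt(a * a + b * b + 1.0)


# Synthetic foot positions lying on the plane y = 0.1*x - 0.05*z + 1.5.
feet = [(x, 0.1 * x - 0.05 * z + 1.5, z)
        for x, z in [(-1.0, 2.0), (1.0, 2.0), (0.0, 4.0), (0.5, 3.0)]]
a, b, c = fit_floor_plane(feet)
height = camera_height(a, b, c)
```

The recovered plane also gives the camera position relative to the floor, which is what lets inserted objects rest on or bounce off the floor consistently.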

In some embodiments, a distributed, part-based variant of this approach gathers information about mesh parts from multiple image positions, as indicated by the estimated object part positions. The body mesh is obtained by compositing the part-level meshes output by the neural network, which allows better handling of occlusions and large articulations while retaining exactly the same memory and computation cost, apart from memory lookup operations.

In some embodiments, the neural network may additionally accommodate the task of object/person re-identification (REID). A teacher-student network distillation approach may be used to train the network to perform this task. REID embeddings are extracted from object (e.g., human) crops from 2D images using a pre-trained REID network, and these are used as a supervision signal to train a REID branch of the neural network. This branch delivers REID embeddings that mimic those of the teacher network, but it can be fully convolutional, meaning that its running speed is independent of the number of objects in the image.
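A rough sketch of the teacher-student idea follows. The specification does not state the loss or the matching rule, so the mean squared error between student and teacher embeddings and the cosine-similarity matching used here are assumptions, as are all names and values.

```python
import math


def distillation_loss(student, teacher):
    """Mean squared error between a student embedding and the teacher's embedding."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_identity(query, gallery):
    """Return the gallery key whose stored embedding is most similar to the query."""
    return max(gallery, key=lambda key: cosine_similarity(query, gallery[key]))


# Hypothetical two-dimensional embeddings; real REID embeddings are much longer.
loss = distillation_loss([1.0, 2.0], [1.0, 4.0])
gallery = {"person_a": [1.0, 0.0], "person_b": [0.0, 1.0]}
best = match_identity([0.9, 0.1], gallery)
```

Training the branch against precomputed teacher embeddings means the teacher network is needed only offline, not at inference time.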

FIG. 1 shows a schematic overview of an example method 100 for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image. The method may be performed by one or more computing devices operating in one or more locations. For example, the method may be performed by a mobile computing device, such as a mobile phone.

One or more 2D images 102, each comprising a plurality of objects 104 of a given type (e.g., people), are input into a neural network 106. In some embodiments, only a single input image 102 is used. The neural network 106 processes the input data to identify objects 104 in the image and to generate output data 108 comprising an estimated 3D representation 110 of each (potential) identified object 104 in the input 2D image 102. In some embodiments, the output data 108 may further comprise estimated coordinates 112 of bounding boxes of the potential objects 104 in the input 2D image 102 and/or probabilities of the presence of objects 104 at positions within the input 2D image 102. The output data 108 is further processed (i.e., post-processed) to generate a three-dimensional reconstruction 116 of the scene in the input 2D image 102. In some embodiments, the output data 108 may further comprise an embedding vector (not shown) associated with each object 104 in the image. The embedding vector provides a REID embedding of the object, so that a given object 104 can be identified and/or tracked across multiple input images 102.

The 2D input image 102, I, comprises a set of pixel values corresponding to a two-dimensional array. For example, in a color image, $I \in \mathbb{R}^{H \times W \times 3}$, where H is the height of the image in pixels, W is the width of the image in pixels, and the image has three color channels (e.g., RGB or CIELAB). In some embodiments, the 2D image may be black-and-white/grayscale.

In some embodiments, the method 100 may use multiple (e.g., a sequence of) input 2D images 102 (e.g., of multiple persons moving). The 3D representations 110 may take these changes into account. The multiple input 2D images 102 may be received substantially in real time (e.g., at 30 fps), for example from a camera on a mobile device. The neural network 106 may process each of these input 2D images 102 individually, with the hidden layer responses for each input 2D image 102 being combined during post-processing when generating the 3D scene.

The neural network 106 takes the 2D image 102 as input and processes it through a plurality of neural network layers to generate the output data 108. The neural network 106 is a deep machine learning model that comprises an output layer and one or more hidden layers, each of which applies a non-linear transformation to a received input to generate an output. The neural network may predict three-dimensional landmark positions of multiple objects by concatenating feature data from one or more intermediate layers of the neural network. The predicted three-dimensional positions may be estimated simultaneously for the predicted type of object depicted in each region.

Each layer of the neural network 106 comprises a plurality of nodes (also referred to herein as "neurons"), each of which is associated with one or more neural network parameters (e.g., weights and/or biases). Each node takes as input the output of one or more nodes in the previous layer (or, for the first layer of the neural network, the input to the network) and applies a transformation to that input based on the parameters associated with the node. The transformation may be a non-linear transformation. Alternatively, some of the nodes may apply a linear transformation.

The neural network 106 may comprise one or more convolutional layers, each configured to apply one or more convolutional filters to the output of a previous layer in the network. In some embodiments, the neural network 106 is fully convolutional. The neural network may comprise one or more fully connected layers, in which each node receives input from every node in the previous layer. The neural network 106 may comprise one or more skip connections. The neural network 106 may comprise a residual neural network, such as ResNet-50. The residual network may be used as a backbone network.

The neural network 106 may be a single-stage system that jointly detects objects (e.g., humans) and estimates their 3D shape by performing a single forward pass through the neural network 106 (which may be a fully convolutional neural network). Instead of cropping image patches around object detection regions and then processing them again, the task of extracting region-specific features is delegated to the neurons of successive layers of the neural network 106, which have increasingly large receptive fields. A ResNet-50 backbone may be used, which provides an excellent trade-off between speed and accuracy. Atrous (à trous) convolutions may be used (see, for example, L. Chen et al., "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs", PAMI, 2017.3, the contents of which are incorporated herein by reference), which allow the spatial density at which object (e.g., person) hypotheses are evaluated to be increased and can reduce the number of missed detections.
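To illustrate what an atrous convolution does, the sketch below applies the same three-tap kernel with different sampling rates. It is a one-dimensional simplification written for illustration (the cited work uses 2D feature maps), and the function name and values are hypothetical.

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1D cross-correlation whose kernel taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate  # extent of the input covered by one output
    return [sum(k * signal[i + j * rate] for j, k in enumerate(kernel))
            for i in range(len(signal) - span)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 0.0, -1.0]
dense = atrous_conv1d(signal, kernel, rate=1)    # each output covers 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # each output covers 5 samples
```

With rate 2 the same three weights cover five input samples, so the receptive field grows without extra parameters and without downsampling the feature map.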

The final layer of the neural network 106 (e.g., a fully convolutional neural network) is tasked with predicting output data 108 (e.g., multiple outputs) at each of its neurons, corresponding to the attributes of the person hypothesized to be at the respective position.

In some embodiments, one or more of the layers of the neural network 106 may be 1x1 convolutional layers. For example, the final layer of the neural network may include one or more 1x1 convolutions. Examples of such layers are described in "Fully Convolutional Networks for Semantic Segmentation" (E. Shelhamer et al., IEEE Trans. Pattern Anal. Mach. Intell. 39(4): 640-651 (2017)) and "Overfeat: Integrated recognition, localization and detection using convolutional networks" (P. Sermanet et al., 2nd International Conference on Learning Representations, 2014), the contents of which are hereby incorporated by reference.

The output data 108 includes a representation of a mesh 110 for each detected object. This may be in the form of a K = N×3-dimensional vector that captures the shape of the object, where N is the number of nodes in the mesh. N may be 563, although other numbers of mesh nodes are possible. The output data 108 may further include a probability 114 of the presence of an object (e.g., a person). The output data 108 may further include the corners of the bounding boxes 112 of objects in the input image 102.

In some embodiments, the predictions of the neural network 106 may be specific to the object type. For example, one neural network 106 may predict the 3D positions of facial parts, fingers, and limbs for humans, whereas a different neural network 106 may be dedicated to wheel, window, door, and light positions for cars. It should be understood that object landmarks that are expected to be present given the identified object type may not be present in the image. For example, if the object type is a human, parts of the body may not be visible because of the pose or because other items obstruct the view.

The use of a neural network 106 as described above may result in several advantages. First, inference is efficient, bypassing the need for deconvolution-based decoders for high-accuracy pose estimation. Instead, the architecture is "encoder-only", reducing the time required to process an M x M (e.g., 233 x 233) patch to a few milliseconds. Second, the resulting networks are easily portable to mobile devices because they rely exclusively on generic convolutional network layers, in contrast to the rigged body models used in earlier work. Third, it becomes straightforward to extend the resulting model so that an entire image can be processed in a fully convolutional manner, instead of processing individual patches.

Beyond simplicity, the resulting architecture also makes it straightforward to perform temporal smoothing of the reconstructed 3D shapes. Instead of the complex parameter-tracking-based methods used previously, a moving average of the penultimate-layer activations of the network 106 may be taken. This substantially stabilizes the 3D shapes recovered by the network, while requiring practically no effort to integrate as a processing step on a mobile device.
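The moving average described above can be illustrated with a minimal sketch. The class name `ActivationSmoother`, the window length, and the representation of activations as plain lists of floats are assumptions made for illustration only; the specification does not state the window size or the averaging scheme.

```python
from collections import deque


class ActivationSmoother:
    """Windowed moving average over per-frame activation vectors (hypothetical helper)."""

    def __init__(self, window=5):
        # Only the most recent `window` frames are retained.
        self.frames = deque(maxlen=window)

    def update(self, activations):
        """Add one frame of activations and return the element-wise average."""
        self.frames.append(list(activations))
        count = len(self.frames)
        dim = len(self.frames[0])
        return [sum(frame[d] for frame in self.frames) / count for d in range(dim)]


# Toy activations for four consecutive frames (two features each).
smoother = ActivationSmoother(window=3)
stream = ([0.0, 3.0], [3.0, 3.0], [6.0, 3.0], [9.0, 3.0])
outputs = [smoother.update(a) for a in stream]
```

In such a sketch the smoothed activations, rather than the raw ones, would be passed to the final layer, so the smoothing adds only one averaging step per frame.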

In some embodiments, the neural network 106 may output the full object mesh V of an object (e.g., a person) based on a high-dimensional feature vector F computed at a position i aligned with a predetermined point in the object (e.g., the sternum of a person). The feature vector may be output by a layer in the neural network and may, for example, correspond to the activations of a hidden layer in the network. The mapping may be expressed as:

where M denotes the mapping from the feature vector F to the output mesh V. M may, in some embodiments, be implemented as a linear layer in the neural network.
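For the case in which M is a linear layer, one form consistent with the symbols defined above can be written as follows. This is offered as an assumption rather than as the equation of the specification; the feature dimension D is a hypothetical symbol and any bias term is omitted.

```latex
V = M\,F[i], \qquad F[i] \in \mathbb{R}^{D}, \quad M \in \mathbb{R}^{3N \times D}, \quad V \in \mathbb{R}^{3N}
```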

The simplicity of this "centralized" method (i.e., one aligned with a single predetermined point within the object) is offset by the challenge of accounting for all of the details and variations of the object (e.g., human) mesh through F[i], which may act as a computational bottleneck. In such a method, the position of a given node in the output mesh 110 may be based on the activations of neurons in the neural network that are associated with locations far from that mesh node. This may introduce inaccuracies in the resulting output mesh 110.

In some embodiments, a part-based method may alternatively be used to reconstruct the 3D representation 110 of the object. The part-based approach uses a distributed approach to mesh regression, which may be substantially more accurate than the centralized method. The part-based approach is useful in situations where the objects 104 can be divided into parts. For example, the human body can naturally be divided into parts such as the head, torso, arms, and hands.

Taking the human body as an example, neurons associated with locations close to object parts (e.g., hands, feet) that are far from the predetermined point (e.g., the sternum) can deliver more reliable estimates for the respective mesh parts than neurons associated with more distant locations. These closer neurons act like "distributed part experts" that provide reliable information about the body parts in their vicinity. The information from these distributed part experts may be combined using a coordinator node located at the predetermined point (e.g., the sternum) in order to determine the output mesh 110. Information about a particular mesh node associated with a given part that comes from neurons associated with locations further away from that part (i.e., not associated with the part) may be suppressed or discarded during the determination of the output mesh 110. This may be achieved using a part-level attention matrix A.

The part positions that provide information to the coordinator are determined by a human skeleton computation stage, which may associate the sternum position with the corresponding part positions. The part positions may, for example, be based on joint positions of the human body.

In the part-based approach, the neural network 106 outputs a separate part mesh V[p] for each part p of the object 104 in a predetermined list of object parts. Each part mesh V[p] is provided in its own coordinate system, whose origin is offset with respect to a reference point c (also referred to herein as a "coordinator node"), e.g., the sternum of a human. That is, {p_1, ..., p_M} denotes the positions of each of the M parts associated with the coordinator (e.g., sternum) position c. The part positions may be based on key points of the objects, for example joint positions within the human body.

To reconstruct the full mesh V from the part meshes V[p], the coordinator uses a part-level attention matrix A to select from V[p] the nodes that are relevant to part p. In some embodiments, A is a binary matrix indicating which part should be used for each vertex in the final mesh V. For such an attention matrix, the final mesh V may be reconstructed for each mesh vertex v in the final mesh using:

where V[c,v] denotes the position of mesh node v relative to the predetermined point c, p_v denotes the position of the part that is the "active expert" for node v (i.e., the part for which A[p,v] = 1 in the binary example), and V[p,v] denotes the position of mesh node v in the mesh associated with part p. (c - p_v) is an offset vector describing the relative position between the part and the reference position c (e.g., the sternum).

In some embodiments, the part-level attention matrix A may be a learned matrix determined during training. In such an example, the position of a given mesh node v may be based on information from one or more parts. Thus, using equation (2) rather than equation (3) may result in a weighted sum over more than one part.
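The combination of part meshes through the attention matrix can be sketched as follows. The sketch assumes that the offset (c - p) enters additively, as the symbol definitions above suggest, that part positions and c are expressed as 3D vectors, and that each column of A sums to one; the function name `assemble_mesh` and all data values are hypothetical. With a binary A it selects one part per vertex, and with a learned, non-binary A it yields a weighted sum over parts.

```python
def assemble_mesh(part_meshes, part_positions, c, attention):
    """Combine per-part meshes into one mesh expressed relative to c.

    part_meshes[p][v]  : position of vertex v in the frame of part p (V[p, v])
    part_positions[p]  : position of part p
    c                  : position of the coordinator node (e.g., sternum)
    attention[p][v]    : weight A[p, v]; each column is assumed to sum to 1
    """
    num_parts = len(part_meshes)
    num_vertices = len(part_meshes[0])
    mesh = []
    for v in range(num_vertices):
        vertex = [0.0, 0.0, 0.0]
        for p in range(num_parts):
            weight = attention[p][v]
            if weight == 0.0:
                continue  # this part is not an expert for vertex v
            for d in range(3):
                offset = c[d] - part_positions[p][d]  # (c - p), as in the text
                vertex[d] += weight * (part_meshes[p][v][d] + offset)
        mesh.append(vertex)
    return mesh


# Toy example: two parts, two vertices, binary attention.
c = (0.0, 0.0, 0.0)
part_positions = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
part_meshes = [
    [(1.0, 1.0, 1.0), (9.0, 9.0, 9.0)],  # part 0: only vertex 0 is trusted
    [(9.0, 9.0, 9.0), (2.0, 2.0, 2.0)],  # part 1: only vertex 1 is trusted
]
attention = [[1.0, 0.0], [0.0, 1.0]]
full_mesh = assemble_mesh(part_meshes, part_positions, c, attention)
```

The unreliable entries (the 9.0 values) never reach the output because their attention weight is zero, which is the suppression behaviour described above.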

In some embodiments, the output data 108 further includes estimated coordinates 112 of the bounding boxes of potential objects 104 in the input 2D image 102 and/or probabilities of the presence of objects 104 at locations within the input 2D image 102.

The estimated 3D representation 110 of each (potential) identified object 104 in the input 2D image 102 is, in some embodiments, in the form of an N-vertex mesh V. For example, the estimated 3D representation 110 of each identified object 104 may be in the form of an (N×3)-dimensional vector giving the 3D coordinates of the N vertices in the mesh. In some embodiments N = 536, although other numbers of mesh nodes are possible.
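As a small illustration of the (N×3)-dimensional representation, the following sketch unpacks a flat output vector into N vertices. The interleaved (x, y, z) ordering and the helper name `to_vertices` are assumptions; the specification does not state how the coordinates are laid out within the vector.

```python
def to_vertices(flat):
    """Split a flat K = N*3 vector into N (x, y, z) tuples (assumed interleaved layout)."""
    if len(flat) % 3 != 0:
        raise ValueError("vector length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


N = 536  # example vertex count from the text
flat_output = [float(i) for i in range(N * 3)]  # placeholder values
vertices = to_vertices(flat_output)
```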

The coordinates of the bounding boxes may include, for each potential object, the x and y positions of the vertices of the bounding box in the input 2D image 102, for example {(x₁, y₁), (x₂, y₂), (x₃, y₃), (x₄, y₄)} for each potential object. Each bounding box may be associated with a corresponding probability that the bounding box contains an object of a given type (e.g., a person).

The 3D reconstruction 116 of the scene in the input 2D image 102 is generated by taking perspective projection into account and undoing its effects when estimating world coordinates. The 3D reconstruction 116 is generated by estimating a 3D plane 118 (also referred to herein as a "support plane") that physically supports the objects 104 identified in the image 102 (e.g., a floor or the ground), and by placing all of the 3D objects 110 in space relative to the support plane.

Estimating the 3D plane 118 that physically supports the objects 104 identified in the image 102 may be based on the assumption that the objects (e.g., humans) are supported by a single physical plane. In some embodiments, this assumption is relaxed by using mixture models and using expectation maximization to assign different humans to different planes.

Estimating the 3D plane 118 may also be based on the assumption that the heights of the objects (e.g., humans) are approximately equal. This latter assumption may be relaxed if a sequence of input images 102 is available in which individual objects can be tracked over time and the effects of perspective projection are monitored over time for each object.

To place the 3D objects 110 in space relative to the support plane, the scaling of each mesh is estimated in order to bring it into world coordinates. The scale is inversely proportional to the distance of the object (e.g., a person) from the camera that captured the scene in the input image 102. The scaling can therefore be used to position the mesh in world coordinates along the line connecting the mesh vertices to the camera center.

As an example, the 3D objects 110 may be provided by the neural network 106 in the form of meshes estimated under orthographic projection. Given the i-th mesh in the scene with vertices m_i = {v_{i,1}, ..., v_{i,K}}, where v_{i,k} = (x_{i,k}, y_{i,k}, z_{i,k}) ∈ R³ are the mesh coordinates estimated under scaled orthographic projection and s_i is the scale, the mesh vertices are positioned in 3D world coordinates by undoing the effects of perspective projection. For example, the world coordinates V_{i,k} = (X_{i,k}, Y_{i,k}, Z_{i,k}) of the k-th point in the i-th mesh may be estimated from the corresponding mesh coordinates estimated under scaled orthographic projection by "pushing back" the mesh depth in world coordinates by the inverse of the scale factor, and by setting the X and Y world coordinates such that they project back correctly to the corresponding pixel coordinate values x and y. Symbolically, this may be expressed as:

where the camera calibration matrix has center c_x = W/2 and c_y = H/2 (W and H being the image dimensions, i.e., the image width and height) and focal length f. These may be set manually or by camera calibration.
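The undoing of perspective projection can be sketched under a standard pinhole camera model. The depth rule used here (a base depth f / s plus the relative mesh depth z / s) is an assumption chosen so that depth grows as the scale shrinks, as the description requires; it is not taken from the specification, and the function names are hypothetical.

```python
def unproject_vertex(x, y, z, s, f, W, H):
    """Lift one mesh vertex from scaled-orthographic image space to world coordinates."""
    cx, cy = W / 2.0, H / 2.0
    # Assumed depth rule: "push back" by the inverse of the scale factor.
    Z = f / s + z / s
    # Choose X and Y so the point reprojects onto pixel (x, y).
    X = (x - cx) * Z / f
    Y = (y - cy) * Z / f
    return X, Y, Z


def project_point(X, Y, Z, f, W, H):
    """Pinhole projection back to pixel coordinates."""
    return f * X / Z + W / 2.0, f * Y / Z + H / 2.0


f, W, H = 500.0, 640, 480
near = unproject_vertex(400.0, 300.0, 10.0, 2.0, f, W, H)  # larger scale: closer
far = unproject_vertex(400.0, 300.0, 10.0, 1.0, f, W, H)   # smaller scale: farther
pixel = project_point(*near, f, W, H)
```

Reprojecting `near` recovers the original pixel position, which is the consistency condition the paragraph above describes.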

For each mesh corresponding to an identified object, the lowest point (i.e., the one with the lowest Y value) is determined and used to estimate the point of contact between the object and the support plane (e.g., the floor). Once at least four such points have been determined (collectively denoted M), the determined points can be used to estimate the support plane in world coordinates. In some embodiments, a least squares method may be used to estimate the plane in world coordinates, for example:

where V₁ = (a, b, c) is a vector perpendicular to the support plane. This vector may, in some embodiments, be normalized. The world coordinate axes R = [v₁^T, v₂^T, v₃^T]^T are defined by finding two complementary directions V₂ and V₃, both of which are orthogonal to V₁ and to each other. In some embodiments, V₂ is chosen in the direction of Z and V₃ = V₁ × V₂. The vectors V₂ and V₃ may, in some embodiments, be normalized, i.e., the set {V₁, V₂, V₃} forms an orthonormal basis.
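A minimal sketch of the plane estimation and axis construction follows. It assumes the plane is parameterized as a·X + b·Y + c·Z = 1 (which excludes planes through the camera center) and that V₂ is obtained by projecting the Z direction onto the plane so that it is exactly orthogonal to V₁; both choices, and all function names, are illustrative assumptions rather than the formulation of the specification.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points):
    """Least-squares (a, b, c) for a*X + b*Y + c*Z = 1 via the normal equations."""
    ata = [[sum(p[i] * p[j] for p in points) for j in range(3)] for i in range(3)]
    atb = [sum(p[i] for p in points) for i in range(3)]
    d = det3(ata)
    if abs(d) < 1e-12:
        raise ValueError("contact points are degenerate")
    normal = []
    for col in range(3):  # Cramer's rule, one unknown per column
        m = [row[:] for row in ata]
        for r in range(3):
            m[r][col] = atb[r]
        normal.append(det3(m) / d)
    return normal


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def world_axes(normal, z_axis=(0.0, 0.0, 1.0)):
    """Orthonormal axes: V1 = plane normal, V2 = Z projected onto the plane, V3 = V1 x V2.

    Fails if the normal is parallel to z_axis (V2 would be zero).
    """
    v1 = normalize(normal)
    along = sum(z * n for z, n in zip(z_axis, v1))
    v2 = normalize([z - along * n for z, n in zip(z_axis, v1)])
    return v1, v2, cross(v1, v2)


# Four toy contact points lying on the plane Y = 2.
contacts = [(0.0, 2.0, 1.0), (1.0, 2.0, 3.0), (-1.0, 2.0, 5.0), (2.0, 2.0, 4.0)]
normal = fit_plane(contacts)
v1, v2, v3 = world_axes(normal)
```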

The world coordinate center T may also be assigned as a 3D point lying on the plane. For example, it may be set to a point that is 3 meters from the camera and that projects to y = H/2.

In some embodiments, the above may be used to define a transformation between the world coordinate system and pixel locations in the 2D image 102. For example, the world-to-camera transformation and the camera calibration matrix may be combined in a single 3x4 perspective projection matrix P, given by:

where the direction vectors V appear as rows rather than columns because the inverse rotation matrix is being used, with R⁻¹ = R^T.

Using homogeneous coordinates, world coordinates C may be transformed into pixel coordinates c using:
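The homogeneous-coordinate projection can be sketched generically. Here P is assembled as K·[R | t] from a hypothetical world-to-camera rotation `R_wc` and translation `t_wc`; how these relate to the axes V₁, V₂, V₃ and the center T follows the specification's own equation, which this sketch does not attempt to reproduce.

```python
def projection_matrix(f, W, H, R_wc, t_wc):
    """Build a 3x4 matrix P = K [R | t] (R_wc and t_wc are hypothetical inputs)."""
    K = [[f, 0.0, W / 2.0],
         [0.0, f, H / 2.0],
         [0.0, 0.0, 1.0]]
    Rt = [list(R_wc[r]) + [t_wc[r]] for r in range(3)]  # 3x4 rigid transform
    return [[sum(K[r][k] * Rt[k][col] for k in range(3)) for col in range(4)]
            for r in range(3)]


def world_to_pixel(P, C):
    """Project a 3D world point C to pixel coordinates via homogeneous coordinates."""
    Ch = list(C) + [1.0]
    u, v, w = (sum(P[r][k] * Ch[k] for k in range(4)) for r in range(3))
    return u / w, v / w  # divide out the homogeneous scale


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P = projection_matrix(500.0, 640, 480, identity, [0.0, 0.0, 5.0])
pixel = world_to_pixel(P, (1.0, 2.0, 5.0))
```

A synthetic object placed at a world position can be drawn at the pixel returned by `world_to_pixel`, which is the use the following paragraph describes.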

The coordinate transformation may, in some embodiments, be used to introduce into the scene or image objects that obey the laws of physics and/or interact with other objects in a meaningful and realistic way. This can lead to interactive applications that require real-world coordinates, for example augmented reality games such as games in which a player attempts to hit another player with a sword or a laser beam. In some embodiments, the reconstructed 3D meshes may be projected back onto the input image 102 to show the accuracy of the 3D human pose estimates.

In embodiments where multiple input 2D images 102 are used, estimating the plane that supports the multiple objects may be performed over a sequence of frames using relative camera pose estimation and plane localization using correspondences between points in successive frames.

Processing (e.g., post-processing) may take place over multiple consecutive frames by combining the hidden-layer responses in consecutive frames, for example by averaging them.

According to some of the example embodiments, synthetic objects may be controlled by, or interact with, objects (e.g., a person) in the input image. For example, an added synthetic object may be a sword or a laser beam that a person controls with his or her arm in a computer game, or an object that moves towards the person and bounces back when it comes into contact with the estimated three-dimensional position of the person.

The method may be integrated into a graphics engine such as Unity. A user can interact in real time with images or video of himself or herself and/or view meshes superimposed on himself or herself.

Combining multiple measurements over time, taken from successive input images 102 (for example, in video), provides further improved estimates, while the methods can also be combined with camera tracking, as permitted by simultaneous localization and mapping (SLAM).

In the case of a combination of deformable objects, such as humans, with rigid 3D scenes, both are reconstructed in metric coordinates.

FIG. 2 shows a flow diagram of an example method for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image. The method may be performed by one or more computers operating at one or more locations. The method may correspond to the method described in relation to FIG. 1.

At operation 2.1, a 2D image is received. The 2D image comprises an array of pixel values, for example an RGB image. A single 2D input image may be received. In some embodiments, a sequence of single images may be received.

At operation 2.2, the objects in the image to be reconstructed are identified. The object types may also be identified. Operation 2.2 may be performed by one or more layers of a neural network, such as the neural network described above in relation to FIG. 1.

At operation 2.3, a 3D representation of each identified object is estimated. The 3D representation may be in the form of a mesh, for example a K = N×3-dimensional vector of N vertex positions in 3D. Operation 2.3 may be performed by one or more layers of a neural network, such as the neural network described above in relation to FIG. 1.

The operation of estimating the 3D representations may be performed in a deep machine learning model, e.g., a neural network, which includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output. The deep machine learning model may predict the 3D landmark positions of multiple objects by concatenating feature data from one or more intermediate layers of the neural network. The predicted 3D positions are estimated simultaneously for the predicted type of object depicted in each region.

Where a sequence of input images is received, the steps of estimating the three-dimensional representations of the multiple objects and placing them on the plane are performed for each received image. If the images are received in substantially real time (e.g., 30 fps), these operations may be performed in substantially real time.

At operation 2.4, a 3D plane that physically supports all of the three-dimensional objects is estimated. Operation 2.4 may be performed in a post-processing step, i.e., after the input image has been processed by the neural network. Methods for estimating the support plane are described in more detail above with reference to FIG. 1.

Estimating the plane that supports the multiple objects may be performed for a single frame by using the estimated three-dimensional positions of all visible objects to reconstruct a two-dimensional plane passing through all of the visible objects. For example, the support plane may be estimated based on the estimated positions of the contact points between the objects and the plane, e.g., the positions of the feet of the people identified in the input image. Estimating the plane that supports the multiple objects may alternatively be performed over a sequence of frames using relative camera pose estimation and plane localization using correspondences between points in successive frames.

At operation 2.5, the 3D objects are positioned in space relative to the support plane. Operation 2.5 may be performed in a post-processing step, i.e., after the input image has been processed by the neural network. Methods for positioning the 3D objects relative to the support plane are described in more detail above with reference to FIG. 1.

FIG. 3 shows a schematic overview of an example method 300 for training a neural network to reconstruct a three-dimensional scene with multiple objects from a single two-dimensional image. The method may be performed by one or more computers operating at one or more locations.

A 2D training image 302 is obtained from a training dataset comprising a plurality of 2D images. The 2D images include one or more objects 304 of a given type (e.g., one or more humans). A 3D deformable model is fitted (e.g., iteratively fitted) to the objects in the input image 302 to extract one or more supervision signals 306 for monocular 3D reconstruction. The training image 302 is input to a neural network 308 and processed through a sequence of neural network layers using the current parameters of the neural network 308, in order to generate output data 310 comprising candidate 3D representations of each (potential) object in the input training image 302. The output data is compared with the supervision signals 306 of the input image 302 to determine parameter updates 312 for the neural network 308.

The method may be performed iteratively until a threshold condition is satisfied.

The input images 302 and the neural network 308 may be of the same form as described in relation to FIG. 1.

During the deformable model fitting/optimization stage, all of the available 2D and 3D ground-truth data is used to provide energy terms that are minimized in order to obtain the estimated 3D shape (i.e., the supervision signal 306). As an example, the method described in "Holopose: Holistic 3d human reconstruction in-the-wild" (R. Alp Guler and I. Kokkinos, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 2, 3, the contents of which are incorporated herein by reference), which uses CNN-driven iterative fitting of deformable models to 2D images, may be used.

Fitting the three-dimensional model may be performed by projecting the three-dimensional representation onto the two-dimensional image plane to produce a projected representation. The projection may be performed by taking into account the effects of perspective projection and, where multiple views are available, by using multiple views of the same object. The respective positions of the projected representation may then be compared with the object in the single two-dimensional image, and an error value may be determined based on the comparison, for example by determining the difference between the object landmark positions in the training image 302 and in the reprojected image. The parameters of the fused three-dimensional representation are then adjusted based on the error value. The projecting, comparing, measuring, and adjusting may be repeated iteratively until the measured error value is below a predetermined threshold value or a threshold number of iterations is exceeded.
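The project, compare, measure, and adjust loop can be sketched on a toy problem. The sketch assumes a much simpler model than the deformable model of the specification: a fixed set of 3D landmarks is projected with a scaled orthographic camera (scale s and 2D translation tx, ty), and the three parameters are adjusted by plain gradient descent on the mean squared landmark error. The learning rate, tolerance, and function names are hypothetical.

```python
def project(points3d, s, tx, ty):
    """Scaled orthographic projection of 3D landmarks (toy stand-in for the camera model)."""
    return [(s * X + tx, s * Y + ty) for X, Y, _ in points3d]


def fit(points3d, observed2d, lr=0.3, tol=1e-10, max_iters=1000):
    """Iterate project -> compare -> measure error -> adjust parameters."""
    s, tx, ty = 1.0, 0.0, 0.0
    n = len(points3d)
    error = float("inf")
    iterations = 0
    for iterations in range(max_iters):
        projected = project(points3d, s, tx, ty)
        residuals = [(px - ox, py - oy)
                     for (px, py), (ox, oy) in zip(projected, observed2d)]
        error = sum(rx * rx + ry * ry for rx, ry in residuals) / n
        if error < tol:  # stop once the error is below the threshold
            break
        # Gradient of the mean squared error with respect to each parameter.
        grad_s = sum(2 * (rx * X + ry * Y)
                     for (rx, ry), (X, Y, _) in zip(residuals, points3d)) / n
        grad_tx = sum(2 * rx for rx, _ in residuals) / n
        grad_ty = sum(2 * ry for _, ry in residuals) / n
        s -= lr * grad_s
        tx -= lr * grad_tx
        ty -= lr * grad_ty
    return s, tx, ty, error, iterations


landmarks = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
observed = project(landmarks, 2.0, 3.0, -1.0)  # synthetic "ground truth" landmarks
s, tx, ty, error, iterations = fit(landmarks, observed)
```

Both stopping conditions named in the paragraph above appear in the loop: the error threshold (`tol`) and the cap on the number of iterations (`max_iters`).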

Once the deformable model fitting converges, the estimated 3D surface is used as the supervisory signal 306, or target, for training the network 308. For example, a "low-poly" representation of the objects 304 (e.g., human bodies) represents the mesh in terms of N (e.g., N=536) vertices that cover the significant landmarks of the object (e.g., for a human, facial parts, limb joints, etc.).

From the fits of the deformable model to the image, the estimated 3D positions of these vertices around each object present in the image are measured. The neural network 308 is then trained to regress this K = N×3-dimensional vector when an object 304 is present in its receptive field. In the presence of multiple objects, different responses are expected at different image positions. The positions of the bounding boxes of objects in the input image 302 and/or the presence/absence of objects at positions in the input image may additionally be used as supervisory signals 306.
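As a small illustration of the regression target, the N fitted vertices can be flattened into a single vector of K = N×3 values and recovered again. This is only a sketch of the bookkeeping; the function names are hypothetical and the vertex ordering (x, y, z per vertex) is an assumption.

```python
def flatten_vertices(vertices):
    """Turn N (x, y, z) vertices into one list of K = N * 3 numbers."""
    return [coordinate for vertex in vertices for coordinate in vertex]


def unflatten_vertices(vector):
    """Recover N (x, y, z) vertices from a K = N * 3 vector."""
    if len(vector) % 3 != 0:
        raise ValueError("vector length must be a multiple of 3")
    return [tuple(vector[i:i + 3]) for i in range(0, len(vector), 3)]
```

With the example value N=536 from the text, the regression target has K = 1608 entries per hypothesized object.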

The task of extracting region-specific features is delegated to successive layers of neurons with increasingly large receptive fields. A ResNet-50 backbone, which offers a good trade-off between speed and accuracy, may be used. Atrous convolutions may be used, which allow an increase in the spatial density at which person hypotheses are evaluated and a reduction in the number of missed detections. The final layer of the resulting fully convolutional network is tasked with predicting multiple outputs at each of its neurons, corresponding to the properties of the person hypothesized to be at the respective position. In some embodiments, the output data 310 further comprises a probability of the presence of an object (e.g., a person). The corners of the object bounding box may also be regressed and form part of the output data 310.

The comparison of the output data with the supervisory signals 306 of the input image 302 may be performed using a loss/objective function, L. The loss function includes terms that penalize differences between the candidate 3D representation and the 3D representation in the supervisory signal 306. Many examples of such loss functions are known in the art, such as L₁ or L₂ losses. The loss function may also include terms that penalize differences between the positions of the candidate bounding boxes output by the neural network 308 and the ground-truth bounding box positions in the input training image 302. Many examples of such loss functions are known in the art, such as L₁ or L₂ losses. The loss function may also include terms that penalize the output probabilities of the neural network 308 based on the actual presence/absence of objects at that position. Many examples of such loss functions are known in the art, such as classification loss functions.

In some embodiments, the loss may penalize the 3D representation and the bounding box position only in the presence of objects (e.g., people). In the absence of an object, the box and 3D shape can take arbitrary values, on the understanding that the object detector will prune the resulting hypotheses.
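One way to combine these terms at a single image position is sketched below. It assumes an L₁ penalty for the mesh and box terms, binary cross-entropy as the classification loss, and unit weights; the dictionary keys and the function names are hypothetical and chosen only for illustration. The mesh and box terms are applied only where an object is actually present.

```python
import math


def l1(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def binary_cross_entropy(prob, present, eps=1e-7):
    """Classification loss on the predicted presence probability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -math.log(p) if present else -math.log(1.0 - p)


def location_loss(pred, target, w_mesh=1.0, w_box=1.0, w_cls=1.0):
    """Loss for one image position; mesh and box are penalized only when
    an object is present, so their values are free elsewhere."""
    loss = w_cls * binary_cross_entropy(pred["prob"], target["present"])
    if target["present"]:
        loss += w_mesh * l1(pred["mesh"], target["mesh"])
        loss += w_box * l1(pred["box"], target["box"])
    return loss
```

At a position without an object, only the presence term contributes, whatever the predicted box and mesh values happen to be.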

The parameter updates 312 may be determined by applying an optimization procedure, such as stochastic gradient descent, to the loss/objective function. The loss function may be averaged over a batch of training images before each set of parameter updates is determined. The process may be repeated over a plurality of batches in the training dataset.
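The batch-averaged update can be sketched on a toy problem. The example below is an assumption-laden stand-in, not the network training itself: the "model" is a single weight `w` fitted to pairs (x, y) with a squared-error loss, and the names `batch_loss_and_grad` and `train` are hypothetical.

```python
import random


def batch_loss_and_grad(w, batch):
    """Loss and gradient averaged over one batch of (x, y) examples."""
    n = len(batch)
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    grad = sum(2 * x * (w * x - y) for x, y in batch) / n
    return loss, grad


def train(data, w=0.0, lr=0.05, batch_size=2, epochs=200, seed=0):
    """Stochastic gradient descent: shuffle, split into batches, and apply
    one parameter update per batch using the batch-averaged gradient."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            _, grad = batch_loss_and_grad(w, batch)
            w -= lr * grad
    return w
```

In practice the gradient would be obtained by backpropagation through the network, and the learning rate and batch size would be chosen for the task.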

In embodiments where a part-based approach to mesh estimation is used, each of the 3D representations in the supervisory signal 306 may be divided into parts. The division of each supervisory signal 306 into parts may be based on key points of the object 304. For example, the supervisory signal 306 may be divided into body parts (e.g., hands, feet, etc.) based on joint positions. When determining the parameter updates 312, the part meshes V[p] output by the neural network 308 are compared with the corresponding part meshes in the supervisory signal 306.
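A possible way to realize the part split and the per-part comparison is sketched below. The text only states that the division is based on key points; assigning each vertex to its nearest key point is an assumption made here for illustration, as are the function names and the use of a mean L₁ distance per part.

```python
import math


def assign_parts(vertices, keypoints):
    """Map each part name to the indices of the vertices closest to its key point."""
    parts = {name: [] for name in keypoints}
    for index, vertex in enumerate(vertices):
        nearest = min(keypoints, key=lambda name: math.dist(vertex, keypoints[name]))
        parts[nearest].append(index)
    return parts


def part_losses(pred_vertices, target_vertices, parts):
    """Mean L1 distance between predicted and target vertices, per part."""
    losses = {}
    for name, indices in parts.items():
        total = 0.0
        for i in indices:
            total += sum(abs(p - t) for p, t in zip(pred_vertices[i], target_vertices[i]))
        losses[name] = total / max(len(indices), 1)
    return losses
```

The per-part losses could then be summed, possibly with part-specific weights, when determining the parameter updates.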

FIG. 4 shows a flow diagram of a method for training a neural network for use in reconstructing a three-dimensional scene with multiple objects from a single two-dimensional image. The method may be performed by one or more computers operating in one or more locations. The method may correspond to the training method described with reference to FIG. 3.

In operation 4.1, one or more two-dimensional images are received. The 2D images may be in the form of the image described with reference to FIG. 1. Each 2D image comprises one or more objects of a given type (e.g., humans).

In operation 4.2, a training signal for three-dimensional reconstruction is obtained by adapting a three-dimensional model of an object to each two-dimensional image. This effectively generates a labelled training dataset comprising a plurality of 2D images, each associated with one or more 3D representations of the objects (e.g., humans) in the image.

Fitting the three-dimensional model may be performed by: projecting the three-dimensional representation onto a two-dimensional image plane to generate a projected representation; comparing respective positions of the projected representation with the object in the single two-dimensional image; measuring an error value based on the comparison; and adjusting parameters of the fused three-dimensional representation based on the error value, wherein the comparing, measuring and adjusting are repeated iteratively until a threshold condition is satisfied. The threshold condition may be that the measured error value falls below a predetermined threshold value and/or that a threshold number of iterations is exceeded. The projection may take into account the effects of perspective projection and, where multiple views are available, may use multiple views of the same object.

In operation 4.3, the resulting three-dimensional model fitting results are used as supervisory signals for training a deep machine learning model. The deep machine learning model may be trained as described above in relation to FIG. 3.

FIG. 5 shows a further example of a method 500 for training a neural network for use in three-dimensional scene reconstruction and object re-identification. The method may be performed by one or more computers operating in one or more locations.

The method 500 is an extension of the method shown and described above in relation to FIG. 3, and includes training a branch of the neural network 508 to output one or more re-identification (REID) embedding vectors 516a, 516b as an additional part of the output data 510. As such, any of the features described in relation to FIG. 3 may additionally be combined with the features described in relation to FIG. 5. A neural network 508 trained according to the methods described in relation to FIG. 5 is endowed with the ability to maintain object/person identity across a series of images. This can be important for video game experiences, in which every user is consistently associated with a unique character over time, as well as for object/person-specific parameter smoothing and the accumulation of information over time.

Person/object re-identification is usually addressed in the literature through two-stage architectures, in which persons/objects are first detected by a person/object detection system and, for every detected person/object, the image is cropped and sent as input to a separate person identification network. The latter provides a high-dimensional embedding that acts like a distinctive person/object signature and is trained to be invariant to nuisance parameters such as camera position, person/object pose, illumination, etc. This strategy has a complexity that scales linearly with the number of persons and, being two-stage, is also hard to implement on mobile devices.

Instead, the neural networks described in relation to FIGS. 1-4 can be extended to accommodate the task of object/person re-identification (REID) using a teacher-student network distillation approach. REID embeddings are extracted from object/human crops using a pre-trained REID network. These REID embeddings are used as a supervisory signal to train a REID branch of the neural network. This branch delivers REID embeddings that mimic those of the teacher network, but may, for example, be fully convolutional, meaning that its running speed is independent of the number of objects.

Combining a network trained in this way with a re-identification and tracking algorithm allows object/person identities to be maintained in scenes with high person overlap, interactions and occlusions, even in cases where a user momentarily disappears from the scene. Examples of such re-identification and tracking algorithms can be found in "Memory based online learning of deep representations from video streams" (F. Pernici et al., CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2324-2334), the contents of which are hereby incorporated by reference. Maintaining object/person identification across multiple images of a scene can allow different virtual objects/skins to be persistently associated with each object/person in the images over time.

A 2D training image 502 is obtained from a training dataset comprising a plurality of 2D images. The 2D images comprise one or more objects 504a, 504b of a given type (e.g., humans); two objects 504a, 504b are shown in the illustrated example, although in general any number of objects may be present. A 3D deformable model is fitted (e.g., iteratively fitted) to the objects in the input image 502 to extract one or more supervisory signals 506 for monocular 3D reconstruction. Additionally, each object 504a, 504b is cropped from the image (e.g., using the ground-truth bounding box for the object) and input individually into a pre-trained REID network 518. The pre-trained REID network 518 processes the crop of each input object 504a, 504b to generate a "teacher" re-identification embedding 520a, 520b (denoted e^T_i for object i in the input image 502) that encodes the identity of the input object 504a, 504b. The teacher re-identification embeddings e^T_i, 520a, 520b are used as supervisory signals for training the re-identification branch of the neural network 508.

The training image 502 is input into the neural network 508 and processed through a series of neural network layers, using the current parameters of the neural network 508, to generate output data 510 comprising candidate 3D representations 514a, 514b of each (prospective) object in the input training image 502 and a "student" re-identification embedding 516a, 516b of each (prospective) object in the input training image 502. The output data 510 is compared with the supervisory signals 506 of the input image 502 and the teacher embeddings 520a, 520b to determine parameter updates 512 for the neural network 508.

Each of the teacher REID embeddings 520a, 520b and the student REID embeddings 516a, 516b is a high-dimensional vector representing the identity of an individual object. For example, each REID embedding may be an N-dimensional vector, where N may be, for example, 256, 512 or 1024.

Where a loss/objective function is used to determine the parameter updates 512, the loss/objective function may include a re-identification loss that compares each teacher REID embedding 520a, 520b with its corresponding student REID embedding 516a, 516b. The re-identification loss may be an L₁ or L₂ loss between each teacher REID embedding 520a, 520b and its corresponding student REID embedding 516a, 516b.
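A minimal sketch of such a distillation term is given below, assuming a squared L₂ penalty averaged over the objects in the image. The function names are hypothetical, and the embeddings are plain Python lists standing in for the teacher outputs 520a, 520b and the student outputs 516a, 516b.

```python
def squared_l2(teacher, student):
    """Squared L2 distance between one teacher and one student embedding."""
    return sum((t - s) ** 2 for t, s in zip(teacher, student))


def reid_distillation_loss(teacher_embeddings, student_embeddings):
    """Average teacher-student distance over corresponding objects."""
    pairs = list(zip(teacher_embeddings, student_embeddings))
    return sum(squared_l2(t, s) for t, s in pairs) / len(pairs)
```

The teacher embeddings are treated as fixed targets here; only the student branch would receive gradients during training.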

FIG. 6 shows a schematic example of a system/apparatus for performing any of the methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems, such as a distributed computing system, may alternatively be used to implement the methods described herein. One or more of these systems/apparatus may be used to perform the methods described herein. For example, a first computing device (e.g., a mobile computing device) may be used to perform the methods described above in relation to FIGS. 1 and 2, and/or a second computing device may be used to perform the methods described above in relation to FIGS. 3 to 5. The model used in the first computational unit may have been trained using the second computational unit.

The apparatus (or system) 600 comprises one or more processors 602. The one or more processors control the operation of the other components of the system/apparatus 600. The one or more processors 602 may, for example, comprise a general-purpose processor. The one or more processors 602 may be a single-core device or a multi-core device. The one or more processors 602 may comprise a central processing unit (CPU) or a graphics processing unit (GPU). Alternatively, the one or more processors 602 may comprise specialized processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included.

The system/apparatus comprises a working or volatile memory 604. The one or more processors may access the volatile memory 604 in order to process data and may control the storage of data in the memory. The volatile memory 604 may comprise RAM of any type, for example static RAM (SRAM) or dynamic RAM (DRAM), or it may comprise flash memory, such as an SD card.

The system/apparatus comprises a non-volatile memory 606. The non-volatile memory 606 stores a set of operating instructions 608, in the form of computer-readable instructions, for controlling the operation of the processors 602. The non-volatile memory 606 may be a memory of any kind, such as a read-only memory (ROM), a flash memory or a magnetic drive memory.

The one or more processors 602 are configured to execute the operating instructions 608 to cause the system/apparatus to perform any of the methods described herein. The operating instructions 608 may comprise code (i.e., drivers) relating to the hardware components of the system/apparatus 600, as well as code relating to the basic operation of the system/apparatus 600. Generally speaking, the one or more processors 602 execute one or more instructions of the operating instructions 608, which are stored permanently or semi-permanently in the non-volatile memory 606, using the volatile memory 604 to temporarily store data generated during execution of the operating instructions 608.

Implementations of the methods described herein may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on, e.g., magnetic discs, optical discs, memory or programmable logic devices) comprising computer-readable instructions that, when executed by a computer such as that described in relation to FIG. 6, cause the computer to perform one or more of the methods described herein.

Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means-plus-function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.

Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.

Although several embodiments have been shown and described, it will be appreciated by those skilled in the art that changes may be made to these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.

The various example embodiments described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, read-only memory (ROM), random access memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing the steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.

In the foregoing specification, embodiments have been described with reference to numerous specific details that can vary from implementation to implementation. Certain adaptations and modifications of the described embodiments can be made. Other embodiments may be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the invention being indicated by the following claims. It is also intended that the sequence of steps shown in the figures is for illustrative purposes only and is not intended to be limited to any particular order of steps.

As such, those skilled in the art will appreciate that these steps can be performed in a different order while implementing the same method.

In the drawings and specification, exemplary embodiments have been disclosed. However, many variations and modifications can be made to these embodiments. Accordingly, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the embodiments being defined by the example embodiments presented herein.

Claims

1. A method for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the method comprising:
receiving a single two-dimensional image;
identifying all objects in the image to be reconstructed and identifying the types of objects;
estimating a three-dimensional representation of each identified object;
estimating a three-dimensional plane that physically supports all of the three-dimensional objects; and
positioning all of the three-dimensional objects in space relative to the support plane.

2. The method of claim 1,
wherein estimating the three-dimensional representation is performed by a deep machine learning model, the deep machine learning model comprising an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

3. The method of claim 2,
wherein the deep machine learning model predicts three-dimensional landmark positions of multiple objects by concatenating feature data from one or more intermediate layers of the neural network, and the predicted three-dimensional positions are estimated simultaneously for the predicted type of the object depicted in each region.

4. The method of claim 1,
wherein estimating the plane supporting the multiple objects is performed for a single frame using the estimated three-dimensional positions of all visible objects to reconstruct a two-dimensional plane that passes through all of the visible objects.

12. The method of claim 11,
wherein estimating a plane supporting the plurality of objects is performed for a series of frames using relative camera pose estimation and plane localization using correspondences between points of successive frames.

6. The method according to any one of claims 1 to 5,
wherein the receiving further comprises receiving a plurality of images, and wherein the steps of estimating the three-dimensional representations of the multiple objects and positioning them are performed for each received image, for example in real time.

7. The method of claim 6,
wherein the processing takes place over multiple consecutive frames by combining hidden layer responses across the consecutive frames, e.g., by averaging them.

8. The method according to any one of claims 4 to 7,
wherein digital graphics objects are synthetically added to the three-dimensional scene reconstruction in a given relationship to the estimated three-dimensional object positions and are then projected back onto the two-dimensional image.

9. A first computational unit for generating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the first computational unit comprising:
a memory; and at least one processor, wherein the at least one processor is configured to perform the method of any one of claims 1 to 8.

10. A computer-readable medium storing a set of instructions,
the set of instructions being executable by at least one processor of a first computational unit to cause the first computational unit to perform the method of any one of claims 1 to 8, so as to generate a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image.

11. A computer program product comprising instructions that, when the program is executed by a computer, cause the computer to perform the method of any one of claims 1 to 8.

12. A method for training a deep machine learning model to generate a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the method comprising:
receiving a single two-dimensional image;
obtaining a training signal for three-dimensional reconstruction by adapting a three-dimensional model of an object to the two-dimensional image; and
using the resulting three-dimensional model fitting results as a supervisory signal for training the deep machine learning model.

13. The method of claim 12,
wherein fitting the three-dimensional model is performed by:
projecting a three-dimensional representation onto a two-dimensional image plane to generate a projected representation;
comparing respective positions of the projected representation with the object in the single two-dimensional image;
measuring an error value based on the comparison; and
adjusting parameters of the fused three-dimensional representation based on the error value, wherein the comparing, measuring and adjusting are repeated iteratively until the measured error value is below a predetermined threshold value or a threshold number of iterations is exceeded.

14. The method of claim 13,
wherein the projecting is performed by taking into account effects of perspective projection and, if multiple views are available, using multiple views of the same object.

15. A second computational unit for generating a three-dimensional reconstruction from a single two-dimensional image to estimate a three-dimensional representation of an object contained in the single two-dimensional image, the second computational unit comprising:
a memory; and at least one processor, wherein the at least one processor is configured to perform the method of any one of claims 12 to 14.

16. A computer-readable medium storing a set of instructions,
the set of instructions being executable by at least one processor of a second computational unit to cause the second computational unit to perform the method of any one of claims 12 to 14, so as to update a fused three-dimensional representation of an object contained in a single two-dimensional image.

17. A computer program product comprising instructions that, when the program is executed by a computer, cause the computer to perform the method of any one of claims 12 to 14.

18. A system for providing a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, the system comprising:
the first computational unit of claim 9; and
the second computational unit of claim 15,
wherein the first computational unit is trained with the results of the second computational unit.