KR20240076248A

KR20240076248A - Apparatus for pose estimation and method thereof

Info

Publication number: KR20240076248A
Application number: KR1020220158586A
Authority: KR
Inventors: 정성훈
Original assignee: 현대자동차주식회사; 기아 주식회사
Priority date: 2022-11-23
Filing date: 2022-11-23
Publication date: 2024-05-30

Abstract

A pose estimation apparatus according to an embodiment of the present invention may include: a first deep neural network that obtains an appearance descriptor from an image acquired using a camera; a second deep neural network that uses the appearance descriptor to obtain body part location data regarding the location of a body part of at least one person included in the image; a third deep neural network that uses the appearance descriptor to obtain image pixel location information; and a fourth deep neural network that obtains a pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

Description

Apparatus for pose estimation and method thereof {APPARATUS FOR POSE ESTIMATION AND METHOD THEREOF}

The present invention relates to a pose estimation apparatus and a method thereof and, more particularly, to a technique for estimating the pose of at least one person included in an image by means of an artificial neural network.

Methods for estimating the pose of a person included in an image are required in various technical fields. For example, for a vehicle to travel safely, the state of any person present in an area adjacent to the vehicle needs to be identified. Accordingly, a method may be required that acquires an image of the adjacent area using a camera included in the vehicle, analyzes the image through various methods, and extracts information about the people in the image.

In particular, when a vehicle drives autonomously without driver intervention, a safe driving environment can be provided by accurately and quickly identifying the presence of a person approaching the vehicle.

For example, methods are being developed that estimate each of at least one body part of a person included in an image (e.g., body part candidate detection), group the body parts to identify the person (e.g., body part grouping and/or association), and then estimate the person's pose. To group the body parts, appearance feature information (e.g., appearance and/or texture features) of the regions estimated to be body parts may be used.

However, conventional techniques that estimate or group body parts using appearance features suffer from low pose estimation accuracy. For example, when the appearance features of multiple people in an image are not clearly distinguishable, body parts of different people may be grouped together as one. In addition, when multiple people overlap in an image, accurate body part estimation and/or pose estimation for each of them may be difficult owing to the nature of an image represented in 2D.

An embodiment of the present invention aims to provide a pose estimation apparatus and a method thereof that extract a descriptor (representation) from an image using at least one deep neural network (e.g., convolutional neural networks (CNN)) and use the extracted appearance descriptor to accurately estimate the body parts and pose of each of a plurality of people.

An embodiment of the present invention aims to provide a pose estimation apparatus and a method thereof that can overcome the limitations of methods relying on the image data alone by using 3D spatial data generated from the separation distance between the pose estimation apparatus (or a camera included in the pose estimation apparatus) and the people included in the image.

An embodiment of the present invention aims to provide a pose estimation apparatus and a method thereof that can overcome the limitations of methods relying on the image data alone by additionally using location information of each person, obtained from height information of a box corresponding to each person included in the image.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

A pose estimation apparatus according to an embodiment of the present invention may include: a first deep neural network that obtains an appearance descriptor (representation) from an image acquired using a camera; a second deep neural network that uses the appearance descriptor to obtain body part location data regarding the location of a body part of at least one person included in the image; a third deep neural network that uses the appearance descriptor to obtain image pixel location information; and a fourth deep neural network that obtains a pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

According to an embodiment, the pose estimation apparatus may be configured to group the pose estimation descriptors obtained from the fourth deep neural network into at least one group by applying a similarity-based grouping algorithm to them, and to output each of the at least one group as a pose estimation result for a respective one of the at least one person.

According to an embodiment, the first deep neural network may be configured to obtain the appearance descriptor by extracting features from the image through a backbone network.

According to an embodiment, the second deep neural network may be configured to generate, using the appearance descriptor, as many probability maps as there are body parts, and to obtain the body part location data using those pixel locations, among the plurality of pixel locations included in the probability maps, whose probability value is greater than or equal to a specified value.

According to an embodiment, the third deep neural network may be configured to obtain the image pixel location information by further using 3D distance information that includes information on the distance from the location of the camera to the at least one person included in the image.

According to an embodiment, the third deep neural network may be configured to obtain the image pixel location information using the 3D distance information, in which pixels corresponding to the body part of the at least one person in the image are displayed in a color determined, from among a plurality of colors, according to the distance from the camera.

According to an embodiment, the third deep neural network may be configured to obtain the image pixel location information by further using height information of a box corresponding to each of the at least one person included in the image.

According to an embodiment, the fourth deep neural network may be configured to output the pose estimation descriptor using input data formed by concatenating at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

According to an embodiment, the body part location data may include at least one piece of pixel information corresponding to the coordinates of each body part of the at least one person included in the image.

According to an embodiment, at least one of the second deep neural network, the third deep neural network, the fourth deep neural network, or any combination thereof may include convolutional neural networks (CNN).

A pose estimation method according to an embodiment of the present invention may include: obtaining, by a first deep neural network, an appearance descriptor (representation) from an image acquired using a camera; obtaining, by a second deep neural network, body part location data regarding the location of a body part of at least one person included in the image using the appearance descriptor; obtaining, by a third deep neural network, image pixel location information using the appearance descriptor; and obtaining, by a fourth deep neural network, a pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

According to an embodiment, the obtaining of the pose estimation descriptor by the fourth deep neural network may include: grouping, by a pose estimation apparatus, the pose estimation descriptors into at least one group by applying a similarity-based grouping algorithm to the pose estimation descriptors; and outputting, by the pose estimation apparatus, each of the at least one group as a pose estimation result for a respective one of the at least one person.

According to an embodiment, the obtaining of the appearance descriptor by the first deep neural network may include obtaining, by the first deep neural network, the appearance descriptor by extracting features from the image through a backbone network.

According to an embodiment, the obtaining of the body part location data by the second deep neural network may include generating, by the second deep neural network, as many probability maps as there are body parts using the appearance descriptor, and obtaining the body part location data using those pixel locations, among the plurality of pixel locations included in the probability maps, whose probability value is greater than or equal to a specified value.

According to an embodiment, the obtaining of the image pixel location information by the third deep neural network may include obtaining, by the third deep neural network, the image pixel location information by further using 3D distance information that includes information on the distance from the location of the camera to the at least one person included in the image.

The effects of the pose estimation apparatus and the method thereof according to the present invention are as follows.

According to at least one of the embodiments of the present invention, a more accurate pose estimation result can be generated by additionally using 3D information (e.g., information on distance and/or location) about each of at least one user in order to estimate the pose of the at least one user included in an image.

In addition, according to at least one of the embodiments of the present invention, when a plurality of users overlap one another in an image, accurate pose estimation can be performed by clearly distinguishing and processing the body parts of different users.

In addition, various effects that can be directly or indirectly identified through this document may be provided.

FIG. 1 is a block diagram showing the components of a pose estimation apparatus according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram showing the operation of a pose estimation apparatus according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram illustrating a process of obtaining image pixel location information according to an embodiment of the present invention.
FIG. 4 shows an example of data obtained or generated by a pose estimation apparatus according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram showing the operation of a pose estimation apparatus according to an embodiment of the present invention.
FIG. 6 is a conceptual diagram showing a method of obtaining image pixel location information according to an embodiment of the present invention.
FIG. 7 shows an example of pose estimation result data generated by a pose estimation apparatus according to an embodiment of the present invention.
FIG. 8 is an operation flowchart of a pose estimation apparatus according to an embodiment of the present invention.
FIG. 9 shows a computing system relating to a pose estimation method according to an embodiment of the present invention.
In the description of the drawings, identical or similar reference numerals may be used for identical or similar components.

Hereinafter, some embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, note that identical components are given the same numerals wherever possible, even when they appear in different drawings. In addition, in describing the embodiments of the present invention, detailed descriptions of related known configurations or functions are omitted where they are judged to hinder understanding of the embodiments.

In describing the components of the embodiments of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. In addition, unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in the present application.

Hereinafter, embodiments of the present invention are described in detail with reference to FIGS. 1 to 9.

FIG. 1 is a block diagram showing the components of a pose estimation apparatus 100 according to an embodiment of the present invention.

According to an embodiment, the pose estimation apparatus 100 may include at least one deep neural network. For example, the at least one deep neural network may include an artificial neural network having a plurality of layers. For example, the pose estimation apparatus 100 may include a first deep neural network 110, a second deep neural network 120, a third deep neural network 130, and a fourth deep neural network 140.

For example, at least some of the deep neural networks (e.g., the second deep neural network 120, the third deep neural network 130, and the fourth deep neural network 140) may include convolutional neural networks (CNN).

The configuration of the pose estimation apparatus 100 shown in FIG. 1 is illustrative, and embodiments of the present invention are not limited thereto.

For example, although the first to fourth deep neural networks 110, 120, 130, and 140 are shown as separate components, at least some of them may be implemented as a single deep neural network.

For example, the pose estimation apparatus 100 may include components not shown in FIG. 1. As one example, the pose estimation apparatus 100 may include a camera that acquires an image (or at least one still image).

According to an embodiment, the first deep neural network 110 may extract at least one appearance descriptor (representation) from an image.

For example, the first deep neural network 110 may obtain the appearance descriptor by extracting features from the image through a backbone network.

For example, when an image acquired using a camera is input to the first deep neural network 110 as input data, the first deep neural network 110 may obtain at least one appearance descriptor corresponding to each of at least one person included in the image, and may output the obtained appearance descriptor as output data.

For example, the appearance descriptor may include data that allows at least one piece of appearance information (e.g., location) of each of the at least one person included in the image to be distinguished and identified.

According to an embodiment, the second deep neural network 120 may obtain body part location data using the appearance descriptor.

For example, the body part location data may include at least one piece of pixel information corresponding to the coordinates of each body part of at least one person included in the image.

For example, the second deep neural network 120 may use the appearance descriptor to obtain body part location data regarding the location of a body part of at least one person included in the image.

For example, when the appearance descriptor is input to the second deep neural network 120 as input data, the second deep neural network 120 may obtain body part location data representing the location of each of at least one body part (e.g., at least one of a head, a hand, an arm, an elbow, or any combination thereof) of at least one person included in the image, and may output the obtained body part location data as output data.

For example, the second deep neural network 120 may generate, using the appearance descriptor, as many probability maps as there are body parts, and may obtain the body part location data using those pixel locations, among the plurality of pixel locations included in the probability maps, whose probability value (e.g., 0 to 1) is greater than or equal to a specified value (e.g., 0.5).

As one example, the second deep neural network 120 may use only the pixel locations (or coordinates), among the plurality of pixel locations included in a probability map, whose probability value is greater than or equal to the specified value.

As one example, the second deep neural network 120 may shade (or blur) the remaining, unused pixel locations among the plurality of pixel locations included in the probability map.
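The thresholding step described above can be illustrated with a minimal sketch. It assumes each probability map is a plain nested list and that shading can be approximated by zeroing the unused pixels; the names `extract_part_positions`, `mask_unused`, and the toy maps are hypothetical and do not come from the source.

```python
from typing import Dict, List, Tuple

ProbMap = List[List[float]]  # one H x W map of probabilities in [0, 1]


def extract_part_positions(
    prob_maps: Dict[str, ProbMap], threshold: float = 0.5
) -> Dict[str, List[Tuple[int, int]]]:
    """Return, per body part, the (row, col) pixels whose probability >= threshold."""
    positions: Dict[str, List[Tuple[int, int]]] = {}
    for part, prob_map in prob_maps.items():
        positions[part] = [
            (r, c)
            for r, row in enumerate(prob_map)
            for c, p in enumerate(row)
            if p >= threshold
        ]
    return positions


def mask_unused(prob_map: ProbMap, threshold: float = 0.5) -> ProbMap:
    """Zero out pixels below the threshold (an assumed stand-in for shading)."""
    return [[p if p >= threshold else 0.0 for p in row] for row in prob_map]


# Toy input: one 2 x 2 probability map per body part (illustrative values only).
toy_maps = {
    "head": [[0.1, 0.9], [0.2, 0.3]],
    "elbow": [[0.0, 0.4], [0.5, 0.7]],
}
part_positions = extract_part_positions(toy_maps)
```

One map per body part keeps the candidates for different part types separate, so a later grouping stage only has to decide which person each candidate belongs to.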

According to an embodiment, the third deep neural network 130 may obtain image pixel location information using the appearance descriptor.

For example, when at least one of the appearance descriptor, 3D distance information, or any combination thereof is input to the third deep neural network 130 as input data, the third deep neural network 130 may obtain image pixel location information that includes 3D location information of at least one person included in the image, and may output the obtained image pixel location information as output data.

For example, the third deep neural network 130 may obtain the image pixel location information by further using 3D distance information that includes information on the distance from the location of the camera to at least one person included in the image.

As one example, the 3D distance information may include information in which a region of pixels where a body part is present in the image is displayed in a designated color.

As one example, the third deep neural network 130 may obtain the image pixel location information by further using 3D distance information in which pixels corresponding to a body part of at least one person in the image are displayed in a color determined, from among a plurality of colors, according to the distance from the camera.
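A color-coded distance map of the kind described above might look like the following sketch. The source does not specify the colors or the distance ranges, so the palette, the bin edges in meters, and the names `distance_to_color` and `encode_distance_map` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Color = Tuple[int, int, int]  # (R, G, B)

# Hypothetical palette: each entry is (upper distance bound in meters, color).
PALETTE: List[Tuple[float, Color]] = [
    (5.0, (255, 0, 0)),           # closer than 5 m -> red
    (15.0, (255, 255, 0)),        # 5 m to under 15 m -> yellow
    (float("inf"), (0, 0, 255)),  # 15 m or farther -> blue
]
BACKGROUND: Color = (0, 0, 0)     # pixels that contain no body part


def distance_to_color(distance_m: float) -> Color:
    """Pick the color of the first bin whose upper bound exceeds the distance."""
    for upper, color in PALETTE:
        if distance_m < upper:
            return color
    return PALETTE[-1][1]


def encode_distance_map(distances: List[List[Optional[float]]]) -> List[List[Color]]:
    """distances[r][c] is the camera distance of a body-part pixel, or None for background."""
    return [
        [BACKGROUND if d is None else distance_to_color(d) for d in row]
        for row in distances
    ]


# Toy 2 x 2 distance map (illustrative values only).
toy_distances = [[None, 3.2], [12.0, 40.0]]
color_map = encode_distance_map(toy_distances)
```

Encoding distance as color lets the 3D cue share the image's pixel grid, so it can be aligned pixel by pixel with the other inputs.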

For example, the third deep neural network 130 may obtain the image pixel location information by further using 3D distance information that includes height information of a box corresponding to each of at least one person included in the image.

As one example, the 3D distance information may include height information of a box surrounding a person included in the image.

As one example, the third deep neural network 130 may identify a box (e.g., a bounding box) surrounding a person included in the image and, based on the height information of the identified box, identify distance information of the person corresponding to the box.
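The source does not state how box height is converted into distance; in the described apparatus this relationship may well be learned by the network. As a rough intuition only, the sketch below uses the pinhole camera relation, which is an assumption introduced here. The function name, the focal length, and the assumed person height are all hypothetical.

```python
def distance_from_box_height(
    box_height_px: float,
    focal_length_px: float = 1000.0,       # hypothetical focal length in pixels
    assumed_person_height_m: float = 1.7,  # hypothetical standing height
) -> float:
    """Approximate camera-to-person distance in meters from a bounding-box height.

    Uses the pinhole relation distance = focal_length * real_height / pixel_height.
    This is an illustrative assumption, not the method stated in the source.
    """
    if box_height_px <= 0:
        raise ValueError("box_height_px must be positive")
    return focal_length_px * assumed_person_height_m / box_height_px


# A taller box means the person is closer to the camera.
near = distance_from_box_height(340.0)
far = distance_from_box_height(85.0)
```

Under these assumed parameters, a box four times taller corresponds to a person four times closer, which is why box height can serve as a distance cue when no depth sensor is available.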

According to an embodiment, the fourth deep neural network 140 may obtain a pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

For example, when data formed by concatenating at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof is input to the fourth deep neural network 140 as input data, the fourth deep neural network 140 may obtain a pose estimation descriptor and may output the obtained pose estimation descriptor as output data.
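The concatenation of the three inputs can be sketched as channel-wise stacking of feature maps that share the same height and width. The source does not state the tensor layout, so the H x W x C nested-list layout, the function name `concat_channels`, and the toy values are assumptions.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # H x W x C


def concat_channels(*maps: FeatureMap) -> FeatureMap:
    """Concatenate feature maps of equal height and width along the channel axis."""
    if not maps:
        raise ValueError("at least one feature map is required")
    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all feature maps must share the same H x W")
    return [
        [
            [value for m in maps for value in m[r][c]]
            for c in range(width)
        ]
        for r in range(height)
    ]


# Toy 1 x 2 inputs with 2, 1, and 3 channels (illustrative values only).
appearance = [[[0.1, 0.2], [0.3, 0.4]]]
part_locations = [[[1.0], [0.0]]]
pixel_locations = [[[5.0, 1.0, 0.0], [9.0, 2.0, 0.0]]]
fused = concat_channels(appearance, part_locations, pixel_locations)
```

After concatenation each pixel carries appearance, body part, and 3D location cues side by side, so a network consuming `fused` can weigh all three when producing the pose estimation descriptor.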

According to an embodiment, the pose estimation apparatus 100 may group the pose estimation descriptors obtained from the fourth deep neural network 140 into at least one group by applying a grouping algorithm to them.

For example, the pose estimation apparatus 100 may generate at least one group by applying a similarity-based grouping algorithm to at least one pose estimation descriptor.

According to an embodiment, the pose estimation apparatus 100 may output each of the at least one group as a pose estimation result for a respective one of at least one person.

For example, the pose estimation apparatus 100 may output a first group included in the at least one group as the pose estimation result of a first person included in the at least one person.
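The source names a similarity-based grouping algorithm without specifying it. One simple possibility, shown here purely as an assumption, is greedy grouping by cosine similarity: each descriptor joins the first group whose first member is similar enough, otherwise it starts a new group. The function names and the 0.9 threshold are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_descriptors(
    descriptors: List[Sequence[float]], min_similarity: float = 0.9
) -> List[List[int]]:
    """Greedily group descriptor indices; each group stands for one person."""
    groups: List[List[int]] = []
    for idx, desc in enumerate(descriptors):
        for group in groups:
            # Compare against the group's first member as its representative.
            if cosine_similarity(desc, descriptors[group[0]]) >= min_similarity:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


# Toy descriptors: indices 0 and 2 point one way, 1 and 3 another.
toy_descriptors = [(1.0, 0.0), (0.0, 1.0), (0.98, 0.05), (0.1, 0.95)]
groups = group_descriptors(toy_descriptors)
```

Each resulting group collects the body part descriptors attributed to one person, matching the idea of outputting one group per person as that person's pose estimation result.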

For example, the pose estimation apparatus 100 may output pose estimation results separated per pedestrian by using a training label relating to a person's pose (e.g., the pedestrian pose 253 of FIG. 2).

FIG. 2 is a conceptual diagram illustrating the operation of a pose estimation device according to an embodiment of the present invention.

Among the components of the pose estimation device 100 shown in FIG. 2, descriptions that would duplicate those of the same components shown in FIG. 1 are replaced by the description of FIG. 1 given above.

According to an embodiment, the pose estimation device 100 may acquire an input image 205 using a camera and input the acquired input image 205 to a descriptor extraction unit 210 (e.g., the first deep neural network 110 of FIG. 1).

According to an embodiment, the descriptor extraction unit 210 may extract an appearance descriptor 215 from the input image 205. The appearance descriptor 215 may include features representing the appearance of each of at least one pedestrian (or person) included in the input image 205.

According to an embodiment, the pose estimation device 100 may pass the appearance descriptor 215 obtained (or output) by the descriptor extraction unit 210 to a body part location estimation unit 220 (e.g., the second deep neural network 120 of FIG. 1), a 3D spatial location estimation unit 230 (e.g., the third deep neural network 130 of FIG. 1), and/or a pose descriptor extraction unit 240 (e.g., the fourth deep neural network 140 of FIG. 1).

According to an embodiment, the body part location estimation unit 220 may use the appearance descriptor 215 to generate in-image body part locations 225 (or body part location data) regarding the locations of body parts of at least one person included in the input image 205, and may output them to the pose descriptor extraction unit 240.

According to an embodiment, the 3D spatial location estimation unit 230 may output image pixel location information 235, generated using the appearance descriptor 215, to the pose descriptor extraction unit 240. For example, the 3D spatial location estimation unit 230 may additionally use pedestrian 3D location information 237 (or 3D distance information) to generate the image pixel location information 235.

For example, the 3D location information 237 may include information in which a region of pixels where a body part is present in the image is marked in a designated color.

As an example, the 3D spatial location estimation unit 230 may obtain the image pixel location information 235 by additionally using 3D location information 237 in which pixels corresponding to a body part of at least one person in the image are marked in a color selected from a plurality of colors according to the distance from the camera.

As an example, the 3D location information 237 may include height information of a box surrounding a person included in the image.

For example, the 3D spatial location estimation unit 230 may obtain the image pixel location information 235 by additionally using 3D location information 237 that includes height information of a box corresponding to each of at least one person included in the image.

According to an embodiment, the pose descriptor extraction unit 240 may output a pose estimation descriptor 245 using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

For example, the pose estimation descriptor 245 may include processed data estimated in order to represent the pose of at least one person included in the input image 205.

For example, the pose descriptor extraction unit 240 may output the pose estimation descriptor 245 using input data formed by concatenating at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.
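As a rough illustration of the concatenation mentioned above, the following sketch joins several per-pixel feature maps along the channel axis. The description does not specify tensor shapes or the axis of concatenation, so the height x width x channels nested-list layout and the name `concat_channels` are assumptions.

```python
def concat_channels(*feature_maps):
    """Concatenate per-pixel feature vectors along the channel axis.

    Each feature map is assumed to be a nested list of shape
    height x width x channels; all maps must share height and width.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    for feature_map in feature_maps:
        if len(feature_map) != height or any(len(row) != width for row in feature_map):
            raise ValueError("all feature maps must share the same height and width")
    return [
        [
            [value for feature_map in feature_maps for value in feature_map[y][x]]
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy 1 x 2 maps: a 2-channel appearance descriptor, a 1-channel body part
# map, and a 1-channel pixel location map (all values are made up).
appearance = [[[0.1, 0.2], [0.3, 0.4]]]
body_parts = [[[1.0], [0.0]]]
pixel_locations = [[[5.0], [7.0]]]
fused_input = concat_channels(appearance, body_parts, pixel_locations)
```

Because the function is variadic, any subset of the three inputs can be passed, which mirrors the "at least one of ... or any combination thereof" wording.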

According to an embodiment, the pose estimation device may group the pose estimation descriptors 245 using a similarity-based grouping algorithm 250.

For example, the pose estimation device may apply the similarity-based grouping algorithm 250 to at least one pose estimation descriptor 245 to group the at least one pose estimation descriptor 245 into at least one group, and may output each resulting group as a per-pedestrian pose estimation result 255.

For example, the pose estimation device may additionally use a pre-stored (or externally received) pedestrian pose label 253 to output the per-pedestrian pose estimation result 255. As an example, the pedestrian pose label 253 may be training data (or a training label) prepared for learning the operation of generating a pose estimation result based on at least one descriptor.

FIG. 3 is a conceptual diagram illustrating a process of obtaining image pixel location information according to an embodiment of the present invention.

According to an embodiment, a pose estimation device (e.g., the pose estimation device 100 of FIG. 1) may identify location information of at least one person in three-dimensional space.

For example, the pose estimation device may be located at a specific point 350 and may acquire an image including at least one person using a camera. As an example, a plurality of people may be within the camera's angle of view (A).

For example, the pose estimation device may obtain distance information from the specific point 350 to each of the at least one person.

For example, the pose estimation device may identify 3D distance information that includes the obtained distance information. As an example, the pose estimation device may identify 3D distance information in which pixels corresponding to a body part of at least one person in the image are marked in a designated color selected from a plurality of colors according to the distance information. A specific example of the 3D distance information is given in the description of FIG. 6 below.

FIG. 4 shows an example of data acquired or generated by a pose estimation device according to an embodiment of the present invention.

According to an embodiment, a pose estimation device (e.g., the pose estimation device 100 of FIG. 1) may acquire an image 410 (e.g., the input image 205 of FIG. 2) using a camera.

For example, referring to FIG. 4, the image 410 may include two people. The pose estimation device may acquire the image 410 including the two people using a camera and input the acquired image 410 to a first deep neural network (e.g., the first deep neural network 110 of FIG. 1).

According to an embodiment, the pose estimation device may obtain an appearance descriptor 420 using the first deep neural network.

For example, the pose estimation device may feed the image 410 to the first deep neural network as input data and obtain, as output data, the appearance descriptor 420 generated by the computation of the first deep neural network.

For example, the pose estimation device may input the appearance descriptor 420 to a second deep neural network (e.g., the second deep neural network 120 of FIG. 1), a third deep neural network (e.g., the third deep neural network 130 of FIG. 1), and a fourth deep neural network (e.g., the fourth deep neural network 140 of FIG. 1).

According to an embodiment, the pose estimation device may obtain body part location data 430 using the second deep neural network.

For example, the second deep neural network may use the appearance descriptor 420 to output body part location data 430 containing information about the locations of the body parts of the two people included in the image 410.

According to an embodiment, the pose estimation device may obtain image pixel location information 440 using the third deep neural network.

For example, the third deep neural network may obtain the image pixel location information 440 using the appearance descriptor 420.

For example, the third deep neural network may obtain the image pixel location information 440 by additionally using 3D distance information.

As an example, the 3D distance information may include information in which a region of pixels where a body part is present in the image is marked in a designated color.

According to an embodiment, the pose estimation device may output pose estimation descriptors using the fourth deep neural network, group the at least one output pose estimation descriptor, and output each of the at least one group as a pose estimation result 450 of a respective one of the at least one person.

For example, the fourth deep neural network may output the pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

For example, the pose estimation device may generate at least one group by applying a similarity-based grouping algorithm to the pose estimation descriptors.

For example, the pose estimation device may output a first group among the at least one group as the pose estimation result of the first of the two people included in the image 410.

For example, the pose estimation device may output a second group among the at least one group as the pose estimation result of the second of the two people included in the image 410.

FIG. 5 is a conceptual diagram illustrating the operation of a pose estimation device according to an embodiment of the present invention.

Among the components of the pose estimation device shown in FIG. 5 and the data generated by the pose estimation device, descriptions that would duplicate those of the same components shown in FIG. 2 are replaced by the description of FIG. 2 given above.

According to an embodiment, a pose estimation device (e.g., the pose estimation device 100 of FIG. 1) may acquire an input image 505 (e.g., the input image 205 of FIG. 2) using a camera.

According to an embodiment, the pose estimation device may input the input image 505 to N1 510 (e.g., the descriptor extraction unit 210 of FIG. 2) and obtain the appearance descriptor 515 output from N1 510.

According to an embodiment, the pose estimation device may input the appearance descriptor 515 to N2 520 (e.g., the body part location estimation unit 220 of FIG. 2), N3 530 (e.g., the 3D spatial location estimation unit 230 of FIG. 2), and N4 540 (e.g., the pose descriptor extraction unit 240 of FIG. 2).

According to an embodiment, N2 520 may output body part locations 525 (e.g., the in-image body part locations 225 of FIG. 2) using the appearance descriptor 515.

According to an embodiment, N3 530 may output 3D location information 535 (e.g., the image pixel location information 235 of FIG. 2) using the appearance descriptor 515.

According to an embodiment, the pose estimation device may input, to N4 540, input data 527 formed by concatenating at least one of the appearance descriptor 515, the body part locations 525, the 3D location information 535, or any combination thereof.

According to an embodiment, N4 540 may output a pose estimation descriptor 545 (e.g., the pose estimation descriptor 245 of FIG. 2) using the input data 527.

According to an embodiment, the pose estimation device may output a pose estimation result 555 (e.g., the per-pedestrian pose estimation result 255 of FIG. 2) using the pose estimation descriptor 545.
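The data flow of FIG. 5 can be summarized in a short sketch. The networks N1 to N4 and the grouping step are passed in as plain callables; the function name `estimate_poses`, the flat-list data layout, and the toy stand-in callables are hypothetical and only show how the outputs are wired together, not how the actual networks behave.

```python
def estimate_poses(image, n1, n2, n3, n4, group):
    """Wire the four networks together in the order shown in FIG. 5."""
    appearance = n1(image)               # appearance descriptor 515
    part_locations = n2(appearance)      # body part locations 525
    pixel_locations = n3(appearance)     # 3D location information 535
    # Input data 527: modelled here as a simple list concatenation (assumption).
    fused = list(appearance) + list(part_locations) + list(pixel_locations)
    descriptors = n4(fused)              # pose estimation descriptors 545
    return group(descriptors)            # pose estimation result 555


# Toy stand-ins so the sketch runs end to end; real networks would be trained models.
toy_result = estimate_poses(
    image=[1, 2],
    n1=lambda img: [2 * value for value in img],
    n2=lambda app: [max(app)],
    n3=lambda app: [min(app)],
    n4=lambda fused: [fused],
    group=lambda descs: [descs],
)
```

Note that N2 and N3 depend only on the output of N1, so in this arrangement they could be evaluated independently of each other.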

FIG. 6 is a conceptual diagram illustrating a method of obtaining image pixel location information according to an embodiment of the present invention.

According to an embodiment, a pose estimation device (e.g., the pose estimation device 100 of FIG. 1) may identify 3D distance information for each of at least one person included in an image.

Referring to reference numeral 610, according to an embodiment, the pose estimation device may identify 3D distance information in which the distance from the location of the camera that acquired the image to each of at least one person included in the image is indicated by different colors.

For example, the pose estimation device may obtain the image pixel location information using 3D distance information in which pixels corresponding to a body part of at least one person in the image are marked in a color selected from a plurality of colors according to the distance from the camera.

For example, the pose estimation device may store a reference table 615 that assigns one of the plurality of colors according to the separation distance.

As an example, based on the reference table 615, the pose estimation device may generate 3D distance information in which body parts of a person located relatively far from the camera are marked in red, body parts of a person located relatively close to the camera are marked in blue, and body parts of a person located at a distance in between are marked in green.
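A reference table of this kind could be sketched as a small lookup. The near/blue, middle/green, far/red ordering follows the example above, but the distance thresholds (10 m and 30 m), the metre unit, and the names `REFERENCE_TABLE` and `color_for_distance` are assumptions, since the description gives no numeric values.

```python
# (exclusive upper bound in metres, colour) pairs, ordered from near to far.
# The thresholds are hypothetical; the description does not state them.
REFERENCE_TABLE = [(10.0, "blue"), (30.0, "green")]
FAR_COLOR = "red"  # anything at or beyond the last bound


def color_for_distance(distance_m):
    """Return the colour assigned to a body part at the given camera distance."""
    for upper_bound, color in REFERENCE_TABLE:
        if distance_m < upper_bound:
            return color
    return FAR_COLOR


def colorize(distances_by_pixel):
    """Map {(row, col): distance} to {(row, col): colour} for body part pixels."""
    return {pixel: color_for_distance(d) for pixel, d in distances_by_pixel.items()}


toy_colors = colorize({(0, 0): 4.0, (0, 1): 18.5, (1, 1): 42.0})
```

Only pixels that belong to a body part are passed in, which matches the description that the colour marking applies to pixels corresponding to a person's body parts.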

Referring to reference numeral 620, according to an embodiment, the pose estimation device may obtain the image pixel location information using 3D distance information from which the distance to at least one person can be identified based on height information of a box corresponding to each of the at least one person included in the image.

For example, the pose estimation device may identify height information of a first box 622 corresponding to a first person 621 included in the image, identify distance information from the pose estimation device (or the camera) to the first person 621 based on the height information, and generate 3D distance information including the identified distance information.
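The description does not say how the box height is converted to a distance. One common approach, shown here only as an assumption, is the pinhole camera relation in which distance is proportional to the real height of the object divided by its height in pixels. The focal length, the assumed real person height of 1.7 m, and the function name are all hypothetical.

```python
def distance_from_box_height(box_height_px, focal_length_px, person_height_m=1.7):
    """Estimate camera-to-person distance with the pinhole model:

        distance = focal_length_px * person_height_m / box_height_px

    `person_height_m` is an assumed average height, not a value from the source.
    """
    if box_height_px <= 0:
        raise ValueError("box height must be positive")
    return focal_length_px * person_height_m / box_height_px


# A taller box means the person is closer to the camera.
near = distance_from_box_height(box_height_px=340, focal_length_px=1000)
far = distance_from_box_height(box_height_px=170, focal_length_px=1000)
```

Under this model a box half as tall corresponds to a person twice as far away, which is the qualitative behaviour the box-height cue relies on.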

FIG. 7 shows an example of pose estimation result data generated by a pose estimation device according to an embodiment of the present invention.

Referring to reference numeral 700, according to an embodiment, the pose estimation device may generate a pose estimation result for each of at least one person included in the image based on the grouping of the pose estimation descriptors.

According to an embodiment, the pose estimation device may apply a similarity-based grouping algorithm to the pose estimation descriptors to generate groups each including at least one pose estimation descriptor.

For example, the pose estimation device may generate a first group 710 corresponding to a first person at the lower left of the image. As an example, the first group 710 may be a set of lines connecting at least some of the body parts of the first person.

For example, the pose estimation device may generate a second group 720 corresponding to a second person next to the first person at the lower left of the image. As an example, the second group 720 may be a set of lines connecting at least some of the body parts of the second person.

For example, the pose estimation device may generate a third group 730 corresponding to a third person at the lower right of the image. As an example, the third group 730 may be a set of lines connecting at least some of the body parts of the third person.

For example, the pose estimation device may generate a fourth group 740 corresponding to a fourth person next to the third person at the lower right of the image. As an example, the fourth group 740 may be a set of lines connecting at least some of the body parts of the fourth person.

As shown at reference numeral 700, the pose estimation device can generate an accurate pose estimation result for each person even when a plurality of people overlap one another in the image.

FIG. 8 is a flowchart of the operation of a pose estimation device according to an embodiment of the present invention.

According to an embodiment, a pose estimation device (e.g., the pose estimation device 100 of FIG. 1) may perform the operations shown in FIG. 8. For example, at least some of the components included in the pose estimation device (e.g., the first deep neural network 110, the second deep neural network 120, the third deep neural network 130, and/or the fourth deep neural network 140 of FIG. 1) may be configured to perform the operations of FIG. 8.

In the following embodiment, operations S810 to S840 may be performed sequentially, but need not be. For example, the order of the operations may be changed, and at least two operations may be performed in parallel. In addition, content relating to FIG. 8 that corresponds to or duplicates what has been described above may be described briefly or omitted.

According to an embodiment, the first deep neural network may obtain an appearance descriptor (representation) from an image acquired using a camera (S810).

For example, the appearance descriptor may include data that makes at least one item of appearance information (e.g., location) of each of at least one person included in the image separately identifiable.

For example, the first deep neural network may obtain the appearance descriptor by extracting features from the image through a backbone network.

According to an embodiment, the second deep neural network may use the appearance descriptor to obtain body part location data regarding the location of a body part of at least one person included in the image (S820).

For example, the second deep neural network may use the appearance descriptor to generate as many probability maps as there are body parts, and may obtain the body part location data using those pixel locations, among the plurality of pixel locations in each probability map, whose probability value is greater than or equal to a specified value.
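The thresholding of probability maps described above can be sketched as follows. One map per body part is assumed, stored as a 2-D nested list of probabilities; the part names, the threshold of 0.5, and the function name `body_part_locations` are illustrative assumptions.

```python
def body_part_locations(probability_maps, threshold=0.5):
    """Return, for each body part, the (row, col) pixel locations whose
    probability is greater than or equal to `threshold`.

    `probability_maps` maps a body part name to a 2-D list of probabilities.
    """
    return {
        part: [
            (row_index, col_index)
            for row_index, row in enumerate(probability_map)
            for col_index, probability in enumerate(row)
            if probability >= threshold
        ]
        for part, probability_map in probability_maps.items()
    }


# Two toy 2 x 3 probability maps, one per body part (values are made up).
toy_maps = {
    "head": [[0.1, 0.9, 0.2], [0.0, 0.3, 0.1]],
    "left_knee": [[0.0, 0.1, 0.0], [0.6, 0.2, 0.8]],
}
toy_locations = body_part_locations(toy_maps)
```

With several people in the image, a single body part map can yield several locations, as the "left_knee" map does here; assigning those locations to individual people is left to the later grouping step.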

According to an embodiment, the third deep neural network may obtain image pixel location information using the appearance descriptor (S830).

For example, the third deep neural network may obtain the image pixel location information by additionally using 3D distance information that includes distance information from the location of the camera to at least one person included in the image. As an example, the 3D distance information may include information in which pixels corresponding to a body part of at least one person in the image are marked in a color selected from a plurality of colors according to the distance from the camera.

For example, the third deep neural network may obtain the image pixel location information by additionally using height information of a box corresponding to each of at least one person included in the image.

According to an embodiment, the fourth deep neural network may obtain a pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof (S840).

For example, the fourth deep neural network may output the pose estimation descriptor using input data formed by concatenating at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

According to an embodiment, the pose estimation device may apply a similarity-based grouping algorithm to the pose estimation descriptors to group them into at least one group, and may output each of the at least one group as the pose estimation result of a respective one of the at least one person.

FIG. 9 shows a computing system for a pose estimation method according to an embodiment of the present invention.

Referring to FIG. 9, a computing system 1000 for the pose estimation method may include at least one processor 1100, a memory 1300, a user interface input device 1400, a user interface output device 1500, a storage 1600, and a network interface 1700, connected to one another through a bus 1200.

The processor 1100 may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in the memory 1300 and/or the storage 1600. The memory 1300 and the storage 1600 may include various types of volatile or non-volatile storage media. For example, the memory 1300 may include read only memory (ROM) and random access memory (RAM).

Accordingly, the steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by the processor 1100, or in a combination of the two. The software module may reside in a storage medium (i.e., the memory 1300 and/or the storage 1600) such as RAM, flash memory, ROM, EPROM, EEPROM, a register, a hard disk, a removable disk, or a CD-ROM.

An exemplary storage medium is coupled to the processor 1100, and the processor 1100 can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral with the processor 1100. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside in a user terminal as separate components.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

Claims

A pose estimation device comprising:
a first deep neural network that obtains an appearance descriptor from an image acquired using a camera;
a second deep neural network that obtains, using the appearance descriptor, body part location data regarding the location of a body part of at least one person included in the image;
a third deep neural network that obtains image pixel location information using the appearance descriptor; and
a fourth deep neural network that obtains a pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

The posture estimation device of claim 1, wherein the posture estimation device is configured to:
group the pose estimation descriptors obtained from the fourth deep neural network into at least one group by applying a similarity-based grouping algorithm to the pose estimation descriptors; and
output each of the at least one group as a posture estimation result for a respective one of the at least one person.
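The claim does not name a particular similarity-based grouping algorithm. As one possible illustration, the sketch below uses a greedy scheme based on Euclidean distance to a running group centroid; the function name `group_by_similarity` and the parameter `max_distance` are assumptions introduced here, not terms from the source.

```python
import math


def group_by_similarity(descriptors, max_distance):
    """Greedy grouping (an assumed stand-in for the claimed algorithm).

    A descriptor joins the first group whose centroid lies within
    max_distance (Euclidean); otherwise it starts a new group.
    Returns a list of groups, each a list of descriptor indices.
    """
    groups = []     # indices of the descriptors in each group
    centroids = []  # running mean of each group's descriptors
    for idx, desc in enumerate(descriptors):
        for g, centroid in enumerate(centroids):
            if math.dist(desc, centroid) <= max_distance:
                groups[g].append(idx)
                n = len(groups[g])
                # Incremental mean update of the centroid.
                centroids[g] = [c + (d - c) / n for c, d in zip(centroid, desc)]
                break
        else:
            groups.append([idx])
            centroids.append(list(desc))
    return groups


descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.1)]
groups = group_by_similarity(descriptors, max_distance=1.0)
# groups == [[0, 1, 4], [2, 3]]: each group would be reported as one person's posture
```

Any clustering method that groups descriptors by similarity could take the place of this greedy scheme; it was chosen only because it fits in a few lines.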

The posture estimation device of claim 1, wherein the first deep neural network is configured to obtain the appearance descriptor by extracting features from the image through a backbone network.

The posture estimation device of claim 1, wherein the second deep neural network is configured to generate a probability map corresponding to the number of body parts using the appearance descriptor, and to obtain the body part location data using pixel positions whose probability value is greater than or equal to a specified value among a plurality of pixel positions included in the probability map.
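The thresholding step in the claim above can be illustrated with a short sketch. It assumes one two-dimensional probability map per body part, stored as nested lists; the function name `locate_body_parts` and the example values are hypothetical.

```python
def locate_body_parts(probability_maps, threshold):
    """Return, for each body part, the (row, col) pixel positions whose
    probability value is greater than or equal to the threshold."""
    locations = []
    for part_map in probability_maps:  # assumed: one 2D map per body part
        hits = [
            (r, c)
            for r, row in enumerate(part_map)
            for c, prob in enumerate(row)
            if prob >= threshold
        ]
        locations.append(hits)
    return locations


# Two body parts on a 2x2 image (illustrative values only).
maps = [
    [[0.1, 0.9], [0.2, 0.3]],
    [[0.0, 0.1], [0.7, 0.6]],
]
body_part_locations = locate_body_parts(maps, threshold=0.6)
# body_part_locations == [[(0, 1)], [(1, 0), (1, 1)]]
```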

The posture estimation device of claim 1, wherein the third deep neural network is configured to obtain the image pixel location information by further using three-dimensional distance information that includes distance information from the location of the camera to the at least one person included in the image.

The posture estimation device of claim 1, wherein the third deep neural network is configured to obtain the image pixel location information using the three-dimensional distance information in which each pixel corresponding to a body part of the at least one person in the image is displayed in a color, among a plurality of colors, determined according to the distance information from the camera.
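The claim above does not state how a distance is mapped to one of the plurality of colors. One simple possibility, shown here purely as an assumption, is uniform binning of the distance over a fixed range; the function name `distance_to_color`, the use of meters, the 30 m range, and the three-color palette are all hypothetical.

```python
def distance_to_color(distance_m, palette, max_distance_m):
    """Pick one palette color for a distance by uniform binning over
    [0, max_distance_m]; distances outside the range are clamped."""
    if not palette:
        raise ValueError("palette must contain at least one color")
    clamped = min(max(distance_m, 0.0), max_distance_m)
    index = int(clamped / max_distance_m * len(palette))
    index = min(index, len(palette) - 1)  # the far end maps to the last color
    return palette[index]


# Hypothetical palette: near = red, middle = yellow, far = blue (RGB tuples).
palette = [(255, 0, 0), (255, 255, 0), (0, 0, 255)]
near = distance_to_color(5.0, palette, max_distance_m=30.0)
middle = distance_to_color(15.0, palette, max_distance_m=30.0)
far = distance_to_color(45.0, palette, max_distance_m=30.0)
```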

The posture estimation device of claim 1, wherein the third deep neural network is configured to obtain the image pixel location information by further using height information of a box corresponding to each of the at least one person included in the image.

The posture estimation device of claim 1, wherein the fourth deep neural network is configured to output the pose estimation descriptor using input data obtained by concatenating at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.
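Concatenation of the inputs to the fourth network can be sketched as joining per-pixel feature vectors end to end. The layout assumed here (one feature vector per pixel from each source) and the name `concatenate_inputs` are illustrative assumptions; the source does not specify tensor shapes.

```python
def concatenate_inputs(*feature_lists):
    """Concatenate per-pixel feature vectors from several sources.

    Each argument holds one feature vector per pixel, and all arguments
    must cover the same number of pixels.
    """
    lengths = {len(features) for features in feature_lists}
    if len(lengths) != 1:
        raise ValueError("all inputs must cover the same number of pixels")
    return [
        [value for vector in per_pixel for value in vector]
        for per_pixel in zip(*feature_lists)
    ]


# Two pixels, three sources (illustrative values only).
appearance = [[0.2, 0.4], [0.6, 0.8]]  # appearance descriptor channels
part_scores = [[1.0], [0.0]]           # body part location data
pixel_xy = [[0, 0], [1, 0]]            # image pixel location information
fourth_net_input = concatenate_inputs(appearance, part_scores, pixel_xy)
# fourth_net_input == [[0.2, 0.4, 1.0, 0, 0], [0.6, 0.8, 0.0, 1, 0]]
```

Because the claim allows any one of the three inputs or any combination of them, the function accepts a variable number of sources.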

The posture estimation device of claim 1, wherein the body part location data includes at least one piece of pixel information corresponding to coordinates of each body part of the at least one person included in the image.

The posture estimation device of claim 1, wherein at least one of the second deep neural network, the third deep neural network, the fourth deep neural network, or any combination thereof includes a convolutional neural network (CNN).

A posture estimation method comprising:
obtaining, by a first deep neural network, an appearance descriptor from an image acquired using a camera;
obtaining, by a second deep neural network, using the appearance descriptor, body part location data regarding the location of a body part of at least one person included in the image;
obtaining, by a third deep neural network, image pixel location information using the appearance descriptor; and
obtaining, by a fourth deep neural network, a pose estimation descriptor using at least one of the appearance descriptor, the body part location data, the image pixel location information, or any combination thereof.

The posture estimation method of claim 11, wherein obtaining the pose estimation descriptor by the fourth deep neural network includes:
grouping, by a posture estimation device, the pose estimation descriptors into at least one group by applying a similarity-based grouping algorithm to the pose estimation descriptors; and
outputting, by the posture estimation device, each of the at least one group as a posture estimation result for a respective one of the at least one person.

The posture estimation method of claim 11, wherein obtaining the appearance descriptor by the first deep neural network includes obtaining, by the first deep neural network, the appearance descriptor by extracting features from the image through a backbone network.

The posture estimation method of claim 11, wherein obtaining the body part location data by the second deep neural network includes generating, by the second deep neural network, a probability map corresponding to the number of body parts using the appearance descriptor, and obtaining the body part location data using pixel positions whose probability value is greater than a specified value among a plurality of pixel positions included in the probability map.

The posture estimation method of claim 11, wherein obtaining the image pixel location information by the third deep neural network includes obtaining, by the third deep neural network, the image pixel location information by further using three-dimensional distance information that includes distance information from the location of the camera to the at least one person included in the image.