KR20230080804A - Apparatus and method for estimating human pose based AI

Info

Publication number: KR20230080804A
Application number: KR1020210168289A
Authority: KR
Inventors: 신사임; 김정호; 김보은; 김충일
Original assignee: 한국전자기술연구원
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2023-06-07

Abstract

An apparatus and method for AI-based human pose estimation are provided. An extraction unit extracts a region containing a person (hereinafter referred to as an 'original person image') from an original image. A scaling unit scales the original person image extracted by the extraction unit to different resolutions, generating multiple target person images. A pose estimation unit estimates the human pose by inputting the target person images generated by the scaling unit into different networks of an artificial intelligence model trained in advance to estimate human pose.

Description

Apparatus and method for estimating human pose based AI

The present invention relates to an artificial intelligence-based human posture estimation apparatus and method, and more particularly, to an apparatus and method capable of estimating a human posture by inputting multi-resolution images into an artificial intelligence model.

2D human pose estimation has received considerable attention in the computer vision field, and its performance has improved greatly with the recent introduction of artificial intelligence-based deep learning. Human posture estimation is used in various fields such as surveillance systems and autonomous driving, and helps in recognizing actions and gestures. The most widely used posture estimation datasets, MS COCO and MPII, consist largely of high-resolution training images. Accordingly, existing artificial intelligence-based human posture estimation models perform well on high-resolution images, but their performance drops sharply when a person appears small because they were photographed from far away, or when the captured image has a low resolution.

The technique proposed in a prior paper (Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.) converts the person regions in the training dataset to a fixed image size and trains the model on them, with the input image size set to one that works well on average on the test dataset. However, it performs poorly on small images in the test dataset.

In addition, the technique proposed in Korean Patent Publication No. 10-2019-0034380 includes a process of estimating a person's posture from an input image and correcting the motion, but it does not consider how to operate robustly on low-resolution images.

However, for practical industrial applications, it is useful to estimate the posture of a distant person as well as that of a nearby person. Therefore, a model capable of detecting posture in low-resolution images is essential.

In addition, existing techniques convert images of all sizes to a single fixed size before inputting them to the artificial intelligence model. In this process, upscaling a low-resolution image introduces artifacts that lower posture estimation accuracy, and reducing the size of a high-resolution image causes information loss.

Korean Patent Publication No. 10-2019-0034380

To solve the above problems, the technical object of the present invention is to provide an artificial intelligence-based human posture estimation apparatus and method that operate robustly on images of various sizes and, in particular, can estimate human posture in images containing a small person or in low-resolution images.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

As a means for solving the above technical problem, according to an embodiment of the present invention, an artificial intelligence-based human posture estimation apparatus may include: an extraction unit that extracts a region including a person (hereinafter referred to as an 'original human image') from an original image; a scaling unit that generates a plurality of target human images by scaling the original human image extracted by the extraction unit to have different resolutions; and a posture estimation unit that estimates a human posture by inputting the plurality of target human images generated by the scaling unit into different networks of an artificial intelligence model trained in advance for human posture estimation.

The scaling unit may generate the plurality of target human images based on the input image resolutions set in the artificial intelligence model.

The apparatus may further include a guiding channel adding unit that selects one or more of the plurality of target human images generated by the scaling unit and adds a guiding channel indicating that the selected image is a reference image, and the posture estimation unit may estimate the human posture by recognizing and referring to one or more target human images similar to the original human image based on the added guiding channel.

When the plurality of target human images are each input to a predetermined stem network among a plurality of stem networks built in the artificial intelligence model, the posture estimation unit may generate feature maps while downscaling the resolution of the target human images, input the generated feature maps to the corresponding subnetworks among a plurality of subnetworks, and estimate the human posture by fusing the feature maps while selectively changing or maintaining their resolution.

Meanwhile, according to another embodiment of the present invention, an artificial intelligence-based human posture estimation method may include: (A) extracting, by an electronic device, a region including a person (hereinafter referred to as an 'original human image') from an original image; (B) generating, by the electronic device, a plurality of target human images by scaling the original human image extracted in step (A) to have different resolutions; and (C) estimating, by the electronic device, a human posture by inputting the plurality of target human images generated in step (B) into different networks of an artificial intelligence model trained in advance for human posture estimation.

In step (B), the plurality of target human images may be generated based on the input image resolutions set in the artificial intelligence model.

The method may further include, after step (B), (D) selecting, by the electronic device, one or more of the plurality of target human images generated in step (B) and adding a guiding channel indicating that the selected image is a reference image, and in step (C), the human posture may be estimated by recognizing and referring to one or more target human images similar to the original human image based on the guiding channel added in step (D).

Step (C) may include: (C1) inputting each of the plurality of target human images to a predetermined stem network among a plurality of stem networks built in the artificial intelligence model; (C2) generating, by each of the stem networks, a feature map while downscaling the resolution of its target human image, and inputting the generated feature maps to the corresponding subnetworks among a plurality of subnetworks; and (C3) estimating, by the plurality of subnetworks, the human posture by fusing the feature maps input in step (C2) while selectively changing or maintaining their resolution.

According to the present invention, it is possible to provide excellent posture estimation performance by operating robustly on images of various resolutions and, in particular, to provide high posture estimation performance even on images containing a small person or on low-resolution images.

In addition, according to the present invention, 2D human posture estimation performance can be improved through the fusion of multi-scale input features and a guiding channel, without adding a computationally expensive super-resolution module.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram showing an artificial intelligence-based human posture estimation apparatus 100 according to an embodiment of the present invention;
FIG. 2 is a diagram showing an operation in which multi-resolution inputs and guiding channels are input to a network of an artificial intelligence model and output features are produced;
FIG. 3 is a diagram showing the framework of HRNet according to an embodiment of the present invention;
FIG. 4 is an exemplary view showing modified structures of various artificial intelligence models according to an embodiment of the present invention;
FIG. 5 is an exemplary view showing the result of scaling images of the COCO validation set to a size of 24×18; and
FIG. 6 is a flowchart illustrating an artificial intelligence-based method for extracting a human posture according to an embodiment of the present invention.

The above objects, other objects, features, and advantages of the present invention will be readily understood through the following preferred embodiments in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments described herein and may be embodied in other forms. Rather, the embodiments introduced herein are provided so that the disclosure will be thorough and complete and so that the spirit of the present invention will be fully conveyed to those skilled in the art.

Note in advance that, in some cases, parts that are commonly known and not closely related to the invention are not described, in order to avoid needless confusion in explaining the present invention.

Where terms such as first and second are used in this specification to describe components, the components should not be limited by these terms. These terms are used only to distinguish one component from another.

In addition, unless otherwise specified, it should be understood that a component may be implemented in software, in hardware, or in any combination of software and hardware.

In addition, the terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically states otherwise. The terms 'comprises' and/or 'comprising' used in the specification do not exclude the presence or addition of one or more components other than those mentioned.

Also, in this specification, terms such as 'unit', 'platform', and 'device' may be intended to refer to a functional and structural combination of hardware and the software that is driven by, or drives, that hardware. For example, the hardware here may be a data processing device including a CPU or another processor. Also, the software driven by the hardware may refer to a running process, an object, an executable file, a thread of execution, a program, and the like.

In addition, these terms may refer to a logical unit of predetermined code and the hardware resources on which that code runs, and a person of ordinary skill in the art will readily understand that they do not necessarily mean physically connected code or a single type of hardware.

Hereinafter, the specific technical content of the present invention will be described in detail with reference to the accompanying drawings.

Research on 2D human posture estimation is broadly divided into bottom-up and top-down approaches.

In the bottom-up approach, joint candidates are first found in the input image, and the human posture is then estimated by analyzing the relationships between the joints. In the top-down approach, people are first detected in the input image, and joints are then estimated within the region (bounding box) containing each detected person. Embodiments of the present invention use the top-down approach, which shows higher posture estimation performance.

More specifically, embodiments of the present invention may use an artificial intelligence model based on the structure of HRNet, a representative top-down method, that shows high posture estimation performance even on low-resolution images and operates robustly on images of all sizes (that is, all resolutions).

The networks of existing models scale the original image to a single fixed size and use it as the input of the artificial intelligence model for posture estimation. With this method, scaling a high-resolution image down to a low resolution causes excessive information loss, and scaling a low-resolution image up to a high resolution introduces many artifacts into the image, increasing the likelihood of posture estimation errors.

Therefore, in an embodiment of the present invention, the region of the original image that includes a person may be scaled to multiple resolutions, and the resulting target images may be used as inputs to the artificial intelligence model for posture estimation. In addition, an embodiment of the present invention may provide guiding information that tells the artificial intelligence model which target image is most similar in resolution (that is, size) to the original image, so that the network of the model can recognize the most similar target image and refer to it in posture estimation. As a result, a single network can operate robustly and effectively on images of various resolutions.

To this end, an embodiment of the present invention can improve 2D human posture estimation performance by adding a number of lightweight convolution layers, without a computationally expensive module such as a super-resolution module.

FIG. 1 is a block diagram showing an artificial intelligence-based human posture estimation apparatus 100 according to an embodiment of the present invention, and FIG. 2 is a diagram showing an operation in which multi-resolution inputs and guiding channels are input to a network of an artificial intelligence model and output features are produced.

Referring to FIG. 1, the artificial intelligence-based human posture estimation apparatus 100 according to an embodiment of the present invention may include an extraction unit 110, a scaling unit 120, a channel adding unit 130, and a posture estimation unit 140.

The extraction unit 110 may extract a region including a person (hereinafter referred to as the 'original human image', I) from the original image. The extraction unit 110 may analyze the input original image to detect a person, set a bounding box that contains the detected person, and crop it. Hereinafter, the image corresponding to the cropped bounding box is referred to as the original human image (I). The original image may be a single frame of a video or a still image.
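The cropping step can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: person detection is not shown, the `(top, left, height, width)` box layout and the name `crop_person` are assumptions, and the image is a plain list of rows rather than a real image type.

```python
def crop_person(image, box):
    """Crop a bounding box (top, left, height, width) from an image.

    `image` is a list of rows, each row a list of pixels. The box is
    clamped to the image so the crop never reads out of bounds.
    The box layout is a hypothetical convention for this sketch.
    """
    top, left, height, width = box
    rows = len(image)
    cols = len(image[0]) if rows else 0
    y0, x0 = max(0, top), max(0, left)
    y1, x1 = min(rows, top + height), min(cols, left + width)
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 4x5 "image" whose pixel value encodes its (row, column) position.
frame = [[(r, c) for c in range(5)] for r in range(4)]
person = crop_person(frame, (1, 2, 2, 2))
print(person)
```

The clamping matters in practice because a detector may return a box that extends slightly past the frame border.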

The scaling unit 120 may generate a plurality of target human images (I_1, I_n) by scaling the original human image (I) extracted by the extraction unit 110 to have different resolutions. In detail, the scaling unit 120 may scale the original human image (I), of size height (h) × width (w), into the target human images (I_1, I_2, …, I_n, …, I_N), which serve as the multi-resolution input images. Here, the sizes of the scaled images (I_1, I_2, …, I_N) are (h_1×w_1), (h_2×w_2), …, (h_N×w_N), respectively. The target human images (I_1, I_2, …, I_n, …, I_N) may be generated by bicubic interpolation of the original human image (I).
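The multi-resolution scaling can be sketched as below. To keep the example dependency-free it uses nearest-neighbour sampling as a stand-in for the bicubic interpolation named in the text, so it illustrates only the one-image-to-N-sizes structure. The function names and the target sizes are hypothetical.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a list-of-rows image to new_h x new_w by nearest-neighbour
    sampling (a simple substitute for bicubic interpolation)."""
    h, w = len(image), len(image[0])
    return [
        [image[min(h - 1, y * h // new_h)][min(w - 1, x * w // new_w)]
         for x in range(new_w)]
        for y in range(new_h)
    ]


def make_target_images(image, target_sizes):
    """Return one rescaled copy of `image` per (height, width) pair."""
    return [resize_nearest(image, th, tw) for th, tw in target_sizes]


# Hypothetical sizes: one larger and one smaller than the 4x2 original.
original = [[r * 2 + c for c in range(2)] for r in range(4)]
targets = make_target_images(original, [(8, 4), (2, 1)])
print([(len(t), len(t[0])) for t in targets])
```

In a real pipeline the inner resize would be replaced by a bicubic resampler from an image library; the surrounding loop over target sizes stays the same.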

The scaling unit 120 may generate the plurality of target human images (I_1, I_n) based on a plurality of different input image resolutions set in the artificial intelligence model.

Accordingly, the resolution of a target human image (I_1, I_n) may be lower or higher than that of the original human image (I). In the case of FIG. 2, the resolution of the target human image (I_1) is higher than that of the original human image (I), and the resolution of the target human image (I_n) is lower than that of the original human image (I).

The resolutions to which the scaling unit 120 scales may be set to optimal values determined experimentally while training on multiple datasets in the process of building the artificial intelligence model. Accordingly, the scaling unit 120 may generate the target human images either by checking in advance, in conjunction with the artificial intelligence model, the resolutions set in the model, or by using resolutions (that is, the resolutions set in the artificial intelligence model) entered by a user through a user interface (not shown).

The channel adding unit 130 may select, from among the plurality of target human images (I_1, I_n) generated by the scaling unit 120, one or more target human images whose resolution is closest to that of the original human image (I) or falls within a predetermined resolution range, and add to them a guiding channel indicating that they are reference images.

The guiding channel may be either a channel (C_G1), added so that the network (or artificial intelligence model) recognizes and refers to the target human image closest in size to the original human image, or a channel (C_G0), indicating that the target human image is not a reference image. Accordingly, a target human image to which the guiding channel C_G1 is added can be used by the network as a reference image for human posture estimation.

Referring to FIG. 2, one target human image (I_1) consists of an R channel (C_R), a G channel (C_G), and a B channel (C_B). The channel adding unit 130 may add a guiding channel (C_G0 or C_G1) to the three RGB channels. When N>2, the guiding channel is a channel filled with a binary value and may be added as the fourth channel of the target human image. For example, the indicator value '0' is written to every pixel of the guiding channel C_G0, and the indicator value '1' is written to every pixel of the guiding channel C_G1.
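Appending the guiding channel amounts to concatenating a constant plane to the RGB image. The sketch below assumes a list-of-rows image of `(R, G, B)` tuples; the function name `add_guiding_channel` is hypothetical.

```python
def add_guiding_channel(image, is_reference):
    """Append a constant fourth channel (1 for a reference image, else 0).

    `image` is a list of rows of (R, G, B) tuples; the result holds
    (R, G, B, guide) tuples with the same spatial size.
    """
    flag = 1 if is_reference else 0
    return [[(r, g, b, flag) for (r, g, b) in row] for row in image]


rgb = [[(10, 20, 30), (40, 50, 60)]]
print(add_guiding_channel(rgb, True))   # every pixel gains a trailing 1
print(add_guiding_channel(rgb, False))  # every pixel gains a trailing 0
```

Because the channel has the same value at every pixel, it carries a single bit per image while keeping the input a regular four-channel tensor that a convolution layer can consume.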

In detail, the channel adding unit 130 may generate and add a guiding channel (C_G1) with the indicator value '1' when the resolution of the target human image falls within a predefined range, and may add a guiding channel (C_G0) with the indicator value '0' when it falls outside the range. The resolution conditions for adding a guiding channel are described later with reference to [Table 1].

Alternatively, the channel adding unit 130 may compare the resolution of each of the target human images (I_1, I_n) with the resolution of the original human image (I), generate and add a guiding channel (C_G1) with the indicator value '1' to the one target human image (for example, I_n) whose resolution differs least, and add a guiding channel (C_G0) with the indicator value '0' to the remaining target human images.
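The closest-resolution rule can be sketched as follows. The text does not say how the "resolution difference" is measured, so the absolute difference in pixel count used here is an assumption, as are the function names and the example sizes.

```python
def pick_reference(original_size, target_sizes):
    """Return the index of the target size closest to the original size.

    Closeness is the absolute difference in pixel count (an assumed
    metric); ties go to the earliest target in the list.
    """
    oh, ow = original_size
    diffs = [abs(th * tw - oh * ow) for th, tw in target_sizes]
    return diffs.index(min(diffs))


def guiding_flags(original_size, target_sizes):
    """Return one 0/1 indicator per target, with 1 only at the reference."""
    ref = pick_reference(original_size, target_sizes)
    return [1 if i == ref else 0 for i in range(len(target_sizes))]


# Hypothetical target sizes (height, width) for a 60x45 person crop.
sizes = [(256, 192), (128, 96), (64, 48)]
print(guiding_flags((60, 45), sizes))
```

Each flag would then decide whether the corresponding target image receives the all-ones or the all-zeros guiding channel.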

Alternatively, when multiple candidate target human images fall within a threshold, the channel adding unit 130 may pick a predetermined number (one or more) of them in order of closeness, add a guiding channel (C_G1) with the indicator value '1' to the picked target human images, and add a guiding channel (C_G0) with the indicator value '0' to the remaining target human images.
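A minimal sketch of this threshold-then-top-k variant is shown below. The pixel-count distance, the function name `top_k_references`, the threshold value, and the sizes are all assumptions made for illustration.

```python
def top_k_references(original_size, target_sizes, threshold, k=1):
    """Flag up to k targets whose size difference is within `threshold`.

    Difference is measured in pixel count (an assumed metric). Candidates
    inside the threshold are ranked by closeness and the k closest get 1.
    """
    oh, ow = original_size
    area = oh * ow
    candidates = []
    for i, (th, tw) in enumerate(target_sizes):
        diff = abs(th * tw - area)
        if diff <= threshold:
            candidates.append((diff, i))
    chosen = {i for _, i in sorted(candidates)[:k]}
    return [1 if i in chosen else 0 for i in range(len(target_sizes))]


# Hypothetical sizes; a 90x70 crop has three candidates within 6000 pixels.
sizes = [(256, 192), (128, 96), (64, 48), (32, 24)]
print(top_k_references((90, 70), sizes, threshold=6000, k=2))
```

With `k=1` and a very large threshold this reduces to the single closest-resolution rule.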

In addition, when the resolution of the target human image (I_1) is higher than that of the original human image (I), the resolution of the target human image (I_n) is lower than that of the original human image (I), and the resolution difference between I_1 and I equals the resolution difference between I_n and I, the channel adding unit 130 may add a guiding channel with the indicator value '1' to the target human image (I_1) if the original human image (I) is a high-resolution image exceeding a preset reference, and may add a guiding channel with the indicator value '0' to the target human image (I_n) if it is a low-resolution image below the reference.

As described above, the channel adding unit 130 may add guiding channels that steer the multi-resolution images so that they selectively influence the network's human posture estimation according to the resolution of the original human image. As a result, when the original human image has a low resolution, the network can be led to estimate the posture by relying more on the target human image (or its features) that was scaled by a small factor and therefore has little degradation. Conversely, when the original human image has a high resolution, the network can be led to estimate the posture under the influence of the large target human image (or its features), which has almost no information loss.

The operation of the channel adding unit 130 described above may be omitted, and whether to omit it may be decided by the user.

The posture estimation unit 140 may estimate a human posture by inputting the plurality of target human images generated by the scaling unit 120 into different networks of an artificial intelligence model trained in advance for human posture estimation.

In addition, when the channel adding unit 130 has added guiding channels to the target human images, the posture estimation unit 140 may estimate the human posture by recognizing and referring to, based on the guiding channels, the one or more target human images among the plurality of target human images (I_1, I_n) that are closest to the original human image (I).

In addition, when the plurality of target human images (I_1, I_n) are each input to a predetermined stem network among a plurality of stem networks built in the artificial intelligence model, the posture estimation unit 140 may generate feature maps while downscaling the resolution of the target human images (I_1, I_n), input the generated feature maps to the corresponding subnetworks among a plurality of subnetworks, and estimate the human posture by fusing the feature maps while selectively changing or maintaining their resolution. A feature map, or the features in it, may include information used to represent a person or to estimate posture, such as the person's outline, shape, brightness contrast, and color.

The operation by which the posture estimation unit 140 estimates a posture using the artificial intelligence model is described in detail with reference to FIG. 2.

The artificial intelligence model shown in FIG. 2 may include a plurality of stem (STEM) networks and HRNet. Each stem network includes at least two strided convolution layers that extract image features from a target human image (I_n), and HRNet may include a plurality of subnetworks. The image features of the multi-resolution images, that is, of the plurality of target human images (I₁, …, I_n), may be extracted through the stem networks and used to estimate the positions of a person's joints. The image features extracted in the n-th subnetwork may be added to the features downscaled from the (n-1)-th subnetwork and fed into the next network. The operation of HRNet is described later with reference to FIG. 3.
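As a rough illustration of how strided convolution layers in a stem network reduce resolution, the sketch below applies the standard convolution output-size formula. The 3×3 kernel and padding of 1 are assumptions made for the example (the text only says the layers are strided), and `conv_out` and `stem_output_size` are hypothetical helper names.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula).

    The kernel and padding defaults are illustrative assumptions.
    """
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_size(height, width, num_strided_layers=2):
    """Apply `num_strided_layers` stride-2 convolutions to (height, width)."""
    for _ in range(num_strided_layers):
        height, width = conv_out(height), conv_out(width)
    return height, width


print(stem_output_size(256, 192))  # (64, 48)
print(stem_output_size(128, 96))   # (32, 24)
```

With two stride-2 layers each dimension shrinks by a factor of four, which matches the 256×192 input and 64×48 stem output pairing given later in [Table 1].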

In FIG. 2, the target human image (I₁) is input to the first stem network (SN₁), and the target human image (I_n) is input to the n-th stem network (SN_n). The first stem network (SN₁) extracts features from the input target human image (I₁). The extracted features are input to the first stage (S₁₁) of the first subnetwork of HRNet, where features are extracted again, and the resulting features may be downscaled by a convolution layer or kept at the same resolution.

In addition, the n-th stem network (SN_n) extracts features from the input target human image (I_n), and the extracted features are input to the first stage (S_n1) of the n-th subnetwork. At this point, the features output from an intermediate layer of an upper subnetwork (for example, the first subnetwork) are fused with the features extracted by the n-th stem network. To match the dimensions of the features, the feature map from the upper subnetwork may be downscaled before fusion.

Assuming that the feature map output from the (n-1)-th subnetwork is fused into the input of the n-th subnetwork, the input feature of S_n1 can be expressed as in [Equation 1].

In [Equation 1], stride=2 means that the feature output from the (n-1)-th stage of the (n-1)-th subnetwork has its feature-map resolution reduced by 50% by a convolution layer and is then fused with the feature output from the n-th stem network.
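The downscale-then-fuse step described for S_n1 can be sketched in plain Python. This is only an illustration under stated assumptions: 2×2 average pooling stands in for the learned stride-2 convolution, fusion is taken to be element-wise addition, and `downscale_stride2`, `fuse`, and `input_of_s_n1` are hypothetical names.

```python
def downscale_stride2(fmap):
    """Halve height and width by averaging each 2x2 block.

    Stand-in for a learned stride-2 convolution (assumption).
    """
    return [
        [
            (fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]


def fuse(a, b):
    """Element-wise sum of two feature maps of identical size."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("feature maps must have the same size before fusion")
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def input_of_s_n1(upper_features, stem_features):
    """Downscale the upper-subnetwork features, then fuse with the stem output."""
    return fuse(downscale_stride2(upper_features), stem_features)


upper = [[1.0, 1.0, 2.0, 2.0],
         [1.0, 1.0, 2.0, 2.0],
         [3.0, 3.0, 4.0, 4.0],
         [3.0, 3.0, 4.0, 4.0]]  # 4x4 map from the upper subnetwork
stem = [[10.0, 20.0],
        [30.0, 40.0]]           # 2x2 map from the n-th stem network
print(input_of_s_n1(upper, stem))  # [[11.0, 22.0], [33.0, 44.0]]
```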

FIG. 3 is a diagram illustrating the HRNet framework according to an embodiment of the present invention.

Referring to FIG. 3, HRNet provides high-performance posture estimation in a 2D posture estimation network by arranging layers both in series and in parallel. In HRNet, a plurality of high-to-low subnetworks are connected in parallel, and within each high-to-low subnetwork a plurality of stages are connected in series. The k-th stage of the n-th subnetwork is denoted S_nk.

Each subnetwork may receive a target human image of a different resolution.

Each stage includes a multi-scale fusion process that aggregates features from different subnetworks.

The input feature of S_nk may aggregate the output features of S_1(n+k-2), S_2(n+k-3), …, S_(n+k-2)1. For example, the features output from S₁₃, S₂₂, and S₃₁ are combined and fed into stage S₂₃.
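The indexing pattern in the example above (S₁₃, S₂₂, and S₃₁ feeding S₂₃) can be enumerated with a small helper. This is a sketch of the index arithmetic only; `fusion_sources` is a hypothetical name, and it does not check whether a given stage actually exists in a particular model variant.

```python
def fusion_sources(n, k):
    """Return (subnetwork, stage) pairs whose outputs feed stage S_nk.

    The pairs run from S_1(n+k-2) down to S_(n+k-2)1, so the subnetwork
    index rises while the stage index falls.
    """
    last = n + k - 2
    return [(i, last + 1 - i) for i in range(1, last + 1)]


print(fusion_sources(2, 3))  # [(1, 3), (2, 2), (3, 1)]
```

For S₁₁ the list is empty, which is consistent with the first stage of the first subnetwork receiving only stem features.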

Strided convolution or upsampling is used to match the sizes of the feature maps to be fused. To generate the input feature of S_nk, in an embodiment of the present invention the output feature of S_(n-1)k is reduced in size by a set ratio (for example, 50%) using a strided convolution and then passed to the lower subnetwork. The size of the output feature of stage S_(n+1)(k-2) is doubled through nearest-neighbor upsampling followed by a 1×1 convolution. In S_n(k-1), the feature map passes through a convolution layer without a change in size. The input feature of S_nk is generated by adding the feature maps once they have the same size.
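The upsampling path and the final summation can be illustrated as follows. This is a minimal sketch assuming single-channel feature maps stored as nested lists; the 1×1 convolution that follows upsampling is omitted, and `upsample_nearest_2x` and `sum_maps` are hypothetical names.

```python
def upsample_nearest_2x(fmap):
    """Double height and width by repeating every value in a 2x2 block."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def sum_maps(*maps):
    """Add any number of feature maps that share one size."""
    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all feature maps must share one size")
    return [[sum(m[r][c] for m in maps) for c in range(width)] for r in range(height)]


from_lower = upsample_nearest_2x([[1, 2], [3, 4]])  # 2x2 -> 4x4
from_same = [[10] * 4 for _ in range(4)]            # already 4x4
from_upper = [[100] * 4 for _ in range(4)]          # assumed already downscaled to 4x4
fused = sum_maps(from_lower, from_same, from_upper)
print(fused[0])  # [111, 111, 112, 112]
```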

Heatmaps (M), one for each joint, used to regress joint positions may be obtained from the output features of the first subnetwork. The ground-truth heatmaps may be generated by applying a 2D Gaussian with a standard deviation of 1 pixel at the ground-truth pixel location of each joint. The output representation may be trained with a mean squared error (MSE) loss so that it approaches the ground-truth heatmaps.
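A ground-truth heatmap and the MSE loss described above might be computed as in the sketch below. The Gaussian is left unnormalized so that its peak equals 1, which is a common convention but an assumption here; `gaussian_heatmap` and `mse` are hypothetical names.

```python
import math


def gaussian_heatmap(height, width, joint_row, joint_col, sigma=1.0):
    """2D Gaussian centered on the joint; peak value 1 (unnormalized, assumption)."""
    return [
        [
            math.exp(-((r - joint_row) ** 2 + (c - joint_col) ** 2) / (2.0 * sigma ** 2))
            for c in range(width)
        ]
        for r in range(height)
    ]


def mse(pred, target):
    """Mean squared error over all pixels of two same-sized heatmaps."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


truth = gaussian_heatmap(64, 48, joint_row=5, joint_col=7)
blank = [[0.0] * 48 for _ in range(64)]
print(truth[5][7])         # 1.0 at the joint location
print(mse(truth, truth))   # 0.0 for a perfect prediction
print(mse(blank, truth) > 0.0)  # True
```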

Through the operations described above, the posture estimation unit 140 may input the multi-resolution images (I₁, …, I_n) into the artificial intelligence model including the plurality of stem networks and HRNet, output 17 heatmaps (M) representing the 17 joints estimated for the human posture, and output from the 17 heatmaps (M) a result image showing the posture estimation result.
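The text does not say how joint coordinates are read out of the 17 heatmaps. One common approach, shown here purely as an assumption, is to take the location of the maximum value in each heatmap; `decode_joints` is a hypothetical name.

```python
def decode_joints(heatmaps):
    """Return one (row, col) per heatmap, taken at its maximum value (assumed decoding)."""
    joints = []
    for hm in heatmaps:
        best = max(
            ((r, c) for r in range(len(hm)) for c in range(len(hm[0]))),
            key=lambda rc: hm[rc[0]][rc[1]],
        )
        joints.append(best)
    return joints


# Build 17 toy heatmaps of size 4x3, each with a single peak.
heatmaps = []
for j in range(17):
    hm = [[0.0] * 3 for _ in range(4)]
    hm[j % 4][j % 3] = 1.0
    heatmaps.append(hm)

joints = decode_joints(heatmaps)
print(len(joints))  # 17
print(joints[:3])   # [(0, 0), (1, 1), (2, 2)]
```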

FIG. 4 is an exemplary diagram showing various modified structures of the artificial intelligence model according to an embodiment of the present invention.

Referring to (a) to (c) of FIG. 4, the resolution of I₁ is set to 256×192 in every case. In addition, I₂ or I₃, adjusted to a lower resolution (74×48, 32×24, or 128×96), is fed into a lower network. Accordingly, when the artificial intelligence model has the structure of (a), the scaling unit 120 scales the original input image into target human images (I₁, I₂) having resolutions of 256×192 and 64×48, respectively. The posture estimation unit 140 may also estimate the posture by switching to one of the modified artificial intelligence models shown in FIG. 4 according to the resolution of the original human image (I).

[Table 1] shows the guiding channel conditions applied in the artificial intelligence models (a) to (c) shown in FIG. 4.

[Table 1]
Model      I_n (height × width)   Stem output (height × width)   Guiding channel (condition)
OURS-(a)   256×192                64×48                          0
           74×48                  32×24                          1 if 96×96 < h×w; 0 otherwise
OURS-(b)   256×192                64×48                          0
           32×24                  16×12                          1 if 96×96 < h×w; 0 otherwise
OURS-(c)   256×192                64×48                          0
           128×96                 32×24                          1 if 64×64 < h×w < 128×128; 0 otherwise
           74×48                  16×12                          1 if h×w < 64×64; 0 otherwise

Referring to (a) of FIG. 4 and [Table 1], because the resolution of the target human image (I₂) is smaller than 96×96, the channel adding unit 130 may generate a guiding channel having an indicator value of '1' and add it as the fourth channel of the target human image (I₂), thereby marking it as a reference image. In addition, a guiding channel having an indicator value of '0' may be added to the target human image (I₁) to mark that it is not a reference image.

The per-resolution guiding channel conditions listed in [Table 1] are examples and may be changed according to the structure of the artificial intelligence model or the posture estimation performance.
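A guiding channel can be pictured as a constant fourth channel appended to an RGB image. The sketch below uses the conditions listed for model (c) in [Table 1]. It assumes that h and w are the height and width of the original human image and that "h×w" is compared as a pixel count, neither of which the text states explicitly; `guiding_flags_variant_c` and `add_guiding_channel` are hypothetical names.

```python
def guiding_flags_variant_c(orig_h, orig_w):
    """Indicator values for (I1, I2, I3) under the model (c) conditions.

    Interpreting h x w as a pixel count is an assumption.
    """
    area = orig_h * orig_w
    flag_i1 = 0
    flag_i2 = 1 if 64 * 64 < area < 128 * 128 else 0
    flag_i3 = 1 if area < 64 * 64 else 0
    return flag_i1, flag_i2, flag_i3


def add_guiding_channel(image, flag):
    """Append `flag` as a fourth channel to every [R, G, B] pixel."""
    return [[list(pixel) + [flag] for pixel in row] for row in image]


print(guiding_flags_variant_c(100, 80))  # (0, 1, 0)
print(guiding_flags_variant_c(40, 30))   # (0, 0, 1)

tiny = [[[255, 0, 0], [0, 255, 0]]]      # 1x2 RGB image
print(add_guiding_channel(tiny, 1))      # [[[255, 0, 0, 1], [0, 255, 0, 1]]]
```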

FIG. 5 is an exemplary diagram showing results for images of the COCO validation set scaled to a size of 24×18.

In FIG. 5, the conventional method refers to estimating a posture after scaling the input image to a single fixed size, the present invention refers to estimating a posture using multi-resolution images according to the embodiment described above, and the ground truth refers to the posture estimated from the actual original image included in the COCO validation set.

Referring to FIG. 5, posture estimation according to the embodiment of the present invention estimates joint positions more accurately than the baseline estimated by the conventional method, and the resulting skeleton is also closer to the actual posture than the baseline, being identical or nearly identical to it.

FIG. 6 is a flowchart illustrating an artificial intelligence-based human posture extraction method according to an embodiment of the present invention.

The electronic device that performs the artificial intelligence-based human posture extraction method shown in FIG. 6 may be the human posture estimation apparatus 100 described with reference to FIGS. 1 to 5, so a detailed description is omitted.

Referring to FIG. 6, the electronic device may extract an original human image including a person from an input original image (S610).

The electronic device may generate a plurality of target human images by scaling the original human image extracted in step S610 to have different resolutions (S620).

The electronic device may select, from among the plurality of target human images generated in step S620, one or more target human images whose resolution is closest to that of the original human image or falls within a predetermined resolution range, and may add a guiding channel indicating a reference image or a guiding channel indicating a non-reference image (S630).

The electronic device may estimate a human posture by inputting the plurality of target human images, to which the guiding channels have been added, into networks at different levels of an artificial intelligence model trained in advance for human posture estimation (S640).

In step S640, when the plurality of target human images are input to preset stem networks among the plurality of stem networks built into the artificial intelligence model, feature maps are generated while the resolution of the target human images is downscaled, the generated feature maps are input into the corresponding subnetworks among a plurality of subnetworks, and the human posture may be estimated by fusing the feature maps while selectively changing or maintaining their resolution.
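Steps S610 to S640 can be strung together as in the following skeleton. It is a sketch under several assumptions: the person bounding box is supplied by some detector that is not shown, nearest-neighbor resizing stands in for the scaling step, and `model` is a hypothetical callable representing the trained network. All function names are illustrative.

```python
def crop(image, top, left, height, width):
    """S610: cut the person region out of the original image."""
    return [row[left:left + width] for row in image[top:top + height]]


def resize_nearest(image, out_h, out_w):
    """S620: nearest-neighbor resize (stand-in for the scaling step)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def estimate_pose(original, box, target_sizes, flags, model):
    person = crop(original, *box)                                        # S610
    targets = [resize_nearest(person, h, w) for h, w in target_sizes]    # S620
    guided = [
        [[list(pixel) + [flag] for pixel in row] for row in target]      # S630
        for target, flag in zip(targets, flags)
    ]
    return model(guided)                                                 # S640


# Toy run: an 8x8 RGB image and a placeholder model that reports input shapes.
original = [[[r, c, 0] for c in range(8)] for r in range(8)]
shapes = estimate_pose(
    original,
    box=(2, 2, 4, 4),
    target_sizes=[(4, 4), (2, 2)],
    flags=[0, 1],
    model=lambda images: [(len(i), len(i[0]), len(i[0][0])) for i in images],
)
print(shapes)  # [(4, 4, 4), (2, 2, 4)]
```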

FIG. 7 is a block diagram showing a computing system that performs the artificial intelligence-based human posture estimation method according to an embodiment of the present invention.

Referring to FIG. 7, the computing system 700 may include at least one processor 710, a memory 730, a user interface input device 740, a user interface output device 750, a storage 760, and a network interface 770, which are connected through a bus 720. The computing system 700 may be the electronic device described with reference to FIG. 6 or the human posture estimation apparatus 100 described with reference to FIG. 1.

The processor 710 may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in the memory 730 and/or the storage 760. The memory 730 and the storage 760 may include various types of volatile or non-volatile storage media. For example, the memory 730 may include a read-only memory (ROM) 731 and a random access memory (RAM) 732.

Accordingly, the steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by the processor 710, or in a combination of the two. The software module may reside in a storage medium (that is, the memory 730 and/or the storage 760) such as RAM, flash memory, ROM, EPROM, EEPROM, a register, a hard disk, a removable disk, or a CD-ROM. An exemplary storage medium is coupled to the processor 710, and the processor 710 can read information from and write information to the storage medium. Alternatively, the storage medium may be integral with the processor 710. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside in a user terminal as separate components.

Although the present invention has been described and illustrated above in connection with preferred embodiments that exemplify its technical idea, the present invention is not limited to the exact configuration and operation shown and described, and those skilled in the art will appreciate that many changes and modifications can be made to the present invention without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention. The true scope of technical protection of the present invention should therefore be determined by the technical idea of the appended claims.

100: human posture estimation apparatus 110: extraction unit
120: scaling unit 130: channel adding unit
140: posture estimation unit

Claims

An artificial intelligence-based human posture estimation apparatus comprising:
an extraction unit configured to extract a region including a person (hereinafter referred to as an 'original human image') from an original image;
a scaling unit configured to generate a plurality of target human images by scaling the original human image extracted by the extraction unit to have different resolutions; and
a posture estimation unit configured to estimate a human posture by inputting the plurality of target human images generated by the scaling unit into different networks of an artificial intelligence model trained in advance to estimate a human posture.

The artificial intelligence-based human posture estimation apparatus according to claim 1,
wherein the scaling unit
generates the plurality of target human images based on the input image resolution set in the artificial intelligence model.

The artificial intelligence-based human posture estimation apparatus according to claim 1,
further comprising a guiding channel adding unit configured to select one or more of the plurality of target human images generated by the scaling unit and add a guiding channel indicating a reference image,
wherein the posture estimation unit
estimates the human posture by recognizing and referring to, based on the added guiding channel, one or more target human images similar to the original human image.

The artificial intelligence-based human posture estimation apparatus according to claim 1,
wherein the posture estimation unit,
when the plurality of target human images are input to preset stem networks among a plurality of stem networks built into the artificial intelligence model, generates feature maps while downscaling the resolution of the plurality of target human images, inputs the generated feature maps into corresponding subnetworks among a plurality of subnetworks, and estimates the human posture by fusing the feature maps while selectively changing or maintaining their resolution.

An artificial intelligence-based human posture extraction method comprising:
(A) extracting, by an electronic device, a region including a person (hereinafter referred to as an 'original human image') from an original image;
(B) generating, by the electronic device, a plurality of target human images by scaling the original human image extracted in step (A) to have different resolutions; and
(C) estimating, by the electronic device, a human posture by inputting the plurality of target human images generated in step (B) into different networks of an artificial intelligence model trained in advance for human posture estimation.

The artificial intelligence-based human posture extraction method according to claim 5,
wherein, in step (B),
the plurality of target human images are generated based on the input image resolution set in the artificial intelligence model.

The artificial intelligence-based human posture extraction method according to claim 5,
further comprising, after step (B),
(D) selecting, by the electronic device, one or more of the plurality of target human images generated in step (B) and adding a guiding channel indicating a reference image,
wherein, in step (C), the human posture is estimated by recognizing and referring to, based on the guiding channel added in step (D), one or more target human images similar to the original human image.

The artificial intelligence-based human posture extraction method according to claim 5,
wherein step (C) comprises:
(C1) inputting the plurality of target human images to preset stem networks among a plurality of stem networks built into the artificial intelligence model;
(C2) generating, by each of the plurality of stem networks, a feature map while downscaling the resolution of a target human image, and inputting the generated feature maps into corresponding subnetworks among a plurality of subnetworks; and
(C3) estimating, by the plurality of subnetworks, a human posture by fusing the feature maps input in step (C2) while selectively changing or maintaining their resolution.