KR102037997B1

KR102037997B1 - Electronic apparatus and method for generating contents

Info

Publication number: KR102037997B1
Application number: KR1020190008490A
Authority: KR
Inventors: 이윤재; 최성우; 이용건; 김산성
Original assignee: 한국방송공사
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2019-10-29

Abstract

An electronic device is disclosed. The electronic device includes: a communication device which receives a captured image; a memory which stores the inputted captured image; and a processor which extracts a face image included in each of a plurality of frames included in the inputted captured image, clusters the extracted face image into a plurality of people, generates track information for each of the people based on the face image clustered into a same person, and generates content for each of the people by using the inputted captured image and the generated track information. It is possible to generate the content automatically.

Description

ELECTRONIC APPARATUS AND METHOD FOR GENERATING CONTENTS

The present disclosure relates to an electronic device and a content generation method, and more particularly to an electronic device and a content generation method capable of automatically generating content for each person in a captured image using AI technology.

Recently, not only content that broadcasters provide directly over the air but also various forms of content created by individuals have been distributed. In particular, content that films only one specific member of a music group (hereinafter, a fancam video) has recently been actively distributed.

Because such fancam videos are generally shot by audience members watching the performance, the footage is often shaky and the member is often hidden by stage equipment or by people in front, so the video quality is not high.

Recently, some broadcasters have provided fancam videos themselves by bringing in camera equipment and hiring camera operators. However, this is costly because a separate camera and camera operator are needed for each member.

Accordingly, an object of the present disclosure is to provide an electronic device and a content generation method capable of automatically generating content for each person in a captured image using AI technology.

To achieve the above object, an electronic device according to the present disclosure includes a communication device that receives a captured image, a memory that stores the input captured image, and a processor that extracts the face images included in each of a plurality of frames of the input captured image, clusters the extracted face images into a plurality of persons, generates trajectory information for each person based on the face images clustered as the same person, and generates content for each person using the input captured image and the generated trajectory information.

In this case, the processor may detect the faces included in each of the plurality of frames using a pre-trained face recognition AI algorithm, and may extract a face image corresponding to each detected face, coordinate information of the face image, and size information of the face image.

Meanwhile, the processor may calculate a 128-dimensional feature vector for each of the extracted face images and cluster the extracted face images into a plurality of persons by calculating the mutual similarity of the calculated feature vectors.

In this case, the processor may input the calculated feature vectors into a pre-trained clustering AI algorithm to cluster the extracted face images into a plurality of persons.

Meanwhile, the processor may generate trajectory information for each person by connecting the face coordinate information in the frames that contain face images clustered as the same person.

In this case, if a preset frame contains no face image of a specific person, the processor may generate the trajectory information using the face coordinate information of the immediately preceding frame that does contain the face image.
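The gap-filling rule above can be illustrated with a minimal sketch. The function name `build_trajectory`, the dictionary layout of the detections, and the sample coordinates are assumptions made for illustration only; the disclosure does not specify a data format.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) face coordinates in pixels (assumed layout)


def build_trajectory(detections: Dict[int, Point], num_frames: int) -> List[Optional[Point]]:
    """Link one person's per-frame face coordinates into a trajectory.

    A frame with no detection reuses the coordinates of the most recent
    earlier frame in which the face was detected.
    """
    trajectory: List[Optional[Point]] = []
    last_seen: Optional[Point] = None
    for frame_index in range(num_frames):
        if frame_index in detections:
            last_seen = detections[frame_index]
        # Frames before the first detection stay None: nothing earlier to reuse.
        trajectory.append(last_seen)
    return trajectory


# Hypothetical detections for one person: the face is missing in frames 2 and 3.
detections = {0: (100.0, 50.0), 1: (104.0, 50.0), 4: (120.0, 52.0)}
trajectory = build_trajectory(detections, num_frames=5)
```

Here frames 2 and 3 inherit the coordinates from frame 1, so the cropped region would simply hold its position until the face is detected again.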

Meanwhile, if the face image of a specific person is absent for a preset number of frames or more, the processor may generate the trajectory information of that person for that frame section using the trajectory information of another person.

Meanwhile, the processor may verify the generated trajectory information for each person.

In this case, the processor may calculate a movement speed for each section using the generated trajectory information and determine that the frames corresponding to a section whose movement speed is equal to or greater than a preset value contain an error.
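As a rough sketch of this verification step, the following treats each pair of consecutive frames as one section and measures speed in pixels per frame. The function name `find_suspect_frames`, the section definition, and the threshold of 50 pixels per frame are illustrative assumptions, not values from the disclosure.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def find_suspect_frames(trajectory: List[Point], max_speed: float) -> List[int]:
    """Return indices of frames reached at a speed of max_speed or more.

    Speed is the Euclidean distance between consecutive trajectory points,
    in pixels per frame.
    """
    suspects: List[int] = []
    for i in range(1, len(trajectory)):
        (x0, y0), (x1, y1) = trajectory[i - 1], trajectory[i]
        speed = math.hypot(x1 - x0, y1 - y0)
        if speed >= max_speed:
            suspects.append(i)
    return suspects


# Hypothetical trajectory: frame 2 jumps to another face and frame 3 jumps back.
trajectory = [(100.0, 50.0), (104.0, 50.0), (400.0, 60.0), (108.0, 51.0)]
suspects = find_suspect_frames(trajectory, max_speed=50.0)
```

Both the jump away and the jump back exceed the threshold, so frames 2 and 3 are flagged for review.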

Meanwhile, the electronic device may further include a display that displays a user interface window including a first area indicating the sections, out of the entire input image, in which the face of a specific person was detected; a second area displaying a specific frame of the input image and the faces detected in that frame; and a third area displaying the extracted image corresponding to the specific person in that frame.

Meanwhile, for each of the plurality of frames included in the input captured image, the processor may extract, for each person, a vertical image whose height is greater than its width, and may generate vertical content for each person using the extracted vertical images.

In this case, the processor may identify the position of the person's face in a frame using the generated trajectory information and, based on the identified face position, extract an image area corresponding to the size of the person's face as the vertical image.
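One way to picture this crop is sketched below. The scale factor (crop height of six face heights), the 9:16 aspect ratio, and the placement of the face a quarter of the way down the crop are all assumptions chosen for illustration; the disclosure only states that the crop area corresponds to the face size and is taller than it is wide.

```python
from typing import Tuple


def vertical_crop_box(
    face_cx: float,
    face_cy: float,
    face_size: float,
    frame_w: int,
    frame_h: int,
    height_in_faces: float = 6.0,  # assumed: crop height as a multiple of face size
    aspect_w: int = 9,             # assumed portrait aspect ratio 9:16
    aspect_h: int = 16,
) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) of a portrait crop around a face."""
    crop_h = min(int(face_size * height_in_faces), frame_h)
    crop_w = min(crop_h * aspect_w // aspect_h, frame_w)
    # Center horizontally on the face and leave room for the body below it.
    left = int(face_cx - crop_w / 2)
    top = int(face_cy - crop_h / 4)
    # Clamp so the crop stays inside the frame.
    left = max(0, min(left, frame_w - crop_w))
    top = max(0, min(top, frame_h - crop_h))
    return left, top, crop_w, crop_h


# Hypothetical 4K frame with a 200-pixel face centered at (1920, 600).
box = vertical_crop_box(1920, 600, 200, frame_w=3840, frame_h=2160)
```

Because the crop size scales with the face size, a person standing closer to the camera gets a proportionally larger crop, which keeps the apparent framing roughly constant after resizing.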

Meanwhile, the processor may generate merged content by merging the content of a plurality of persons from among the generated per-person content.

Meanwhile, a content generation method according to an embodiment of the present disclosure includes extracting the face images included in each of a plurality of frames of an input captured image, clustering the extracted face images into a plurality of persons, generating trajectory information for each person based on the face images clustered as the same person, and generating content for each person using the input captured image and the generated trajectory information.

In this case, the extracting of the face images may include detecting the faces included in each of the plurality of frames using a pre-trained face recognition AI algorithm, and extracting a face image corresponding to each detected face, coordinate information of the face image, and size information of the face image.

Meanwhile, the clustering may include calculating a 128-dimensional feature vector for each of the extracted face images and clustering the extracted face images into a plurality of persons by calculating the mutual similarity of the calculated feature vectors.

Meanwhile, the generating of the trajectory information may include generating trajectory information for each person by connecting the face coordinate information in the frames that contain face images clustered as the same person.

Meanwhile, the content generation method may further include verifying the generated per-person trajectory information by calculating a movement speed for each section using the generated trajectory information and checking whether the calculated movement speed is equal to or greater than a preset value.

Meanwhile, the content generation method may further include displaying a user interface window including a first area indicating the sections, out of the entire input image, in which the face of a specific person was detected; a second area displaying a specific frame of the input image and the faces detected in that frame; and a third area displaying the extracted image corresponding to the specific person in that frame.

Meanwhile, in a computer-readable recording medium containing a program for executing a content generation method according to an embodiment of the present disclosure, the content generation method includes extracting the face images included in each of a plurality of frames of an input captured image, clustering the extracted face images into a plurality of persons, generating trajectory information for each person based on the face images clustered as the same person, and generating content for each person using the input captured image and the generated trajectory information.

As described above, according to various embodiments of the present disclosure, content for multiple persons can be generated from a single captured image, which reduces cost and allows individual vertical videos edited for each person appearing on stage to be generated quickly in a batch. In addition, because the vertical videos are produced in post-production editing, a fancam video of any required person can be generated at any time without restricting the shooting target. In addition, AI face recognition technology enables precise per-person tracking, so high-quality content can be generated. Finally, the vertical timeline allows face position detection results to be inspected and corrected faster than with a non-linear editor (NLE) that uses a horizontal timeline.

FIG. 1 is a diagram showing the configuration of a content generation system according to an embodiment of the present disclosure;
FIG. 2 is a diagram showing a detailed configuration of an electronic device according to an embodiment of the present disclosure;
FIG. 3 is a diagram for explaining a content generation method using AI technology according to an embodiment of the present disclosure;
FIG. 4 is a diagram for explaining an image cropping method;
FIG. 5 is a diagram for explaining a face detection method;
FIG. 6 is a diagram for explaining a clustering method using the detected face images;
FIG. 7 is a diagram for explaining an example of a user interface window;
FIG. 8 is a diagram showing a comparison between per-member content generated according to the present disclosure and conventional content;
FIG. 9 is a diagram showing an example of content generated according to a further embodiment; and
FIG. 10 is a diagram for explaining a content generation method according to an embodiment of the present disclosure.

The terms used in this specification are described briefly first, and the present disclosure is then described in detail.

The terms used in the embodiments of the present disclosure are, as far as possible, general terms that are currently in wide use, selected in consideration of their functions in the present disclosure; however, they may vary according to the intention of those skilled in the art, legal precedent, the emergence of new technologies, and the like. In certain cases, a term has been arbitrarily selected by the applicant, and its meaning is then described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meanings and the overall content of the present disclosure, not simply on their names.

The embodiments of the present disclosure may be variously modified and may take several forms, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to the specific embodiments, however, and should be understood to include all modifications, equivalents, and substitutes falling within the disclosed spirit and technical scope. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it could obscure the gist.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "consist of" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present disclosure, a 'module' or 'unit' performs at least one function or operation and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of 'modules' or 'units' may be integrated into at least one module and implemented by at least one processor, except for a 'module' or 'unit' that needs to be implemented in specific hardware.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure belongs can readily practice them. However, the present disclosure may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present disclosure clearly, and similar parts are given similar reference numerals throughout the specification.

Hereinafter, the present disclosure is described in more detail with reference to the drawings.

FIG. 1 is a diagram showing the configuration of a content generation system according to an embodiment of the present disclosure.

Referring to FIG. 1, a content generation system 1000 according to an embodiment of the present disclosure includes a photographing apparatus 10, an electronic device 100, and a server 20.

The photographing apparatus 10 captures images to generate a captured image. The photographing apparatus 10 may be any device capable of capturing images, such as a digital camera, a camcorder, a mobile phone, a PMP, a webcam, or a dashboard camera.

The photographing apparatus 10 may be placed at a position from which it can film a plurality of persons as subjects and perform the filming there. The photographing apparatus 10 may film in a fixed direction only, or may change direction in response to changes in the movement of the persons.

The photographing apparatus 10 may then transmit the generated captured image to the electronic device 100. Here, the captured image is a high-quality video and may have a resolution of 4K or higher. The captured image may also include audio. Meanwhile, in an implementation, the audio may instead be received directly from an external device other than the photographing apparatus (for example, a singer's microphone or the actual broadcast audio source).

The electronic device 100 may receive the captured image and generate content for each person included in the image based on the input captured image. Specifically, the electronic device 100 may extract the face images included in the image, cluster the extracted face images to generate trajectory information for each person, and generate content for each person based on the generated trajectory information. The detailed configuration and operation of the electronic device 100 are described later with reference to FIG. 2.

The electronic device 100 may then provide the generated per-person content to the server 20. Here, the electronic device 100 and the server 20 may be connected over the Internet, and the electronic device 100 and the photographing apparatus 10 may be connected over an internal network or connected directly.

The server 20 receives the per-person content and stores it. In response to a request from an individual terminal device, the server 20 may transmit the stored per-person content or provide it by streaming.

As described above, the content generation system 1000 according to this embodiment can generate multiple pieces of content (that is, per-person content) using only the captured image produced by a single photographing apparatus and provide them to users. In addition, because only one captured image is needed to generate the per-person content, costs can be reduced, and because the per-person content is generated using AI technology, higher-quality content can be produced more easily and quickly.

Meanwhile, although FIG. 1 illustrates the photographing apparatus 10 and the electronic device 100, and the server 20 and the electronic device 100, as being directly connected to each other, in an implementation each component may be connected via a separate external component. Also, although the content generated by the electronic device 100 has been described as being provided to only one server 20, the electronic device 100 may also provide it to a plurality of servers 20.

FIG. 2 is a diagram showing a detailed configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may include a communication device 110, a memory 120, a display 130, a manipulation input device 140, and a processor 150. Here, the electronic device 100 may be a PC, a notebook PC, a smartphone, a server, or any other device capable of image processing.

The communication device 110 is connected to the photographing apparatus 10 or the server 20 and may transmit and receive content. Specifically, the communication device 110 is provided to connect the electronic device 100 to external devices; it may be connected to a mobile device through a local area network (LAN) and the Internet, and may also be connected through a Universal Serial Bus (USB) port.

In addition to a wired connection, the communication device 110 may be connected to other electronic devices via a router or home router connected to the public Internet, and may be connected to that router either by wire or wirelessly, for example over Wi-Fi, Bluetooth, or cellular communication. Here, cellular communication may include, for example, Long-Term Evolution (LTE), LTE Advance (LTE-A), code division multiple access (CDMA), wideband CDMA (WCDMA), universal mobile telecommunications system (UMTS), Wireless Broadband (WiBro), or Global System for Mobile Communications (GSM).

The communication device 110 may receive the captured image from the photographing apparatus 10. The captured image may have any of various video formats. It may be footage of singers on a stage, but is not limited to this; as long as a plurality of persons appear, it may be footage of an opera, a play, a sporting event, and so on.

The communication device 110 may also receive audio content from a separate external device. Here, the audio content is the audio corresponding to the captured image, and may be a single audio source or multiple per-person audio sources. That is, the communication device 110 may receive a plurality of pieces of voice content, one for each of a plurality of persons. The received audio content may be provided directly through each group member's microphone or through a broadcasting system.

The communication device 110 may also transmit the per-person content generated by the electronic device 100 to the server 20.

The memory 120 is a component for storing the operating system (O/S) for running the electronic device 100, as well as the software, AI algorithms, and data used to generate per-person content. The memory 120 may be implemented in various forms, such as RAM, ROM, flash memory, an HDD, external memory, or a memory card, and is not limited to any one of these.

The memory 120 stores the captured image. Specifically, the memory 120 may store the captured image received through the communication device 110. The memory 120 may also temporarily store the plurality of frames extracted from the captured image for video editing, and may store the finally generated per-person content. Here, the captured image may be a video file such as AVI or MOV.

In addition, the memory 120 may store the extracted face images, trajectory information, and other data produced during content generation.

The memory 120 may also store the AI algorithms (or programs) for face image detection and face image clustering. Here, the AI algorithms run on a TensorFlow-based machine learning framework and may be, for example, FaceNet or Chinese Whispers.

The display 130 displays a user interface window through which the functions supported by the electronic device 100 can be selected. Specifically, the display 130 may display a user interface window for selecting the various functions provided by the electronic device 100. The display 130 may be a monitor such as an LCD, CRT, or OLED, and may also be implemented as a touch screen that simultaneously performs the function of the manipulation input device 140 described below.

The display 130 may display a user interface window for verifying or correcting the per-person trajectory information. This user interface window may include a first area indicating the sections, out of the entire input image, in which the face of a specific person was detected; a second area displaying a specific frame of the input image and the faces detected in that frame; and a third area displaying the extracted image corresponding to the specific person in that frame. An example of a user interface window that the display 130 can show is described later with reference to FIG. 7.

The display 130 may also display the result of verifying the trajectory information. In addition, the display 130 may display the input captured image or the generated per-person content.

The manipulation input device 140 may receive from the user a selection of a function of the electronic device 100 and a control command for that function. Specifically, the manipulation input device 140 may receive a user control command for correcting the calculated trajectory information. Such a user control command may be input through the user interface window described above.

The processor 150 controls each component in the electronic device 100. Specifically, when a boot command is input by the user, the processor 150 may boot using the operating system stored in the memory 120.

When a captured video is input, the processor 150 extracts a plurality of frames included in the input captured video. The plurality of frames may be all frames included in the captured video, or may be frames extracted at a preset time interval (for example, every 0.5 seconds). The extracted frames may have an image format such as BMP or JPG.
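The interval-based frame extraction described above can be sketched as follows. This is only an illustration: the function name `sample_frame_indices` and its parameters are hypothetical, and the 0.5-second default simply mirrors the example value given in the text.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 0.5) -> List[int]:
    """Return the indices of frames spaced about interval_s seconds apart.

    Assumption: the video has a constant frame rate, so a time interval
    maps to a fixed step in frame indices.
    """
    if total_frames <= 0 or fps <= 0 or interval_s <= 0:
        return []
    # At least one frame per step, so very small intervals fall back to every frame.
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


# Three seconds of 30 fps video sampled every 0.5 s.
indices = sample_frame_indices(total_frames=90, fps=30.0)
```

With an interval no longer than one frame period, the step collapses to 1 and every frame is selected, which corresponds to the "all frames" case.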

The processor 150 then extracts the face images included in each of the plurality of frames of the input captured video. Specifically, the processor 150 may detect the faces included in each of the plurality of frames by using an AI algorithm trained for face detection. Here, the AI algorithm is a TensorFlow-based face detection algorithm, although various other AI algorithms capable of face detection may be used.

When a face is detected in a frame, the processor 150 may extract a face image corresponding to the detected face region, and may extract the coordinates and size information of the face image. When a plurality of faces are detected in one frame, the processor 150 may extract the image, coordinates, size information, and the like for each detected face.

The processor 150 then clusters the extracted face images into a plurality of persons. Specifically, the processor 150 may calculate a 128-dimensional feature vector for each of the extracted face images. The processor 150 may then compare the feature vectors of the images with one another, that is, calculate the mutual similarity of the calculated feature vectors, and cluster the extracted images into a plurality of persons. In doing so, the processor 150 may input the calculated feature vectors into a clustering algorithm to cluster the extracted face images into a plurality of persons.
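The disclosure does not name a particular clustering algorithm, so the following is only a minimal sketch of similarity-based grouping under stated assumptions: Euclidean distance stands in for the similarity measure, a greedy nearest-centroid rule stands in for the clustering algorithm, and the name `greedy_cluster` and the `threshold` value are hypothetical. The same code works for 128-dimensional vectors; short vectors are used here only for readability.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_cluster(vectors: Sequence[Sequence[float]], threshold: float = 0.6) -> List[int]:
    """Assign each vector to the nearest existing cluster centroid that lies
    within `threshold`; otherwise start a new cluster. Returns one label per vector."""
    centroids: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for v in vectors:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = euclidean(v, c)
            if d < best_d:
                best, best_d = i, d
        if best is None:
            centroids.append(list(v))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            # Update the running mean of the chosen cluster.
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], v)]
            counts[best] = n + 1
            labels.append(best)
    return labels


labels = greedy_cluster([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]])
```

A greedy pass like this depends on input order and on the threshold, which is one reason a later verification step on the resulting trajectories is useful.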

The processor 150 then generates per-person trajectory information based on the face images clustered as the same person. Specifically, the processor 150 may generate the trajectory information for each person by connecting the face coordinate information in the frames that contain face images clustered as the same person.
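Connecting the face coordinates of one cluster across frames can be sketched as below. The data layout (a list of `(frame_index, x, y)` detections with a parallel list of cluster labels) and the name `build_trajectories` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Detection = Tuple[int, float, float]  # (frame_index, x, y) of a detected face


def build_trajectories(detections: Sequence[Detection],
                       labels: Sequence[int]) -> Dict[int, List[Detection]]:
    """Group detections by cluster label and order each group by frame index,
    yielding one trajectory (a time-ordered list of face positions) per person."""
    tracks: Dict[int, List[Detection]] = defaultdict(list)
    for detection, label in zip(detections, labels):
        tracks[label].append(detection)
    return {label: sorted(points) for label, points in tracks.items()}


detections = [(15, 410.0, 200.0), (0, 400.0, 200.0), (0, 900.0, 210.0), (15, 905.0, 212.0)]
tracks = build_trajectories(detections, labels=[0, 0, 1, 1])
```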

Meanwhile, a singer on stage may be positioned so that the face is not visible (for example, facing downward or standing with their back to the audience). To handle such cases, if a given frame contains no face image of a specific person, the processor 150 may generate the trajectory information by using the face coordinate information of the immediately preceding frame that does contain a face image. In implementation, the trajectory information of a frame containing a face image that follows the frame without one may be used instead, or the average of the trajectory information immediately before and immediately after may be used.
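The three gap-filling options above (previous frame, following frame, or their average) can be sketched as one function. The name `fill_missing`, the `mode` strings, and the dictionary layout are hypothetical choices for this illustration.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def fill_missing(frames: List[int], known: Dict[int, Point], mode: str = "previous") -> Dict[int, Point]:
    """Fill frames that lack a face position.

    frames: ordered sampled frame indices.
    known:  frame index -> (x, y) for frames where the face was detected.
    mode:   "previous", "next", or "average" of the nearest known neighbours.
    """
    filled: Dict[int, Point] = {}
    for i, f in enumerate(frames):
        if f in known:
            filled[f] = known[f]
            continue
        prev: Optional[Point] = next((known[g] for g in reversed(frames[:i]) if g in known), None)
        nxt: Optional[Point] = next((known[g] for g in frames[i + 1:] if g in known), None)
        if mode == "average" and prev is not None and nxt is not None:
            choice: Optional[Point] = ((prev[0] + nxt[0]) / 2, (prev[1] + nxt[1]) / 2)
        elif mode == "next":
            choice = nxt if nxt is not None else prev
        else:
            # "previous", or "average" when only one neighbour exists.
            choice = prev if prev is not None else nxt
        if choice is not None:
            filled[f] = choice
    return filled


known = {0: (0.0, 0.0), 2: (10.0, 4.0)}
```

Falling back to the other neighbour when the preferred one does not exist is a design choice of this sketch, not something the text specifies.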

Meanwhile, a singer on stage may briefly leave the stage as part of the choreography. To handle such cases, if there is no face image of a specific person for a preset number of frames or more, the processor 150 may generate the trajectory information of that person for the corresponding frame section by using the trajectory information of another person. Here, the other person may be the person having the longest trajectory information. Alternatively, the implementation may use the face coordinate information of the last frame that contains a face image, rather than the trajectory of another person.
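A rough sketch of borrowing another person's trajectory for a long gap follows. It assumes "longest trajectory" means the track with the most recorded positions, which is an interpretation rather than something the text defines; the name `borrow_from_longest` is likewise hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple

Track = Dict[int, Tuple[float, float]]  # frame index -> (x, y)


def borrow_from_longest(tracks: Dict[Hashable, Track], person: Hashable,
                        gap_frames: Iterable[int]) -> Track:
    """Return a copy of `person`'s track in which the given gap frames are
    filled with positions from the other person who has the most positions."""
    result = dict(tracks[person])
    others = {p: t for p, t in tracks.items() if p != person}
    if not others:
        return result
    donor = max(others, key=lambda p: len(others[p]))
    for f in gap_frames:
        if f not in result and f in others[donor]:
            result[f] = others[donor][f]
    return result


tracks = {
    "A": {0: (100.0, 50.0), 45: (120.0, 50.0)},
    "B": {0: (500.0, 60.0), 15: (505.0, 60.0), 30: (510.0, 60.0), 45: (515.0, 60.0)},
    "C": {0: (900.0, 55.0)},
}
patched = borrow_from_longest(tracks, "A", gap_frames=[15, 30])
```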

The processor 150 then verifies the generated per-person trajectory information. Specifically, the processor 150 may calculate the movement speed for each section by using the generated trajectory information, and may determine that the frames corresponding to a section whose movement speed is equal to or greater than a preset value contain an error. In this case, the processor 150 may control the display 130 to display a user interface window for correcting the trajectory in the frame region where the error was found, and may correct the trajectory according to a correction command input through the manipulation input device 140.
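The speed check can be sketched as follows, assuming positions are in pixels, timestamps are in seconds, and the preset value is a pixels-per-second limit. The name `find_speed_errors` and the numbers in the example are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # (time_s, x, y)


def find_speed_errors(track: Sequence[Sample], max_speed: float) -> List[Tuple[float, float]]:
    """Return the (start_time, end_time) of every section of a time-ordered
    track whose movement speed is equal to or greater than max_speed."""
    errors: List[Tuple[float, float]] = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        if speed >= max_speed:
            errors.append((t0, t1))
    return errors


# The jump from x=110 to x=1800 within 0.5 s suggests two people were merged.
track = [(0.0, 100.0, 100.0), (0.5, 110.0, 100.0), (1.0, 1800.0, 100.0), (1.5, 1810.0, 100.0)]
suspect = find_speed_errors(track, max_speed=1000.0)
```

The flagged sections are the ones that would be presented to the user for correction.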

The processor 150 then generates per-person content by using the input captured video and the generated trajectory information. Specifically, for each of the plurality of frames included in the input captured video, the processor 150 may extract a vertical image whose height is greater than its width. Here, the vertical image may have a resolution suited to the portrait mode of a smartphone.

In doing so, the processor 150 may identify the position of the person's face in the frame by using the generated trajectory information, and may extract a vertical image sized according to the person's face size, based on the identified face position. For example, the size of a person's face may vary depending on whether the person is at the front or the back of the stage. Accordingly, the processor 150 may vary the size of the extraction region around the identified face position so that an effect like zooming the video in or out is produced. Even when the size of the region varies, the width-to-height ratio of the extracted vertical image may be kept at the same value. This ratio may correspond to the width-to-height ratio of a smartphone.
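One way to picture a crop region that scales with face size while keeping a fixed aspect ratio is sketched below. The 9:16 ratio, the `scale` factor of 6, centering the face in the crop, and the name `crop_region` are all assumptions for illustration; the text only requires that the ratio stay constant.

```python
from typing import Tuple


def crop_region(face_cx: float, face_cy: float, face_h: float, scale: float = 6.0,
                aspect_w: int = 9, aspect_h: int = 16) -> Tuple[float, float, float, float]:
    """Return (left, top, width, height) of a portrait crop centered on the face.

    The crop height is proportional to the face height, so a larger face
    (person nearer the camera) yields a larger crop, i.e. a zoomed-out view.
    The width always follows from the fixed aspect ratio.
    """
    crop_h = face_h * scale
    crop_w = crop_h * aspect_w / aspect_h
    return (face_cx - crop_w / 2, face_cy - crop_h / 2, crop_w, crop_h)


near = crop_region(960.0, 540.0, face_h=160.0)  # large face, larger crop
far = crop_region(960.0, 540.0, face_h=80.0)    # small face, smaller crop
```

Because every crop shares one aspect ratio, the crops can later be resized to a single output resolution without distortion.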

In implementation, the processor 150 may keep the person's face at the center where possible, but may also extract the vertical image with a margin, moving the region of the extracted vertical image only when the face position has moved by more than a certain amount.
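The margin behaviour can be illustrated in one dimension. In this sketch the crop center simply snaps to the face when the face drifts beyond the margin; a real implementation might move it more smoothly. The name `follow_with_margin` and the values are hypothetical.

```python
from typing import List, Optional, Sequence


def follow_with_margin(face_xs: Sequence[float], margin: float) -> List[float]:
    """Return the crop-center x for each frame. The center stays put while the
    face remains within `margin` of it and re-centers on the face otherwise."""
    centers: List[float] = []
    center: Optional[float] = None
    for x in face_xs:
        if center is None or abs(x - center) > margin:
            center = x
        centers.append(center)
    return centers


centers = follow_with_margin([100.0, 105.0, 95.0, 140.0, 145.0], margin=20.0)
```

Small movements of the face therefore do not shake the output video.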

Meanwhile, when the face of a specific person is located at either edge of the frame, extracting an image centered on that face may fail to yield a normal vertical image. In this case, the processor 150 may extract a vertical image in which the area with no frame content is filled with black (or left blank). Alternatively, the processor 150 may ensure that the key frame serving as the reference for extracting the vertical image does not extend outside the frame; that is, only in this case, the person's face is allowed to be off center.
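The second option above, keeping the crop inside the frame at the cost of an off-center face, amounts to clamping the crop origin. The sketch below assumes pixel coordinates with the origin at the top-left; the name `clamp_crop` is hypothetical.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (left, top, width, height)


def clamp_crop(left: float, top: float, w: float, h: float,
               frame_w: float, frame_h: float) -> Rect:
    """Shift the crop rectangle so that it lies inside the frame.
    The size is unchanged; only the origin moves. If the crop is larger than
    the frame along an axis, it is aligned to that axis's origin."""
    left = min(max(left, 0), max(frame_w - w, 0))
    top = min(max(top, 0), max(frame_h - h, 0))
    return (left, top, w, h)


# A face near the left edge of a 1920x1080 frame would push the crop off-frame.
clamped_left = clamp_crop(-50, 0, 608, 1080, 1920, 1080)
clamped_right = clamp_crop(1800, 0, 608, 1080, 1920, 1080)
```

The first option in the text (padding with black) would instead keep the origin and fill the out-of-frame part of the crop with blank pixels.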

The processor 150 then generates vertical content for each person by using the extracted vertical images. Specifically, the processor 150 may generate the vertical content (that is, a video) by rendering (or encoding) the vertical images extracted for each person.

When a plurality of sound source data are received, the processor 150 may generate a vertical video by merging the vertical images corresponding to a person with the sound source data corresponding to that person. The mapping between the plurality of sound source data and the plurality of vertical contents may be performed by user selection, or may be performed automatically through voice recognition and face recognition.

As described above, the electronic device 100 according to the present embodiment can generate a plurality of contents (that is, per-person contents) using only the captured video produced by a single photographing device and provide them to users. In addition, because the electronic device 100 according to the present embodiment needs only one captured video to generate the per-person contents, costs can be reduced, and because it generates the per-person contents using AI technology, higher-quality content can be produced more easily and quickly.

Although FIGS. 1 and 2 have been illustrated and described as extracting a face image from a frame as the face detection result and using that image, in implementation the clustering may instead be performed using the original frame and the generated face coordinate information, without a separate face image extraction step. The face coordinate information may be a previously detected result, arbitrarily set face coordinates, face coordinate information detected by another face detection program, or the like.

FIG. 3 is a diagram for describing a content generation method using AI technology according to an embodiment of the present disclosure.

Referring to FIG. 3, a high-quality video is input (310). The input video may be a video of 4K quality or higher. When the input video is of 4K quality or higher, each cropped image is smaller than or equal to the original image in size, and the final per-person video has at most the original resolution.

For ease of description, it is assumed below that the input video is a video in which several singers on a stage are captured at the same time.

Because a dance singer keeps moving around the stage, with a conventional shooting method the photographer must continuously adjust the imaging direction of the camera to follow the singer's path of movement. If, instead, a user manually sets the crop region of the image in the input video, the user must adjust the crop region to follow the singer's path of movement.

However, both existing methods require the photographer or the user to adjust the imaging direction or the crop region by hand. Because the position of a singer performing a vigorous dance changes in every frame, it is not feasible for a user to change the crop region in response to each change in position.

Accordingly, the present disclosure automatically calculates the crop region by using a deep-learning-based face detection technique. A method of calculating a crop region for each singer using the deep-learning-based face detection technique is described in detail below.

First, when a high-quality video is input, frames of the input video are sampled (320). Specifically, because performing face detection on every frame takes a considerable amount of image processing time, only frames at a preset time interval may be sampled first in order to improve the processing speed. For example, frames may be sampled at 0.5-second intervals.

In implementation, sampling finer than the interval described above may be performed for more precise tracking. The value given above is merely an example; a time value that satisfies both precision and processing speed may be determined through experiments and used.

Face detection is then performed on each of the sampled frames (330). Specifically, the faces included in each of the sampled frames may be detected by using a deep-learning-based face detection algorithm. Through this process, the face images included in each frame can be extracted, and various information such as the coordinates of each face image within the frame and the size of the face image can be generated.

Clustering is then performed on the extracted face images (340). First, a 128-dimensional feature vector is extracted for each of the extracted face images, and the faces, now represented as vector values, can be classified into a plurality of persons by calculating their mutual similarity. Connecting the face coordinates of the face images grouped as the same person then yields the movement trajectory data for each person.

Meanwhile, a person's face may be recognized differently depending on the direction from which the camera captures it. That is, in the result of the clustering process described above, different people may be grouped together as one person, and the face of one person may be split across a plurality of persons.

In other words, the initially generated trajectory data may contain various errors, so a speed verification operation is performed to verify whether another person's face has been grouped into the result classified as one person. For example, a specific person may be found on the left side in one frame and at the far right in the next frame. Movement faster than a physically possible speed indicates that the grouping in that section may be wrong.

Accordingly, the movement speed of the person (or subject) is calculated from the calculated trajectory data, and when the calculated movement speed exceeds the speed at which a person can physically move, it can be determined that there is an error.

When this first verification is completed, key frames containing the per-person trajectory information may be extracted (360). Here, a key frame is information indicating the image region (that is, the crop region) to be extracted from the input video according to the person's trajectory information. The key frame may be determined around the extracted face coordinates of the person, taking into account the size and position of the face. For example, when the face in a specific frame is calculated to be large, the person is standing further forward than in other frames, so the crop region may be determined in a zoomed-out form. Alternatively, when the face coordinates in a specific frame are lower than in others, the person can be regarded as sitting down, so the crop region may be determined in a zoomed-in form.

The crop region may have a vertical shape. That is, it is an image whose vertical length is greater than its horizontal length, and it may have a resolution that is comfortable for a user to watch while holding a smartphone vertically.

Then, to perform the above verification a second time, a key frame correction operation is performed (370). For this operation, the electronic device 100 may display a user interface window for trajectory verification. An example of the displayed user interface window is described later with reference to FIG. 7.

When the key frame correction is completed, an image is extracted from each frame according to the corrected key frame (380). The extracted images and the audio data included in the input video (or other audio data provided externally) are then encoded (or rendered) (390) to generate a video for each person. The number of generated videos may correspond to the number of persons classified in the preceding clustering step. For example, when five persons were classified in the clustering step, five videos may be generated. If, however, the key frames for five persons are adjusted to four during the key frame correction, four videos corresponding to the adjusted number of key frames may be generated.

Meanwhile, when the user inputs that there are four persons in the video, the electronic device 100 may automatically calculate trajectories only for the four persons, among the five classified in the clustering step, that have the largest numbers of face images, and generate four videos. The generated videos may then be transmitted to an external server or the like.
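Selecting the persons with the most face images can be sketched with a simple count over the cluster labels. The name `top_people` and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def top_people(labels: Sequence[Hashable], n: int) -> List[Hashable]:
    """Return the n cluster labels that own the most face images, largest first."""
    return [label for label, _ in Counter(labels).most_common(n)]


# Five clusters were found, but the user states that only four people appear.
labels = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3, 4, 4]
kept = top_people(labels, n=4)
```

Here cluster 2, which owns a single face image, is dropped; such a small cluster is plausibly a misdetection or a split of another person, though the text does not say so explicitly.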

Although FIG. 3 has been illustrated and described as extracting a vertical image whose height is greater than its width and generating vertical content accordingly, in implementation it is also possible to generate content whose width and height are equal, or content in the shape of a polygon other than a rectangle (for example, a trapezoid).

FIG. 4 is a diagram for describing an image cropping method.

Referring to FIG. 4, one sampled frame 410 contains four persons. Through the AI face detection engine described above, a crop region is set for each of the four persons in the frame, and four cropped images 420, 430, 440, and 450 are extracted as shown.

Each of the cropped images 420, 430, 440, and 450 is a vertical image in which the face of the corresponding person is centered. To place the person's face at the center of the cropped image, it is necessary to know where the person's face is located within the frame. The face detection operation is described below with reference to FIG. 5.

FIG. 5 is a diagram for describing a face detection method.

Referring to FIG. 5, four faces 510, 520, 530, and 540 are detected in one frame 500. Such face detection may be performed using a face detection library that runs on the TensorFlow framework.

Meanwhile, such face detection is performed on individual images obtained by splitting the video into frames. When several people appear, several sets of face coordinates are returned, so the movement of each person cannot be tracked merely by connecting the detection values of preceding and following frames.

Accordingly, the present disclosure performs a clustering operation in order to track the movement of each person. This is described below with reference to FIG. 6.

FIG. 6 is a diagram for describing a clustering method using the detected face images.

FIG. 6 shows an example of extracted face images. Although only nine extracted images are shown in the illustrated example, the number of images actually clustered may equal the number of faces contained in the sampled frames. For example, when a three-minute video is sampled every 0.5 seconds and four people are always present in each frame, 1,440 face images may be used.
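The figure of 1,440 follows from 180 seconds sampled every 0.5 seconds (360 frames) with four faces per frame. A small check, using a hypothetical helper name:

```python
def expected_face_count(duration_s: float, interval_s: float, faces_per_frame: int) -> int:
    """Number of face images when every sampled frame contains the same number of faces."""
    sampled_frames = int(duration_s / interval_s)
    return sampled_frames * faces_per_frame


count = expected_face_count(duration_s=180, interval_s=0.5, faces_per_frame=4)
```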

A 128-dimensional feature vector is extracted for each of these face images. The feature vector characterizes a face, and may consist of values related to facial features (eye size, the angle between the eyes and the nose, and so on).

Once the feature vectors are extracted, the extracted face images can be classified into a plurality of persons by calculating the mutual similarity between the calculated vector values. Connecting the face coordinates of the face images grouped as the same person then yields the movement trajectory data for each person.

FIG. 7 is a diagram for describing an example of a user interface window.

Referring to FIG. 7, the user interface window 700 may include a first area 710, a second area 720, and a third area 730.

The first area 710 displays the sections, among all sections of the input video, in which the face of a specific person was detected. Through the first area, the user can see the time sections in which the face of the specific person was or was not detected. For example, the user may review only the sections in which no face was detected to confirm that the face was indeed absent, or conversely may select a time section in which a face was detected to confirm that face detection and classification were performed correctly.

As shown, the first area 710 of the present disclosure is a vertical timeline. Unlike Non-Linear Editing System (hereinafter, NLE) programs, which use a horizontal timeline, this vertical timeline lays out consecutive still frames vertically around the current editing point and lets the user scroll with the mouse wheel to quickly inspect whether the face position detection results are correct.

The second area 720 displays a specific frame of the input video and the faces detected in that frame. As shown in FIG. 7, the second area may also display the frames immediately before or after the specific frame. For example, three or more consecutive frames in total may be used: the current frame and the frames before and after it, relative to the editing point. In addition, because the face positions can be inspected and corrected while the original video is viewed in real time, the person's path of movement can be grasped easily.

The third area 730 displays the extracted image corresponding to the specific person in the specific frame. Through the third area 730, the user can check the final result (the vertical video).

In implementation, the user interface window 700 may include an area for selecting a person, and a plurality of sets of the first to third areas described above may be displayed simultaneously, one set per person. The window may also be implemented with any one of the first to third areas omitted.

FIG. 8 is a diagram showing a comparison between per-member content generated according to the present disclosure and conventional content. Specifically, the first image is a result of conventional content, and the second image is a result of content generated according to the present disclosure.

Referring to FIG. 8, the person is shifted toward the right in the first image, whereas the person is centered in the second image. This is because, in the present embodiment, face recognition through AI technology enables precise tracking of the subject. That is, because the face coordinates of the person are known, a video can be generated in which the face is fixed precisely at the top center of the image.

In particular, with the conventional direct shooting method, there is no way to recover if the subject is lost during shooting. The captured video according to the present disclosure, by contrast, covers the full view of the stage, so the subject is never lost. In addition, because tracking is performed through AI technology, precise tracking is possible even when following a singer performing a vigorous dance.

Although FIG. 8 has been illustrated and described as generating the per-person content by performing only precise tracking in the final image generation process, in implementation the per-person content may also be generated, according to a user setting or deliberately, so that it looks as though a person had done the tracking. That is, key frames can be calculated in various ways from the detected face positions, and various contents can be produced accordingly. For example, when the user selects one of beginner, intermediate, and advanced, a tracking result corresponding to the selected level may be output. When a plurality of levels are selected, the electronic device 100 may also generate a plurality of contents for each person and level.

In implementation, rather than simply outputting content corresponding to a single person, it is also possible to generate and output merged content in which images of a plurality of persons are combined. For an instrumentalist, it is also possible to detect the position of the hands rather than the face and to output content for body parts other than the face. For example, for a video of a piano performance, only the video corresponding to the pianist's hands may be output, or the videos of the performer's face and hands may be merged and displayed.

In addition, as shown in FIG. 9, in the case of an ensemble using several instruments, a hand image of a first person playing the piano and a hand image of a second person playing the guitar may be merged and output.

FIG. 10 is a diagram for describing a content generation method according to an embodiment of the present disclosure.

First, a face image included in each of a plurality of frames of the input captured image is extracted (S1010). Specifically, a pre-trained face detection AI algorithm may be used to detect the faces in each of the frames, and the face image corresponding to each detected face, the coordinate information of the face image, and the size information of the face image may be extracted.

Then, the extracted face images are clustered into a plurality of persons (S1020). Specifically, a 128-dimensional feature vector may be calculated for each extracted face image, and the mutual similarity of the calculated feature vectors may be computed to cluster the extracted face images into a plurality of persons.
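
The clustering in S1020 can be illustrated with a minimal sketch. The disclosure does not name the similarity measure or the clustering algorithm, so the cosine similarity, the greedy assignment to the first vector seen for each person, the 0.8 threshold, and the function names `cosine_similarity` and `cluster_faces` are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_faces(features: Sequence[Sequence[float]],
                  threshold: float = 0.8) -> List[int]:
    """Assign a person label to each feature vector.

    Assumption: a face joins the most similar existing person if the
    similarity reaches `threshold`; otherwise it starts a new person.
    """
    representatives: List[Sequence[float]] = []  # first vector seen per person
    labels: List[int] = []
    for vec in features:
        best_label, best_sim = -1, threshold
        for label, rep in enumerate(representatives):
            sim = cosine_similarity(vec, rep)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label < 0:
            representatives.append(vec)
            best_label = len(representatives) - 1
        labels.append(best_label)
    return labels
```

A production system would more likely use a batch clustering algorithm over all vectors at once; the greedy pass here only shows how pairwise similarity of the feature vectors can separate faces into persons.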

Then, trajectory information for each person is generated based on the face images clustered as the same person (S1030). Specifically, the trajectory information for each person may be generated by connecting the face coordinate information in the frames that contain face images clustered as the same person.
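
As a rough sketch of S1030, the face coordinates of each person can be grouped and ordered by frame to form a trajectory. The record types `FaceDetection` and `TrackPoint`, their fields, and the function `build_trajectories` are hypothetical names introduced here; the disclosure only states that the face coordinate information is connected per person.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class FaceDetection(NamedTuple):
    """One clustered face: frame index, person label, face center, face size."""
    frame: int
    person: int
    x: float
    y: float
    size: float


class TrackPoint(NamedTuple):
    """One point of a person's trajectory."""
    frame: int
    x: float
    y: float
    size: float


def build_trajectories(
        detections: Iterable[FaceDetection]) -> Dict[int, List[TrackPoint]]:
    """Group detections by person and connect them in frame order."""
    tracks: Dict[int, List[TrackPoint]] = defaultdict(list)
    for det in detections:
        tracks[det.person].append(
            TrackPoint(det.frame, det.x, det.y, det.size))
    for points in tracks.values():
        points.sort(key=lambda p: p.frame)  # connect coordinates in time order
    return dict(tracks)
```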

At this time, a primary verification may be performed using the generated trajectory information. Specifically, the movement speed for each section may be calculated using the generated trajectory information, and the generated trajectory information for each person may be verified by checking whether the calculated movement speed is equal to or greater than a preset value.
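
The primary verification can be sketched as follows. The disclosure says only that the per-section speed is compared with a preset value; treating a section at or above that value as suspicious (for example, a face assigned to the wrong person), measuring speed in pixels per frame, and the name `find_suspicious_sections` are assumptions of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def find_suspicious_sections(
        track: Sequence[Tuple[int, float, float]],
        max_speed: float) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) for each section whose speed >= max_speed.

    `track` is a frame-ordered list of (frame, x, y) points for one person.
    Speed is the distance between consecutive points divided by the frame gap.
    """
    flagged: List[Tuple[int, int]] = []
    for (f0, x0, y0), (f1, x1, y1) in zip(track, track[1:]):
        gap = f1 - f0
        if gap <= 0:
            continue  # skip duplicate or out-of-order frames
        speed = math.hypot(x1 - x0, y1 - y0) / gap
        if speed >= max_speed:
            flagged.append((f0, f1))
    return flagged


# Usage: a jump of 500 pixels in one frame is flagged with a limit of 50.
track = [(0, 100.0, 100.0), (1, 103.0, 104.0), (2, 403.0, 504.0), (3, 405.0, 505.0)]
print(find_suspicious_sections(track, max_speed=50.0))  # [(1, 2)]
```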

In addition, a secondary verification may be obtained from the user by displaying the generated trajectory information. Specifically, a user interface window may be displayed that includes a first area indicating the sections, among all sections of the input image, in which the face of a specific person is detected, a second area displaying a specific frame of the input image and the faces detected in that frame, and a third area displaying the extracted image corresponding to the specific person in that frame. Accordingly, the user may check the generated trajectory through the displayed user interface window and, if a correction is needed during the check, may correct the trajectory directly through the same window.

Then, content for each person is generated using the input captured image and the generated trajectory information (S1040). The generated content may then be transmitted to an external server.
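
One way to picture S1040 is as a per-frame crop around the tracked face. The sketch below computes a vertical crop rectangle (taller than it is wide) from the face position and face size. The 9:16 shape, the `width_scale` factor, centering the crop on the face, and the function name `vertical_crop` are illustrative assumptions and are not taken from the disclosure.

```python
from typing import Tuple


def vertical_crop(face_x: float, face_y: float, face_size: float,
                  frame_w: float, frame_h: float,
                  width_scale: float = 4.0,
                  aspect: float = 16.0 / 9.0) -> Tuple[float, float, float, float]:
    """Return (left, top, width, height) of a vertical crop around a face.

    Assumptions: the crop width is `width_scale` times the face size, the
    height is `aspect` times the width, and the rectangle is shifted as
    needed so that it stays inside the frame.
    """
    crop_w = min(face_size * width_scale, frame_w, frame_h / aspect)
    crop_h = min(crop_w * aspect, frame_h)
    left = min(max(face_x - crop_w / 2.0, 0.0), frame_w - crop_w)
    top = min(max(face_y - crop_h / 2.0, 0.0), frame_h - crop_h)
    return (left, top, crop_w, crop_h)
```

Applying such a rectangle to every frame along a person's trajectory, then encoding the cropped frames, would yield one video per person from a single wide shot of the stage.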

Accordingly, the content generation method according to the present embodiment can generate a plurality of contents (that is, content for each person) using only the captured image produced by a single imaging device and provide them to users. The content generation method of FIG. 10 may be executed on an electronic device having the configuration of FIG. 2, and may also be executed on electronic devices having other configurations.

In addition, the content generation method described above may be implemented as a program including an executable algorithm that can be run on a computer, and the program may be stored and provided in a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, a cache, or a memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the programs for performing the various methods described above may be stored and provided in a non-transitory readable medium such as a CD, a DVD, a hard disk, a Blu-ray disc, a USB device, a memory card, or a ROM.

In addition, although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described. Various modifications may of course be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or perspective of the present disclosure.

1000: content generation system 10: imaging device
20: server 100: electronic device
110: communication device 120: memory
130: display 140: operation input device
150: processor

Claims

An electronic device comprising:
a communication device configured to receive a captured image;
a memory configured to store the input captured image; and
a processor configured to extract face images included in each of a plurality of frames included in the input captured image, cluster the extracted face images into a plurality of persons, generate trajectory information for each person based on the face images clustered as the same person, and generate content for each person using the input captured image and the generated trajectory information,
wherein the processor
calculates a movement speed for each section using the generated trajectory information, and verifies the generated trajectory information for each person by checking whether the calculated movement speed is equal to or greater than a preset value.

delete

The electronic device of claim 1,
further comprising a display configured to display a user interface window including a first area indicating the sections, among all sections of the input image, in which a face of a specific person is detected, a second area displaying a specific frame of the input image and the faces detected in the specific frame, and a third area displaying an extracted image corresponding to the specific person in the specific frame.

The electronic device of claim 9,
wherein the first area
is a vertical timeline displaying, in a vertical direction, the sections in which the face of the specific person is detected.

The electronic device of claim 1,
wherein the processor
extracts, for each person and from each of the plurality of frames included in the input captured image, a vertical image whose vertical length is longer than its horizontal length, and generates vertical content for each person using the extracted vertical images.

The electronic device of claim 11,
wherein the processor
identifies a face position of a person in a frame using the generated trajectory information, and extracts, as the vertical image, an image area corresponding to the face size of the person based on the identified face position.

The electronic device of claim 1,
wherein the processor
generates merged content by merging a plurality of the generated contents for each person.

A content generation method comprising:
extracting a face image included in each of a plurality of frames included in an input captured image;
clustering the extracted face images into a plurality of persons;
generating trajectory information for each person based on the face images clustered as the same person;
calculating a movement speed for each section using the generated trajectory information, and verifying the generated trajectory information by checking whether the calculated movement speed is equal to or greater than a preset value; and
generating content for each person using the input captured image and the generated trajectory information.

delete

The method of claim 14,
further comprising displaying a user interface window including a first area indicating the sections, among all sections of the input image, in which a face of a specific person is detected, a second area displaying a specific frame of the input image and the faces detected in the specific frame, and a third area displaying an extracted image corresponding to the specific person in the specific frame.

A computer readable recording medium comprising a program for executing a content generation method,
the content generation method comprising:
extracting a face image included in each of a plurality of frames included in an input captured image;
clustering the extracted face images into a plurality of persons;
generating trajectory information for each person based on the face images clustered as the same person;
calculating a movement speed for each section using the generated trajectory information, and verifying the generated trajectory information by checking whether the calculated movement speed is equal to or greater than a preset value; and
generating content for each person using the input captured image and the generated trajectory information.