KR102457176B1 - Electronic apparatus and method for generating contents

Info

Publication number: KR102457176B1
Application number: KR1020210146650A
Authority: KR
Inventors: 이윤재; 최성우; 이용건; 홍영기; 홍민수
Original assignee: 한국방송공사
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-10-21

Abstract

Disclosed is an electronic device. The electronic device comprises: a communication device for receiving captured images in real time; and a processor for generating a face ID and coordinate information for each of a plurality of face regions included in the input captured images, and generating image content for each of a plurality of faces in real time using the generated face ID and coordinate information and the captured images. The processor detects a face included in an input captured image using a pre-trained face detection deep learning model, extracts a face image corresponding to the detected face and the coordinate information of the face image, extracts the unique value of the face image, and compares the extracted unique value with a pre-stored unique value for each person to determine the face ID of the face image. Therefore, provided are an electronic device and a method for generating content, wherein content can be generated automatically in real time for each person in a captured image.

Description

ELECTRONIC APPARATUS AND METHOD FOR GENERATING CONTENTS

The present disclosure relates to an electronic device and a content generation method and, more specifically, to an electronic device and a content generation method capable of automatically generating content for each person in a captured image in real time.

Recently, broadcasts are shot with many cameras in order to show various viewpoints. For example, to capture a one-shot of a performer who reacts well, a dedicated camera for each performer is used in addition to general cameras such as a full-scene camera and a libero camera.

Because recent cameras support high resolution, various methods have been proposed for generating per-person images by editing the image captured by a single camera.

However, existing methods have limitations: they are applicable only to a specific type of program (for example, a music broadcast), they can generate individual content only by post-processing an image whose shooting has been completed, they cannot be used when a plurality of persons appear in the image, or they can generate an individual image for only one of the plurality of persons.

A broadcaster has not only music programs but also various other programs such as entertainment programs, current-affairs and cultural programs, and news programs, and programs such as news or live Internet broadcasts require real-time processing. A per-person content generation method applicable to such a variety of programs has therefore been required.

Accordingly, an object of the present disclosure is to provide an electronic device and a content generation method capable of automatically generating content for each person in a captured image in real time.

To achieve the above object, an electronic device according to the present disclosure includes a communication device that receives a captured image in real time, and a processor that generates a face ID and coordinate information for each of a plurality of face regions included in the input captured image and generates image content for each of the plurality of faces in real time using the generated face ID and coordinate information and the captured image. The processor detects a face included in the input captured image using a pre-trained face detection deep learning model, extracts a face image corresponding to the detected face and coordinate information of the face image, extracts a unique value of the face image, and compares the extracted unique value with pre-stored unique values for each person to determine the face ID of the face image.
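The face ID determination described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the disclosure does not state how unique values are represented or compared, so the vector representation, the cosine similarity measure, the `min_similarity` threshold, and the names `cosine_similarity` and `determine_face_id` are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def determine_face_id(unique_value, stored_values, min_similarity=0.6):
    """Return the ID whose pre-stored unique value is most similar to the
    extracted one, or None when nothing reaches min_similarity.

    stored_values maps a hypothetical person ID to its pre-stored vector.
    """
    best_id, best_score = None, min_similarity
    for face_id, stored in stored_values.items():
        score = cosine_similarity(unique_value, stored)
        if score >= best_score:
            best_id, best_score = face_id, score
    return best_id


# Hypothetical pre-stored unique values for two persons.
stored = {"person_A": [1.0, 0.0, 0.0], "person_B": [0.0, 1.0, 0.0]}
```

With these stored values, a vector close to `person_A`'s is assigned that ID, while a vector that resembles neither stored value yields `None`.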

In this case, the processor may extract a frame image from the captured image at every preset time period, generate a face ID and coordinate information for each of a plurality of face regions included in the extracted frame image, and, within the preset time period, generate image content for each of the plurality of faces in real time using that face ID and coordinate information.
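The periodic scheme above, in which detection runs once per period and its result is reused for the frames in between, might look like the sketch below. The period is expressed here as a number of frames, which is an assumption (the disclosure speaks only of a preset time period); `track_faces` and the `detect` callback are hypothetical names.

```python
def track_faces(frames, detect, period=5):
    """Run detect() on every `period`-th frame and reuse its result for the
    frames in between.

    detect(frame) is assumed to return {face_id: coordinate_info}.
    Returns a list of (frame_index, faces) pairs, one per input frame.
    """
    faces = {}
    results = []
    for index, frame in enumerate(frames):
        if index % period == 0:
            faces = detect(frame)  # refresh face IDs and coordinates
        results.append((index, dict(faces)))  # content uses the latest result
    return results
```

The point of the design is that the comparatively expensive detection and identification step runs only once per period, while content generation continues for every frame.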

In this case, the processor may perform a preset pre-processing operation on the extracted frame image and generate a face ID and coordinate information for each of a plurality of face regions included in the pre-processed frame image.

In this case, the pre-processing operation may include at least one of converting the image file format into a [Channel, Width, Height] RGB format and reducing the image resolution to a preset size.
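The two pre-processing steps can be illustrated with plain Python lists. This sketch assumes the incoming frame is laid out as [Height][Width] rows of (B, G, R) pixels and reduces resolution by simple subsampling; neither the input layout nor the scaling method is specified in the disclosure, and `preprocess` and `step` are hypothetical names.

```python
def preprocess(frame, step=2):
    """Subsample a frame and reorder it to [Channel][Width][Height] in RGB.

    frame: list of rows, each row a list of (b, g, r) tuples (assumed layout).
    step:  keep every `step`-th row and column (naive resolution reduction).
    """
    small = [row[::step] for row in frame[::step]]
    height, width = len(small), len(small[0])
    # Channel 0 is R, 1 is G, 2 is B; the source pixels are ordered (b, g, r).
    return [
        [[small[y][x][2 - c] for y in range(height)] for x in range(width)]
        for c in range(3)
    ]
```

A production pipeline would more likely use an array library and a proper resampling filter; the nested lists here only make the index order explicit.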

Meanwhile, the processor may input the captured image into the pre-trained face detection deep learning model to obtain list information on face positions and a probability for each face position, filter out face positions whose probability is lower than a preset value, and perform non-maximum suppression (NMS) to merge faces whose positions overlap each other, thereby detecting the faces in the captured image.
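The probability filter followed by NMS can be sketched as below, using the common greedy, IoU-based form of NMS. The box format `(x1, y1, x2, y2)`, the thresholds `min_prob` and `iou_threshold`, and the function names are assumptions; the disclosure does not give concrete values or a specific NMS variant.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect_faces(candidates, min_prob=0.5, iou_threshold=0.4):
    """Filter low-probability candidates, then apply greedy NMS.

    candidates: list of (box, probability) pairs from the detection model.
    Returns the kept (box, probability) pairs, highest probability first.
    """
    survivors = sorted(
        (c for c in candidates if c[1] >= min_prob),
        key=lambda c: c[1],
        reverse=True,
    )
    kept = []
    for box, prob in survivors:
        # Keep a box only if it does not overlap a stronger kept box too much.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, prob))
    return kept
```

Overlapping candidates for the same face collapse into the single highest-probability box, so each face yields one position.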

Meanwhile, the processor may compare the current coordinate information with the coordinate information detected immediately before and, when the movement of the coordinate information is equal to or greater than a preset value, generate image content using the changed coordinate information.
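A minimal sketch of this movement check follows. It assumes the coordinate information is reduced to a single (x, y) point and that movement is measured as Euclidean distance in pixels; `update_coordinates` and `min_shift` are hypothetical names, and the disclosure does not specify the distance measure or the preset value.

```python
import math


def update_coordinates(previous, detected, min_shift=20.0):
    """Return the coordinates to use for content generation.

    The newly detected point replaces the previous one only when it has
    moved by at least min_shift; smaller movements are ignored.
    """
    shift = math.hypot(detected[0] - previous[0], detected[1] - previous[1])
    return detected if shift >= min_shift else previous
```

Ignoring small movements keeps the generated per-person image from jittering when a face shifts only slightly between detections.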

Meanwhile, the processor may control the communication device so that the image content for each of the plurality of faces is transmitted through a different output port.

Meanwhile, the processor may generate merged content by merging at least two of the generated pieces of image content for the plurality of faces.
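One possible form of merging is placing two per-person frames side by side. The disclosure does not say how content is merged, so the layout, the row-list frame representation, and the name `merge_side_by_side` are purely illustrative assumptions.

```python
def merge_side_by_side(left, right):
    """Concatenate two frames horizontally.

    Each frame is a list of pixel rows; both frames must have equal height.
    """
    if len(left) != len(right):
        raise ValueError("frames must have the same height")
    return [left_row + right_row for left_row, right_row in zip(left, right)]
```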

Meanwhile, a content generation method according to an embodiment of the present disclosure includes: receiving a captured image in real time; detecting a face included in the input captured image using a pre-trained face detection deep learning model, generating a face image corresponding to the detected face, and determining coordinate information of the face image; extracting a unique value of the face image and comparing the extracted unique value with pre-stored unique values for each person to determine a face ID of the face image; and generating image content for each detected face in real time using the face ID, the coordinate information, and the captured image.

In this case, the content generation method may further include extracting a frame image from the captured image at every preset time period. The determining of the coordinate information of the face image may generate a face ID and coordinate information for each of a plurality of face regions included in the extracted frame image, and the generating may, within the preset time period, generate image content for each of the plurality of faces in real time using that face ID and coordinate information.

In this case, the extracting may perform a preset pre-processing operation on the extracted frame image, and a face ID and coordinate information may be generated for each of a plurality of face regions included in the pre-processed frame image.

Meanwhile, the determining of the coordinate information of the face image may input the captured image into the pre-trained face detection deep learning model to obtain list information on face positions and a probability for each face position, filter out face positions whose probability is lower than a preset value, and perform non-maximum suppression (NMS) to merge faces whose positions overlap each other, thereby detecting the faces in the captured image.

Meanwhile, the generating may compare the current coordinate information with the coordinate information detected immediately before and, when the movement of the coordinate information is equal to or greater than a preset value, generate image content using the changed coordinate information.

Meanwhile, the content generation method may further include transmitting the image content for each detected face through a different output port.

Meanwhile, the content generation method may further include generating merged content by merging at least two of the generated pieces of image content for the plurality of faces.

Meanwhile, in a computer-readable recording medium including a program for executing a content generation method according to an embodiment of the present disclosure, the content generation method includes: receiving a captured image in real time; detecting a face included in the input captured image using a pre-trained face detection deep learning model, generating a face image corresponding to the detected face, and determining coordinate information of the face image; extracting a unique value of the face image and comparing the extracted unique value with pre-stored unique values for each person to determine a face ID of the face image; and generating image content for each detected face in real time using the face ID, the coordinate information, and the captured image.

In this case, the determining of the coordinate information may input the captured image into the pre-trained face detection deep learning model to obtain list information on face positions and a probability for each face position, filter out face positions whose probability is lower than a preset value, and perform non-maximum suppression (NMS) to merge faces whose positions overlap each other, thereby detecting the faces in the captured image.

Meanwhile, the generating may compare the current coordinate information with the coordinate information detected immediately before and, when the movement of the coordinate information is equal to or greater than a preset value, generate image content using the changed coordinate information.

Meanwhile, the content generation method may further include transmitting the image content for each of the plurality of faces through a different output port.

As described above, according to various embodiments of the present disclosure, content for multiple persons can be generated in real time from only a single captured image, which reduces costs.

FIG. 1 is a diagram illustrating the configuration of a content generation system according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a detailed configuration of an electronic device according to an embodiment of the present disclosure;
FIG. 3 is a diagram for explaining a content generation method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart for explaining the content generation method of FIG. 3 in more detail;
FIG. 5 is a diagram for explaining a content generation method according to another embodiment of the present disclosure;
FIG. 6 is a diagram comparing the camera configuration in an existing shooting environment with the camera configuration when a shooting environment according to the present disclosure is applied;
FIG. 7 is a diagram illustrating an example of content generated according to the present disclosure; and
FIG. 8 is a diagram for explaining a content generation method according to an embodiment of the present disclosure.

The terms used in this specification are briefly described first, and the present disclosure is then described in detail.

The terms used in the embodiments of the present disclosure are, as far as possible, general terms that are currently in wide use, selected in consideration of their functions in the present disclosure; they may nevertheless vary depending on the intention of those skilled in the art, precedent, the emergence of new technology, and the like. In specific cases there are also terms arbitrarily selected by the applicant, and in such cases their meaning is described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on the meaning of each term and the overall content of the present disclosure, not simply on the name of the term.

The embodiments of the present disclosure may be modified in various ways and may take various forms, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to the specific embodiments, and the disclosure should be understood to include all modifications, equivalents, and substitutes falling within the disclosed spirit and technical scope. In describing the embodiments, a detailed description of related known technology is omitted when it is determined that it may obscure the subject matter.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

A singular expression includes the plural expression unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present disclosure, a 'module' or 'unit' performs at least one function or operation and may be implemented as hardware, as software, or as a combination of hardware and software. In addition, a plurality of 'modules' or 'units' may be integrated into at least one module and implemented by at least one processor, except for a 'module' or 'unit' that needs to be implemented with specific hardware.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily practice them. However, the present disclosure may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present disclosure clearly, and similar reference numerals denote similar parts throughout the specification.

Hereinafter, the present disclosure is described in more detail with reference to the drawings.

FIG. 1 is a diagram illustrating the configuration of a content generation system according to an embodiment of the present disclosure.

Referring to FIG. 1, a content generation system 1000 according to an embodiment of the present disclosure includes a photographing device 10 and an electronic device 100.

The photographing device 10 captures images to generate a captured image. The photographing device 10 may be any device capable of capturing images, such as a digital camera, a camcorder, a mobile phone, a PMP, a webcam, or a dashboard camera (black box).

The photographing device 10 may be placed at a position from which a plurality of persons can be captured as subjects and perform the shooting operation there. The photographing device 10 may shoot in a fixed direction only, or may change its direction in response to changes in the persons' movement paths.

Meanwhile, when a plurality of persons are positioned along the same line and the shooting distance is long, the focus may differ slightly between the center and the periphery of the captured image. In that case, the image quality for a person at the center may differ from the image quality for a person at the periphery. To address this, the plurality of persons may be arranged at the same distance from the photographing device 10, or different image processing may be applied to each region of the captured image, so that the per-person content generated in the process described later does not differ in quality.

The photographing device 10 may transmit the generated captured image to the electronic device 100. Here, the captured image is a high-quality video and may have a resolution of 4K or higher (for example, 4K, 6K, or 8K). The captured image may also include audio. In an implementation, the audio may instead be received directly from an external device other than the photographing device (for example, a singer's microphone or the actual broadcast audio source).

The electronic device 100 may receive the captured image and generate content for each person included in the image based on the input captured image. Specifically, the electronic device 100 may detect the faces included in the image, recognize the person for each detected face, generate coordinate information for each face region, and generate per-person content using the generated coordinates. The detailed configuration and operation of the electronic device 100 are described later with reference to FIG. 2.
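As an illustration of how generated coordinates could drive per-person content, the sketch below computes a fixed-size crop window centered on a face and clamped to the frame. The fixed output size, the centering rule, and the name `crop_window` are assumptions; the disclosure states only that the coordinates are used to generate per-person content. The sketch also assumes the output size does not exceed the frame size.

```python
def crop_window(face_box, frame_size, out_size):
    """Return (left, top, right, bottom) of a crop centered on the face.

    face_box:   (x1, y1, x2, y2) of the detected face region.
    frame_size: (width, height) of the captured frame.
    out_size:   (width, height) of the per-person output, assumed to fit
                inside the frame.
    """
    x1, y1, x2, y2 = face_box
    frame_w, frame_h = frame_size
    out_w, out_h = out_size
    center_x, center_y = (x1 + x2) // 2, (y1 + y2) // 2
    # Clamp so the window never leaves the frame.
    left = min(max(center_x - out_w // 2, 0), frame_w - out_w)
    top = min(max(center_y - out_h // 2, 0), frame_h - out_h)
    return left, top, left + out_w, top + out_h
```

For example, with a 7680x4320 frame and a 3840x2160 output, a face near the top-left corner produces a window pinned to that corner rather than one extending outside the frame.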

The electronic device 100 may transmit the generated per-person content to the outside in real time. In doing so, the electronic device 100 may transmit each piece of generated per-person content through a different output port.

As described above, the content generation system 1000 according to this embodiment can generate a plurality of pieces of content (that is, per-person content) from only the captured image produced by a single photographing device and provide them to users. Because only one captured image is needed to generate per-person content, costs can be reduced, and because deep learning technology is used to generate the per-person content, higher-quality content can be produced more easily and quickly. In addition, because a deep learning technology suited to high-speed processing is used, per-person content can be generated in real time even from a high-resolution image.

Meanwhile, although FIG. 1 shows the photographing device 10 and the electronic device 100 as directly connected to each other, in an implementation the components may be connected via a separate external component. For example, the image captured by the photographing device 10 may be stored in real time in a recorder or server that stores the image first, and the electronic device 100 may receive the image from that recorder or server and generate per-person content.

In addition, although the electronic device 100 of FIG. 1 has been described as a single device, in an implementation it may be composed of a plurality of devices that divide the operations described above among themselves. For example, a first electronic device may receive the image and provide a second electronic device with the frame information needed for person detection; the second electronic device may detect faces from the provided frame information, generate coordinate information for each face, and provide it to the first electronic device; and the first electronic device may generate per-person content from the provided information.

In addition, although FIG. 1 has been described from the standpoint of a broadcasting station, that is, with the technology of the present disclosure applied in the process of producing content, the technology can also be used from the standpoint of consuming (or viewing) content. For example, while watching video content that includes a plurality of persons, a viewer may use the technology to watch only the region of a person designated by the viewer in one-shot form.

FIG. 2 is a diagram illustrating a detailed configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may include a communication device 110, a memory 120, a display 130, a manipulation input device 140, and a processor 150. Here, the electronic device 100 may be a PC, a notebook PC, a smartphone, a server, or the like capable of image processing.

The communication device 110 is connected to the photographing device 10 or a server 20 and may transmit and receive content. Specifically, the communication device 110 is provided to connect the electronic device 100 to external devices, and may connect to a mobile device through a local area network (LAN) and the Internet as well as through a USB (Universal Serial Bus) port.

In addition, the communication device 110 may be connected to other electronic devices not only by wire but also via a router or access point connected to the public Internet, and may be connected to the router or access point by wire or wirelessly, for example by Wi-Fi, Bluetooth, or cellular communication. Here, cellular communication may include, for example, LTE (Long-Term Evolution), LTE-A (LTE-Advanced), 5G-Advanced, NR (New Radio), CDMA (code division multiple access), WCDMA (wideband CDMA), UMTS (universal mobile telecommunications system), WiBro (Wireless Broadband), or GSM (Global System for Mobile Communications).

The communication device 110 may receive the captured image from the photographing device 10. The captured image may have any of various video formats. The captured image may be an image of a plurality of participants (such as hosts or panelists), but is not limited thereto and may be any image in which a plurality of persons appear, such as an opera stage, a play, or a sports event.

The communication device 110 may transmit the per-person content generated by the electronic device 100 to the outside. To receive the captured image and transmit the per-person content, the communication device 110 may have a plurality of input/output ports. For example, the input/output ports may be SDI (Serial Digital Interface), HDMI, DisplayPort (DP), or similar ports. For example, the communication device 110 may have eight or more input/output ports, in which case four ports may be used to receive an 8K-resolution image and each of the remaining four ports may transmit a separate 4K image. This number of ports assumes that the input image is 8K, the output images are 4K, and each input/output port has a communication speed of 12G; the number of ports may change depending on the port speed or the resolution of the images used. For ease of explanation, the following description assumes that four per-person images are generated, but in an implementation fewer than four, or five or more, per-person images may be generated and transmitted.
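The port count in the example above follows from simple pixel arithmetic, sketched below. The sketch assumes that one 12G port carries exactly one 4K stream and that input and output share the same frame rate and bit depth; these are assumptions made for illustration, as are the names `ports_needed` and `total_ports`.

```python
UHD_4K = (3840, 2160)
UHD_8K = (7680, 4320)


def ports_needed(resolution, per_port=UHD_4K):
    """Ports required to carry one stream of `resolution`, assuming each
    port carries one stream of `per_port` resolution."""
    pixels = resolution[0] * resolution[1]
    capacity = per_port[0] * per_port[1]
    return -(-pixels // capacity)  # integer ceiling division


def total_ports(input_resolution, output_resolution, output_count):
    """Input ports plus the ports for all per-person output streams."""
    return ports_needed(input_resolution) + output_count * ports_needed(output_resolution)
```

An 8K frame has four times the pixels of a 4K frame, so `total_ports(UHD_8K, UHD_4K, 4)` gives four input ports plus four output ports, matching the eight ports in the example.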

메모리(120)는 전자 장치(100)를 구동하기 위한 O/S나 인물별 콘텐츠를 생성하기 위한 소프트웨어, 딥 러닝 모델, 데이터 등을 저장하기 위한 구성요소이다. 메모리(120)는 RAM이나 ROM, 플래시 메모리, HDD, 외장 메모리, 메모리 카드 등과 같은 다양한 형태로 구현될 수 있으며, 어느 하나로 한정되는 것은 아니다.The memory 120 is a component for storing O/S for driving the electronic device 100 or software for generating content for each person, a deep learning model, data, and the like. The memory 120 may be implemented in various forms such as RAM, ROM, flash memory, HDD, external memory, memory card, etc., but is not limited thereto.

메모리(120)는 촬영 영상을 저장한다. 구체적으로, 메모리(120)는 통신 장치(110)를 통하여 수신한 촬영 영상을 저장할 수 있다. 그리고 메모리(120)는 영상 편집을 위하여 촬영 영상에서 추출된 복수의 프레임을 임시 저장할 수 있으며, 최종 생성된 인물별 콘텐츠를 저장할 수 있다. 여기서 촬영 영상은 MP4, AVI, MOV 등과 같은 동영상 파일일 수 있다.The memory 120 stores the captured image. Specifically, the memory 120 may store the captured image received through the communication device 110 . In addition, the memory 120 may temporarily store a plurality of frames extracted from a captured image for image editing, and may store the finally generated content for each person. Here, the captured image may be a moving picture file such as MP4, AVI, or MOV.

또한, 메모리(120)는 콘텐츠 생성 과정에서 산출되는 추출된 얼굴 이미지, 얼굴 좌표, 인물별 ID 정보(예를 들어, ID 및 각 ID 별 얼굴 특징값 정보)등을 저장할 수 있다.In addition, the memory 120 may store the extracted face image, face coordinates, and ID information for each person (eg, ID and facial feature value information for each ID) calculated in the content creation process.

In addition, the memory 120 may store deep learning models (or programs) for face image detection and face matching. Here, the deep learning model for face image detection may be a model that detects a face region in an image and generates coordinate information corresponding to the detected face region (for example, the vertex coordinates of the smallest rectangle that contains the entire face region). The model for face matching may be a model that extracts a feature value (unique value) from the image corresponding to a face region and compares the extracted values to determine whether two faces are the same.

디스플레이(130)는 전자 장치(100)가 지원하는 기능을 선택받기 위한 사용자 인터페이스 창을 표시한다. 구체적으로, 디스플레이(130)는 전자 장치(100)가 제공하는 각종 기능을 선택받기 위한 사용자 인터페이스 창을 표시할 수 있다. 이러한 디스플레이(130)는 LCD, CRT, OLED 등과 같은 모니터일 수 있으며, 후술할 조작 입력장치(140)의 기능을 동시에 수행할 수 있는 터치 스크린으로 구현될 수도 있다.The display 130 displays a user interface window for receiving a selection of a function supported by the electronic device 100 . Specifically, the display 130 may display a user interface window for receiving selections of various functions provided by the electronic device 100 . The display 130 may be a monitor such as LCD, CRT, or OLED, and may be implemented as a touch screen capable of simultaneously performing the functions of the manipulation input device 140 to be described later.

그리고 디스플레이(130)는 입력된 촬영 영상을 표시하거나, 생성된 인물별 콘텐츠를 표시할 수 있다. In addition, the display 130 may display an input captured image or may display generated content for each person.

조작 입력장치(140)는 사용자로부터 전자 장치(100)의 기능 선택 및 해당 기능에 대한 제어 명령을 입력받을 수 있다. 구체적으로, 조작 입력장치(140)는 산출된 궤적 정보를 수정하는 사용자 제어 명령을 입력받을 수 있다. 이러한 사용자 제어 명령은 상술한 사용자 인터페이스 창을 통하여 입력될 수 있다.The manipulation input device 140 may receive, from a user, a function selection of the electronic device 100 and a control command for the corresponding function. Specifically, the manipulation input device 140 may receive a user control command for correcting the calculated trajectory information. Such a user control command may be input through the user interface window described above.

프로세서(150)는 전자 장치(100) 내의 각 구성에 대한 제어를 수행한다. 구체적으로, 프로세서(150)는 사용자로부터 부팅 명령이 입력되면, 메모리(120)에 저장된 운영체제를 이용하여 부팅을 수행할 수 있다.The processor 150 controls each component in the electronic device 100 . Specifically, when a booting command is input by the user, the processor 150 may perform booting using the operating system stored in the memory 120 .

프로세서(150)는 촬영 영상이 입력되면, 입력된 촬영 영상에 포함된 복수의 프레임을 추출한다. 이때, 복수의 프레임은 촬영 영상에 포함된 모든 프레임일 수 있으며, 기설정된 시간 간격(예를 들어, 0.5초~1초)단위로 추출된 프레임일 수 있다. 추출된 프레임은 BMP, base 64로 인코딩된 JPG 등 이미지 포맷을 가질 수 있다.When a captured image is input, the processor 150 extracts a plurality of frames included in the input captured image. In this case, the plurality of frames may be all frames included in the captured image, and may be frames extracted in units of a preset time interval (eg, 0.5 seconds to 1 second). The extracted frame may have an image format such as BMP or base 64 encoded JPG.

In addition, the processor 150 may perform operations such as face detection using frames extracted at a preset time interval from among the frames of the input captured image. Specifically, tasks such as face detection require many resources, and the position of a participant in the image does not change much within a second. For example, if the deep learning recognition period is too short, the video may stutter, and if the period is too long, fast movements may not be tracked. It is therefore important to optimize the recognition period of the deep learning engine, and in the environment of the present disclosure a period between 0.5 and 1 second satisfied these conditions. However, this period is merely one embodiment and may be changed according to the performance of the electronic device 100 and the shooting environment (for example, when participants move a lot).
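The interval-based frame selection described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `sample_frame_indices` and its parameters are hypothetical, and the frame rate is assumed to be known and constant.

```python
def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 0.5) -> list[int]:
    """Return the indices of the frames handed to the face detector.

    One frame is picked per recognition interval (0.5 to 1 second in the
    described embodiment); all other frames skip detection.
    """
    if fps <= 0 or interval_s <= 0:
        raise ValueError("fps and interval_s must be positive")
    # Number of frames between two detector runs, at least 1.
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


# Two seconds of 60 fps video with a 0.5 s recognition interval.
print(sample_frame_indices(120, 60.0, 0.5))
```

A longer interval lowers the detector load but lets the crop lag behind fast movement, which is the trade-off the paragraph describes.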

In addition, the processor 150 may preprocess the extracted frames. Because recognition on a high-resolution image demands many resources and may lengthen processing time, preprocessing that reduces the data size of the extracted frame may be performed. Such preprocessing may include arranging the decoded image in [Channel, Width, Height] RGB format, converting it to a size of 960x544, and the like.
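As a rough illustration of this preprocessing, the sketch below downsizes a frame and rearranges it into planar [Channel, Width, Height] order. It is an assumption-laden example: the frame is modeled as a nested list of (R, G, B) tuples, nearest-neighbour sampling stands in for whatever resizing filter is actually used, and the function name `preprocess` is hypothetical. A real system would more likely use an array library.

```python
def preprocess(frame, out_w=960, out_h=544):
    """Resize an RGB frame and return it as planes indexed [channel][x][y].

    `frame` is a list of rows, each row a list of (r, g, b) tuples.
    Nearest-neighbour sampling is an assumption made for brevity.
    """
    in_h, in_w = len(frame), len(frame[0])
    # Nearest-neighbour resize: map each output pixel to a source pixel.
    resized = [
        [frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
    # Interleaved rows of pixels -> planar [Channel, Width, Height] layout.
    return [
        [[resized[y][x][c] for y in range(out_h)] for x in range(out_w)]
        for c in range(3)
    ]


# Tiny 4x2 frame reduced to 2x1 to keep the example readable.
tiny = [[(x, y, 7) for x in range(4)] for y in range(2)]
planes = preprocess(tiny, out_w=2, out_h=1)
print(len(planes), len(planes[0]), len(planes[0][0]))
```

Many detection frameworks expect [Channel, Height, Width] instead; the axis order here simply follows the wording of the description.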

그리고 프로세서(150)는 전처리된 영상을 이용하여 얼굴 검출 및 검출된 얼굴에 대한 인물 매칭, 해당 얼굴에 대한 좌표 검출 등을 수행할 수 있다. 이와 같은 동작에 대해서는 도 3을 참조하여 자세히 후술한다. In addition, the processor 150 may perform face detection, person matching with respect to the detected face, and coordinate detection of the corresponding face using the pre-processed image. Such an operation will be described later in detail with reference to FIG. 3 .

In addition, the processor 150 generates per-person content using the captured image and the previously generated person ID and coordinate information. Specifically, the processor 150 may generate the per-person content by cropping the image to a preset width and height around the center point of the coordinate information, so that the crop contains the entire image region indicated by that coordinate information.
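The centered crop can be sketched as below. The function name `crop_window`, the (x1, y1, x2, y2) box convention, and the clamping to the frame edges are assumptions added for illustration; the description only states that the crop has a preset size and is centered on the face coordinates.

```python
def crop_window(box, crop_w, crop_h, frame_w, frame_h):
    """Return (left, top, right, bottom) of a fixed-size crop centred on a face box.

    The window is shifted back inside the frame when the face is near an edge
    (an assumption; the source does not say how edges are handled).
    Requires crop_w <= frame_w and crop_h <= frame_h.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    left = min(max(cx - crop_w // 2, 0), frame_w - crop_w)
    top = min(max(cy - crop_h // 2, 0), frame_h - crop_h)
    return left, top, left + crop_w, top + crop_h


# A 4K crop taken from an 8K frame around a face near the centre.
print(crop_window((3800, 2000, 4000, 2300), 3840, 2160, 7680, 4320))
```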

At this time, the processor 150 may compare the previously detected coordinate information with the current coordinate information and, when the coordinate movement is equal to or greater than a preset value, generate the image content using the changed coordinate information. Specifically, if the crop region changes in response to every slight movement of a person, viewers may experience discomfort such as jitter in the generated content. Accordingly, the position of the cropped image may be changed only when the movement exceeds a certain level. Such an operation may be referred to as smooth tracking.
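A minimal sketch of this smooth tracking rule follows. The 40-pixel threshold, the use of Euclidean distance between box centers, and the name `smooth_track` are illustrative assumptions; the description only specifies that the crop moves when the movement reaches a preset value.

```python
def smooth_track(prev_center, new_center, threshold_px=40):
    """Keep the previous crop centre unless the face moved at least threshold_px.

    Small movements are ignored so the cropped output does not jitter.
    """
    dx = new_center[0] - prev_center[0]
    dy = new_center[1] - prev_center[1]
    moved = (dx * dx + dy * dy) ** 0.5
    return new_center if moved >= threshold_px else prev_center


print(smooth_track((100, 100), (110, 105)))  # small nod: centre stays put
print(smooth_track((100, 100), (160, 100)))  # real move: centre follows
```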

In addition, the processor 150 may generate not only content in which one image contains a single person but also content containing two or more persons, and may also generate an image in which the generated per-person content is embedded in picture-in-picture (PIP) form within an image different from the one used to generate that content.

In addition, the processor 150 may perform image processing on the generated per-person content. For example, as described above, the image of a person located at the center of the frame and the image of a person located at the periphery may differ in quality, so image processing such as focus correction may additionally be performed on the image content of a person located at the periphery.

이와 같이 본 실시 예에 따른 전자 장치(100)는 하나의 촬영 장치에서 생성한 촬영 영상만을 이용하여 복수의 콘텐츠(즉, 인물별 콘텐츠)를 생성하여 사용자들에게 제공하는 것이 가능하다. 또한, 본 실시 예에 따른 전자 장치(100)는 인물별 콘텐츠를 생성하는 데 있어서, 하나의 촬영 영상만 필요하다는 점에서 비용절감이 가능하며, 딥 러닝 기술을 이용하여 인물별 콘텐츠를 생성한다는 점에서 더욱 고품질의 콘텐츠를 보다 손쉽고 빠르게 생산하는 것이 가능하다.As described above, the electronic device 100 according to the present embodiment can generate a plurality of contents (ie, content for each person) using only a photographed image generated by one photographing device and provide it to users. In addition, the electronic device 100 according to the present embodiment can reduce costs in that only one captured image is required to generate content for each person, and generates content for each person using deep learning technology. It is possible to produce more high-quality content more easily and quickly.

한편, 도 1 및 도 2를 도시하고 설명함에 있어서, 얼굴 검출 결과로 프레임에서 얼굴 이미지를 추출하여 이용하는 것으로 설명하였으나, 구현시에는 별도의 얼굴 이미지 추출 동작 없이 얼굴 좌표 정보만을 이용하여, 원본 프레임과 생성된 얼굴 좌표 정보를 이용하여 클러스터링을 수행하는 것도 가능하다. 얼굴 좌표 정보는 사전 검출된 결과나 임의로 설정된 얼굴 좌표, 다른 얼굴 검출 프로그램에서 검출된 얼굴 좌표 정보 등을 활용할 수 있다.On the other hand, in the illustration and description of FIGS. 1 and 2, it has been described that a face image is extracted and used from a frame as a result of face detection, but in implementation, only face coordinate information is used without a separate face image extraction operation, and the original frame It is also possible to perform clustering using the generated face coordinate information. The face coordinate information may utilize a pre-detected result, arbitrarily set face coordinates, or face coordinate information detected in another face detection program.

Meanwhile, although FIG. 2 illustrates and describes the electronic device 100 as including a display and a manipulation input device, the display and the manipulation input device may be omitted when the electronic device 100 is configured as a server or the like. In addition, the electronic device 100 may further include components not shown in FIG. 2, and the functions of FIG. 2 may also be divided among a plurality of electronic devices.

도 3은 본 개시의 일 실시 예에 따른 콘텐츠 생성 방법을 설명하기 위한 도면이다. 3 is a view for explaining a content creation method according to an embodiment of the present disclosure.

도 3을 참조하면, 프로세서(200)는 I/O 모듈(210) 및 프로세싱 모듈(220)로 구성될 수 있다. Referring to FIG. 3 , the processor 200 may include an I/O module 210 and a processing module 220 .

The I/O module 210 may receive an image from the outside. Specifically, the I/O module 210 may convert the frames of the externally provided image into a preset file format (for example, base64 JPEG) and deliver them to the processing module 220. The I/O module may deliver every frame in the preset file format, or only the frames taken at a preset period. Alternatively, this conversion may be performed by the preprocessing unit 221 described below.

The processing module 220 may perform face detection using the image provided from the I/O module 210 and generate per-person content based on the detected faces. The processing module 220 may include a preprocessing unit 221, a face recognition unit 223, a unique value extraction unit 225, a face identification unit 227, and a content generation unit 229.

전처리부(221)는 입력된 이미지를 기설정된 형식 및 기설정된 크기로 변환할 수 있다. 여기서 기설정된 형식은 [Channel, Width, Height] RGB 형식, 기설정된 크기는 960x544일 수 있으나, 이는 얼굴 인식부에서 이용하는 딥 러닝 모델에 따라 변경될 수 있다. The preprocessor 221 may convert the input image into a preset format and a preset size. Here, the preset format may be [Channel, Width, Height] RGB format, and the preset size may be 960x544, but this may be changed according to the deep learning model used in the face recognition unit.

The face recognition unit 223 may recognize faces in the preprocessed image, generate a face image for each recognized face, and generate coordinate information corresponding to each face. Specifically, the face recognition unit 223 may input the preprocessed image to a face recognition model to obtain a list of face positions and probabilities, filter out low-probability regions from the list, and perform non-maximum suppression (NMS) to merge regions that overlap heavily, thereby producing the positions finally recognized as faces and the coordinates of those positions. The face recognition model used in the present disclosure can recognize faces in as little as 15 ms, so face recognition can be performed frame by frame (or in real time).
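The filtering and NMS step can be illustrated with the standard greedy form of non-maximum suppression. This is a sketch under assumptions: the 0.5 score threshold, the 0.4 IoU threshold, and the function names `iou` and `filter_detections` are not taken from the description, and the detector output is modeled as a list of ((x1, y1, x2, y2), score) pairs.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def filter_detections(candidates, min_score=0.5, iou_thresh=0.4):
    """Drop low-probability boxes, then apply greedy NMS.

    Boxes are visited from highest to lowest score; a box is kept only if it
    does not overlap an already kept box by iou_thresh or more.
    """
    confident = [c for c in candidates if c[1] >= min_score]
    kept = []
    for box, score in sorted(confident, key=lambda c: c[1], reverse=True):
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept


candidates = [
    ((10, 10, 110, 110), 0.95),
    ((12, 12, 112, 112), 0.80),    # near-duplicate of the first box
    ((300, 50, 380, 130), 0.90),
    ((500, 500, 520, 520), 0.20),  # low probability, filtered out
]
print(filter_detections(candidates))
```

Greedy NMS keeps the highest-scoring box of each overlapping group rather than averaging the boxes; whether the disclosed model merges boxes differently is not stated.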

In addition, the face recognition unit 223 may generate a face image by cropping the corresponding part of the input image using the coordinates of the recognized face. Because the generated face coordinates may fit tightly around the center of the face, the face image may be cropped with an extra margin (for example, 32 px).

고유값 추출부(225)는 앞서 생성된 얼굴 이미지를 고유값 추출 딥 러닝 모델을 이용하여 얼굴에 대한 고유값(또는 특징값)을 추출할 수 있다. 이와 같은 얼굴 고유 값은 512 크기의 벡터일 수 있다. The eigenvalue extractor 225 may extract an eigenvalue (or feature value) of a face using the eigenvalue extraction deep learning model from the previously generated face image. Such a face eigenvalue may be a 512-sized vector.

The face identification unit 227 compares the pre-stored unique value of each person with the unique value generated by the unique value extraction unit 225 to determine the person ID corresponding to the current face image. Specifically, the face identification unit 227 may calculate the similarity between each stored per-person unique value and the currently generated unique value, and match the detected faces to the stored persons in descending order of similarity.

At this time, each time a match is made, the face identification unit 227 may remove the similarities involving the matched person and face from the list. For example, suppose the stored persons are A, B, and C and the three currently recognized face images are 1, 2, and 3. The similarity between face image 1 and each of A, B, and C is compared first; if face image 1 is matched to C, the second matching step (that is, matching face image 2) may be performed against stored persons A and B only.

If, during this matching process, the similarity distance exceeds 1.5 or no existing person remains to be matched, the remaining face may be regarded as a new person, a new ID may be created for that person, and the unique value for that ID may be stored. Likewise, in the first recognition pass on an input image there are no pre-stored unique values, so a new ID may be assigned to each generated face image and the unique value for each new ID may be stored.
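The matching procedure above can be sketched as a greedy assignment over all face-person pairs. Several points are assumptions made for illustration: Euclidean distance is used as the similarity distance, IDs are consecutive integers, 2-element vectors stand in for the 512-element unique values, and the function name `assign_ids` is hypothetical.

```python
import math


def assign_ids(known, faces, max_dist=1.5):
    """Assign a person ID to each face vector, registering new persons as needed.

    known: dict mapping integer person ID -> stored unique-value vector
           (updated in place when a new person is registered).
    faces: list of unique-value vectors extracted from the current frame.
    Returns a list of IDs aligned with `faces`.
    """
    # All candidate pairs, closest (most similar) first.
    pairs = sorted(
        (math.dist(vec, emb), pid, i)
        for pid, vec in known.items()
        for i, emb in enumerate(faces)
    )
    ids = [None] * len(faces)
    used = set()
    for dist, pid, i in pairs:
        if dist > max_dist:
            break  # every remaining pair is too dissimilar
        if pid in used or ids[i] is not None:
            continue  # that person or face has already been matched
        ids[i] = pid
        used.add(pid)
    # Faces left unmatched are treated as new persons.
    next_id = max(known, default=-1) + 1
    for i, emb in enumerate(faces):
        if ids[i] is None:
            known[next_id] = list(emb)
            ids[i] = next_id
            next_id += 1
    return ids


known = {}
print(assign_ids(known, [(0.0, 0.0), (5.0, 5.0)]))              # first pass: all new
print(assign_ids(known, [(5.1, 5.0), (0.1, 0.0), (9.0, 9.0)]))  # two known, one new
```

Sorting all pairs once and skipping already used entries has the same effect as repeatedly picking the best remaining match and removing it from the list.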

콘텐츠 생성부(229)는 얼굴 식별부(227)에서 생성한 인물 ID 및 얼굴 좌표를 포함하는 인물 정보를 제공받고, 제공받은 인물 정보와 영상을 이용하여 인물별 영상을 생성할 수 있다. 이때, 인물 정보는 json 포맷일 수 있다. 예를 들어, 입력된 영상에 3명의 사람이 검출된 경우, 콘텐츠 생성부(229)는 검출된 3명의 사람 각각에 대응되는 인물별 콘텐츠를 생성할 수 있다. 또한, 콘텐츠 생성부(229)는 검출된 3명의 사람 중 복수의 사람(예를 들어, 2명)을 포함하는 영상 콘텐츠를 생성할 수도 있다. The content generating unit 229 may receive the person information including the person ID and the face coordinates generated by the face identification unit 227 , and may generate an image for each person using the provided person information and the image. In this case, the person information may be in a json format. For example, when three people are detected in the input image, the content generator 229 may generate content for each person corresponding to each of the three detected people. Also, the content generator 229 may generate image content including a plurality of people (eg, two people) among the three detected people.
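As an illustration of person information in JSON format, the sketch below serializes person IDs and face coordinates for one recognized frame. The field names `frame`, `persons`, `id`, and `box` are hypothetical; the description only states that the information contains a person ID and face coordinates and may be JSON.

```python
import json


def person_info_json(frame_index, persons):
    """Serialize (person_id, (x1, y1, x2, y2)) pairs for one recognized frame."""
    payload = {
        "frame": frame_index,
        "persons": [{"id": pid, "box": list(box)} for pid, box in persons],
    }
    return json.dumps(payload)


print(person_info_json(30, [(0, (10, 20, 110, 140)), (1, (400, 30, 500, 150))]))
```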

At this time, the content generation unit 229 may generate per-person content using the current person information until the person information of the next period is received. Alternatively, it may store the images within the preset period in a buffer and, for the other frames (hereinafter, general frames) lying between the two frames used to generate person information (hereinafter, face recognition frames), generate per-person content by linearly interpolating the face coordinates recognized in the face recognition frames. With linear interpolation, the crop region can be adjusted smoothly even when a person moves a lot.
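The linear interpolation between two face recognition frames can be sketched as follows. The function name `interpolate_box` and the rounding to whole pixels are assumptions for illustration.

```python
def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate face coordinates for a general frame.

    box_a and box_b are the (x1, y1, x2, y2) coordinates recognized at
    frame_a and frame_b; `frame` lies between those two recognition frames.
    """
    if frame_b <= frame_a:
        raise ValueError("frame_b must come after frame_a")
    t = (frame - frame_a) / (frame_b - frame_a)
    return tuple(round(a + (b - a) * t) for a, b in zip(box_a, box_b))


# Recognition ran on frames 0 and 30; frame 15 lies halfway between them.
print(interpolate_box((0, 0, 100, 100), (30, 60, 130, 160), 0, 30, 15))
```

Because both end points must be known, this variant requires buffering the frames of one recognition period, which adds that much latency to the output.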

Meanwhile, there may be cases in which no coordinate values exist for a specific person in the input image for a certain period of time. For example, three panelists may be participating, so that the input image should contain three faces, but one of the three may be temporarily away, or may be a guest who takes part only temporarily.

For example, in the case of a guest, image content for that guest may be generated only for the portion in which the guest's face is detected. A regular panelist, on the other hand, may temporarily leave, drink water, or lower their head, so that the panelist's face is not detected. In this case, even if the specific panelist (that is, the ID of that panelist) is not detected, the content generation unit 229 may continue generating image content for that person using the last confirmed coordinate information. In addition, if the person remains undetected for a certain time or longer, the individual content generation operation for that person may be terminated.
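The behaviour for temporarily undetected persons can be sketched as a small per-person state holder. The class name `PersonTrack` and the `max_missing` limit, counted in recognition cycles, are hypothetical; the description only says that the last coordinates are reused and that generation may end after a certain time.

```python
class PersonTrack:
    """Holds the last known face box of one person ID."""

    def __init__(self, max_missing=10):
        self.box = None
        self.missing = 0              # consecutive cycles without detection
        self.max_missing = max_missing

    def update(self, box):
        """Feed the detection result of one recognition cycle (None if absent).

        Returns the box to crop with, or None once the person has been
        missing for more than max_missing cycles.
        """
        if box is not None:
            self.box = box
            self.missing = 0
        else:
            self.missing += 1
            if self.missing > self.max_missing:
                self.box = None       # stop generating this person's content
        return self.box


track = PersonTrack(max_missing=2)
print(track.update((1, 2, 3, 4)))  # detected
print(track.update(None))          # head lowered: last box is reused
print(track.update(None))          # still reused
print(track.update(None))          # missing too long: generation ends
```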

한편, 도 3을 참조하면, 본 개시에서는 2개의 딥 러닝 모델을 개별적으로 이용하여 얼굴 감지 및 얼굴 특징점 추출을 수행하는 것으로 도시하였지만, 구현시에는 얼굴 감지 및 감지된 얼굴에 대한 특징점 추출을 하나의 딥 러닝 모델을 이용하는 형태로도 구현할 수 있다. On the other hand, referring to FIG. 3 , in the present disclosure, face detection and facial feature point extraction are performed using two deep learning models individually, but in implementation, face detection and feature point extraction for the detected face are performed in one It can also be implemented in the form of using a deep learning model.

한편, 상술한 바와 같은 동작은 병렬적(또는 파이프라인)으로 수행될 수 있으며, 그와 같은 동작에 대해서는 도 4를 참조하여 이하에서 설명한다. Meanwhile, the above-described operations may be performed in parallel (or pipeline), and such operations will be described below with reference to FIG. 4 .

도 4는 도 3의 콘텐츠 생성 방법을 보다 구체적으로 설명하기 위한 흐름도이다. 4 is a flowchart for describing the content creation method of FIG. 3 in more detail.

도 4를 참조하면, 먼저, 실시간으로 스트림을 입력받을 수 있다(S410). Referring to FIG. 4 , first, a stream may be input in real time ( S410 ).

그리고 입력받은 스트림을 구성하는 프레임별로 전처리를 수행할 수 있다(S420). 전처리 동작에 대해서는 앞서 설명하였는바, 중복 설명은 생략한다. In addition, pre-processing may be performed for each frame constituting the received stream (S420). Since the pre-processing operation has been described above, a redundant description thereof will be omitted.

전처리된 이미지를 이용하여 딥 러닝 파이프 라인을 수행한다(S430). 구체적으로, 해당 단계에서는 복수의 프레임에 대응되는 이미지에 대한 분석 동작이 파이프라인 형태로 수행될 수 있다. 예를 들어, 해당 파이프라인은 4단계로 구성될 수 있으며, 첫번째 단계에서는 N번째 프레임에 대한 이미지에 대한 얼굴 영역 탐지가 수행될 수 있다. 두번째 단계에서는 N-1번째 프레임에 대한 얼굴별 고유값 벡터를 생성하는 동작이 수행될 수 있다. 세번째 단계에서는 N-2번째 프레임에 대해서 기저장된 인물들의 고유값과 N-2번째 프레임에 대해서 산출된 고유값을 비교하는 동작이 수행될 수 있으며, 마지막 네번째 단계에서, N-3 번째 프레임에 포함되는 인물 ID 및 해당 인물에 대한 좌표값 등의 정보를 갖는 등장인물 목록이 생성될 수 있다. A deep learning pipeline is performed using the preprocessed image (S430). Specifically, in the corresponding step, an analysis operation on images corresponding to a plurality of frames may be performed in the form of a pipeline. For example, the pipeline may be composed of 4 stages, and in the first stage, face region detection may be performed on the image of the Nth frame. In the second step, an operation of generating an eigenvalue vector for each face for the N-1 th frame may be performed. In the third step, an operation of comparing the eigenvalues of people pre-stored for the N-2 th frame and the eigenvalues calculated for the N-2 th frame may be performed, and in the last fourth step, they are included in the N-3 th frame A list of characters having information such as a person ID and coordinate values for the corresponding person may be generated.
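The four-stage pipeline can be pictured as a schedule in which, at each tick, every stage works on a different frame. The sketch below only computes that schedule; the stage labels and the function name `pipeline_schedule` are illustrative, and actual concurrency (threads, processes, or GPU streams) is left out.

```python
STAGES = (
    "detect face regions",
    "extract unique-value vectors",
    "match against stored persons",
    "emit person list",
)


def pipeline_schedule(tick):
    """Map each stage to the frame index it processes at the given tick.

    Stage k works on frame tick - k, so frame N is in detection while frame
    N-1 is in extraction, N-2 in matching, and N-3 in person-list output.
    """
    return {stage: tick - k for k, stage in enumerate(STAGES) if tick - k >= 0}


for tick in range(5):
    print(tick, pipeline_schedule(tick))
```

Once the pipeline is full, one person list is completed per tick even though each frame takes four ticks from detection to output.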

한편, 이상에서는 한 프레임 단위로 상술한 동작이 수행되는 것으로 설명하였지만, 전자 장치의 리소스에 따라, 상술한 각 동작은 병렬적으로 수행될 수 있다. Meanwhile, although it has been described above that the above-described operations are performed in units of one frame, each of the above-described operations may be performed in parallel depending on the resource of the electronic device.

그리고 슬롯별 파이프라인을 수행한다(S440). 구체적으로, 이 동작은 앞서 설명한 딥 러닝 파이프 라인을 동작과 병렬적으로 수행될 수 있다. 구체적으로, 슬롯에 합성할 인물 목록을 가져오고, 인물의 딥 러닝 파이프라인이 완료된 경우, 해당 크롭 위치를 갱신하고, 갱신된 크롭 위치에 대해 크롭을 수행하고, 후처리를 수행할 수 있다. 그리고 슬롯에 크롭된 대상 인물들을 순서대로 합성하고, 완성된 슬롯을 저장할 수 있다. Then, a pipeline for each slot is performed (S440). Specifically, this operation can be performed in parallel with the operation of the deep learning pipeline described above. Specifically, a list of persons to be synthesized in a slot may be brought, and when the deep learning pipeline of the person is completed, the corresponding crop position may be updated, the updated crop position may be cropped, and post-processing may be performed. In addition, target characters cropped in the slot can be synthesized in order, and the completed slot can be stored.

마지막으로 개별 출력 슬롯을 가져오기를 수행한다(S450). 구체적으로, 개별 출력을 이어 붙여 멀티뷰 출력을 생성하거나, 개별 출력을 각각 송출하여 개별 콘텐츠를 출력할 수 있다. Finally, individual output slots are fetched (S450). Specifically, it is possible to generate a multi-view output by concatenating individual outputs, or output individual contents by transmitting individual outputs.

도 5는 본 개시의 다른 실시 예에 따른 콘텐츠 생성 방법을 설명하기 위한 도면이다. 5 is a view for explaining a content creation method according to another embodiment of the present disclosure.

도 5를 참조하면, 촬영 시스템은 복수의 전자 장치(100-1, 100-2) 및 녹화장치(50)를 포함한다. Referring to FIG. 5 , the imaging system includes a plurality of electronic devices 100 - 1 and 100 - 2 and a recording device 50 .

복수의 전자 장치(100-1, 100-2) 각각은 영상을 입력받고, 입력된 영상에 포함된 인물별 콘텐츠를 생성하고, 생성된 인물별 콘텐츠를 출력할 수 있다. Each of the plurality of electronic devices 100-1 and 100-2 may receive an image, generate content for each person included in the input image, and output the generated content for each person.

이때, 복수의 전자 장치(100-1, 100-2) 각각은 복수의 촬영 영상을 수신하고, 수신된 복수의 촬영 영상 각각 보다 해상도가 높은 스케일업된 영상을 생성하고, 생성된 스케일업된 영상을 이용하여, 인물별 콘텐츠를 출력할 수도 있다. In this case, each of the plurality of electronic devices 100 - 1 and 100 - 2 receives a plurality of captured images, generates a scaled-up image having a higher resolution than each of the received plurality of captured images, and generates the generated scaled-up image. By using , content for each person may be output.

녹화 장치(50)는 복수의 전자 장치(100-1, 100-2)의 입출력 포트에서 출력되는 복수의 콘텐츠를 수신하여, 저장할 수 있다. The recording apparatus 50 may receive and store a plurality of contents output from input/output ports of the plurality of electronic devices 100 - 1 and 100 - 2 .

한편, 도시된 예에서는 4명의 참여자에 대해서 2개의 카메라를 이용하는 것을 도시하였지만, 구현시에는 1개의 카메라를 이용하는 것도 가능하다. 또한, 상술한 예에서는 하나의 전자 장치가 4명의 콘텐츠를 생성하는 것으로 도시하고 설명하였지만, 전자 장치의 성능이 그 이상의 콘텐츠를 처리하는 것이 가능하다면, 하나의 전자 장치로 구현하는 것도 가능하다. 또한, 2개의 전자 장치가 아니라, 3대 이상의 전자 장치가 결합하여 9명 이상의 참여자에 대한 콘텐츠를 실시간으로 생성하는 것도 가능하다. Meanwhile, although the illustrated example shows that two cameras are used for four participants, it is possible to use one camera in implementation. In addition, in the above example, one electronic device has been illustrated and described as generating content for four people, but if the performance of the electronic device is capable of processing more content, it may be implemented with one electronic device. In addition, instead of two electronic devices, it is also possible to combine three or more electronic devices to generate content for nine or more participants in real time.

도 6은 기존의 촬영 환경에서의 카메라 구성과 본 개시에 따른 촬영 환경이 적용된 경우의 카메라 구성도를 비교한 도면이다. 6 is a diagram comparing the configuration of a camera in a conventional photographing environment with a camera configuration in a case in which a photographing environment according to the present disclosure is applied.

Referring to FIG. 6, the camera configuration 610 of a conventional shooting environment is the one used on the set of a program with six performers, in which a total of 11 cameras are used. Specifically, a wide-shot camera, a libero camera, and a jimmy jib camera are used, together with two cameras for three-shots and six cameras for one-shots.

Meanwhile, when the shooting environment according to the present disclosure is applied (620), the image for each performer can be generated from the footage of the three-shot cameras, so the six one-shot cameras need not be used. That is, as shown on the right, shooting is possible with only five cameras. In addition, because the wide-shot camera, libero camera, and jimmy jib are arranged as before, events and unexpected situations can still be handled without difficulty.

In this way, the number of cameras required to produce one program can be reduced from 11 to 5.

또한, 협소한 장소에서 촬영을 수행하는 경우, 기존과 같이 인물별로 개별적인 카메라로 촬영을 수행하는 경우, 카메라가 서로의 앵글이 겹치는 문제가 발생할 수 있다. 그러나 이러한 환경에서 본 개시와 같이 복수의 참여자를 하나의 카메라로 촬영하고, 촬영 영상에서 인물별 영상을 추출한다면 기존과 같이 앵글이 겹치는 문제를 방지할 수 있다. In addition, when photographing is performed in a narrow place, when photographing is performed with individual cameras for each person as in the past, a problem in which the angles of the cameras overlap each other may occur. However, in such an environment, as in the present disclosure, if a plurality of participants are photographed with a single camera and an image for each person is extracted from the photographed image, the problem of overlapping angles as in the prior art can be prevented.

도 7은 본 개시에 따라 생성된 콘텐츠의 예를 도시한 도면이다. 7 is a diagram illustrating an example of content generated according to the present disclosure.

Referring to FIG. 7, when a high-resolution image 710 is input, per-person images (or content) 710, 720, and 730 for the persons in the image may be generated. In addition, the generated images need not each contain only one person; an image 740 containing two or more persons may also be generated.

In addition, broadcast content can be easily produced using such individual images. These individual images may be output on their own, or may be processed into a form in which they are embedded in an image other than the current one, that is, a PIP form 760, for final broadcast.

도 8은 본 개시의 일 실시 예에 따른 콘텐츠 생성 방법을 설명하기 위한 도면이다. 8 is a view for explaining a content creation method according to an embodiment of the present disclosure.

Referring to FIG. 8, a captured image is first received in real time (S810). Specifically, frame images may be extracted from the captured image at a preset time interval, and a preset pre-processing operation may be performed on each extracted frame image. The pre-processing may include at least one of converting the image file format to a [Channel, Width, Height] RGB format and reducing the image resolution to a preset size.
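The pre-processing of step S810 can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function names, the use of nested Python lists for frames, the frame-count stand-in for the preset time interval, and the nearest-pixel downscaling are all assumptions made for the example.

```python
from typing import List

Frame = List[List[List[int]]]  # frame[y][x] -> [R, G, B] (hypothetical layout)


def sample_frames(frames: list, interval: int) -> list:
    """Keep every `interval`-th frame (a frame-count stand-in for the preset time interval)."""
    return frames[::interval]


def downscale(frame: Frame, factor: int) -> Frame:
    """Reduce the resolution by keeping every `factor`-th pixel in both directions."""
    return [row[::factor] for row in frame[::factor]]


def to_cwh(frame: Frame) -> Frame:
    """Rearrange frame[y][x][c] into out[c][x][y], i.e. [Channel, Width, Height]."""
    height, width = len(frame), len(frame[0])
    return [
        [[frame[y][x][c] for y in range(height)] for x in range(width)]
        for c in range(3)
    ]


def preprocess(frame: Frame, factor: int = 2) -> Frame:
    """Downscale first, then reorder the axes for the detection model."""
    return to_cwh(downscale(frame, factor))
```

Downscaling before the axis reorder keeps the amount of data that has to be rearranged small, which matters when every sampled frame of a high-resolution stream passes through this step.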

Then, using a pre-trained face detection deep learning model, a face included in the input captured image is detected, a face image corresponding to the detected face is generated, and coordinate information of the face image is determined (S820). Specifically, the input captured image may be fed to the pre-trained face detection deep learning model to obtain a list of face positions and a probability for each position; face positions with a probability lower than a preset value are filtered out, and non-maximum suppression (NMS) is performed to merge faces whose positions overlap, thereby detecting the faces in the captured image.
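The post-processing of the detector output in step S820 can be illustrated with a short sketch. It assumes the model returns axis-aligned boxes with a probability each, and it uses the common greedy form of NMS, in which the highest-scoring box is kept and overlapping lower-scoring boxes are discarded. The threshold values and the function names are hypothetical and are not taken from the disclosure.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Candidate = Tuple[Box, float]            # (box, probability)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detect_faces(
    candidates: List[Candidate],
    min_score: float = 0.5,      # assumed "preset value" for the probability filter
    iou_threshold: float = 0.4,  # assumed overlap limit for NMS
) -> List[Candidate]:
    """Drop low-probability candidates, then apply greedy NMS."""
    confident = [c for c in candidates if c[1] >= min_score]
    kept: List[Candidate] = []
    for box, score in sorted(confident, key=lambda c: c[1], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept
```

For example, two candidates that cover nearly the same face collapse into the one with the higher probability, while a distant face is kept as a separate detection.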

Then, a unique value for the face image is extracted, and the extracted unique value is compared with the pre-stored unique value for each person to determine a face ID for the face image (S830).
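The comparison in step S830 can be sketched as a nearest-match lookup. The disclosure does not specify the form of the unique value or the comparison metric; the sketch below assumes the unique value is a numeric feature vector and uses cosine similarity with a hypothetical acceptance threshold. The names `gallery` and `assign_face_id` are illustrative only.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_face_id(
    unique_value: List[float],
    gallery: Dict[str, List[float]],  # pre-stored unique value per person
    min_similarity: float = 0.8,      # assumed acceptance threshold
) -> Optional[str]:
    """Return the ID of the most similar stored person, or None if nobody is close enough."""
    best_id: Optional[str] = None
    best_sim = min_similarity
    for face_id, stored in gallery.items():
        sim = cosine_similarity(unique_value, stored)
        if sim >= best_sim:
            best_id, best_sim = face_id, sim
    return best_id
```

Returning `None` for a face that matches nobody lets the caller skip content generation for people who are not registered, such as audience members in the background.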

Then, image content for each of the plurality of faces is generated in real time using the face ID, the coordinate information, and the captured image (S840). Specifically, within a preset time period, the image content for each of the plurality of faces may be generated in real time using the already generated face ID and coordinate information.
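One way to picture step S840 is as cutting a fixed-size window around each identified face out of the high-resolution frame. The sketch below is an assumption about how such a crop could be computed, not the method of the disclosure: the window size, the centring rule, and the function names are hypothetical, and the output size is assumed to be no larger than the frame.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_window(face: Box, frame_w: int, frame_h: int, out_w: int, out_h: int) -> Box:
    """Return an out_w x out_h window centred on the face and clamped to the frame."""
    cx = (face[0] + face[2]) // 2
    cy = (face[1] + face[3]) // 2
    # Clamp so the window never leaves the frame (assumes out_w <= frame_w, out_h <= frame_h).
    left = min(max(cx - out_w // 2, 0), frame_w - out_w)
    top = min(max(cy - out_h // 2, 0), frame_h - out_h)
    return (left, top, left + out_w, top + out_h)


def crop(frame: List[list], window: Box) -> List[list]:
    """Cut the window out of a frame stored as rows of pixels (frame[y][x])."""
    left, top, right, bottom = window
    return [row[left:right] for row in frame[top:bottom]]


def per_face_windows(
    person_list: List[Tuple[str, Box]],
    frame_w: int,
    frame_h: int,
    out_w: int,
    out_h: int,
) -> Dict[str, Box]:
    """Map each face ID in the person list to its crop window."""
    return {
        face_id: crop_window(box, frame_w, frame_h, out_w, out_h)
        for face_id, box in person_list
    }
```

Because the windows depend only on the face ID and coordinate information, they can be reused for every frame within the preset time period and recomputed only when new coordinates arrive.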

The image content for each of the plurality of faces may then be transmitted to a different output port.

In addition, merged content may be generated by merging at least two of the generated image contents for the plurality of faces.
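The disclosure does not state how two image contents are merged. As one possible reading, the sketch below places two per-person images of equal height side by side; the function name and the row-list image layout are assumptions for illustration.

```python
from typing import List


def merge_side_by_side(left_img: List[list], right_img: List[list]) -> List[list]:
    """Concatenate two images (lists of pixel rows) horizontally.

    Both images must have the same number of rows; this is an
    illustrative layout choice, not the merging rule of the disclosure.
    """
    if len(left_img) != len(right_img):
        raise ValueError("images must have the same height to be merged")
    return [l_row + r_row for l_row, r_row in zip(left_img, right_img)]
```

Other layouts, such as a single wider crop that contains both people, would satisfy the same description; the side-by-side form is simply the easiest to show in a few lines.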

Accordingly, the content generation method according to the present embodiment can generate a plurality of contents (that is, per-person content) using only the captured image produced by a single imaging device and provide them to users. The content generation method of FIG. 8 may be executed on an electronic device having the configuration of FIG. 2, and may also be executed on electronic devices having other configurations.

In addition, the content generation method described above may be implemented as a program including an executable algorithm that can be run on a computer, and the program may be stored in and provided on a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, programs for performing the various methods described above may be stored in and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB, memory card, or ROM.

In addition, although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described above. Various modifications may of course be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the technical spirit or prospect of the present disclosure.

1000: content generation system 10: imaging device
20: display device 100: electronic device
110: communication device 120: memory
130: display 140: operation input device
150: processor

Claims

An electronic device comprising:
a communication device configured to receive, in real time, a stream corresponding to a captured image; and
a processor configured to perform a pre-processing operation on frames constituting the input stream, generate a face ID and coordinate information for each of a plurality of face regions included in the frames using the pre-processed frame images and a deep learning pipeline, and generate image content for each of the plurality of faces in real time using the generated face ID and coordinate information and the captured image,
wherein the deep learning pipeline:
for an N-th frame, detects a face included in the input captured image using a pre-trained face detection deep learning model, and extracts a face image corresponding to the detected face and coordinate information of the face image;
for an (N-1)-th frame, extracts a unique value of the face image;
for an (N-2)-th frame, compares the extracted unique value with the pre-stored unique value for each person; and
for an (N-3)-th frame, generates a person list including the face IDs and coordinate information included in the (N-3)-th frame.

(Deleted)

The electronic device according to claim 1,
wherein the pre-processing operation
comprises at least one of converting an image file format to a [Channel, Width, Height] RGB format and reducing an image resolution to a preset size.

The electronic device according to claim 1,
wherein the processor
inputs the input captured image to the pre-trained face detection deep learning model to obtain list information on face positions and a probability for each face position, filters out face positions having a probability lower than a preset value, and performs non-maximum suppression (NMS) to merge faces whose positions mutually overlap, thereby detecting the faces in the captured image.

The electronic device according to claim 1,
wherein the processor
compares previously detected coordinate information with the coordinate information, and generates the image content using the changed coordinate information when the movement of the coordinate information is equal to or greater than a preset value.

The electronic device according to claim 1,
wherein the processor
controls the communication device so that the image content for each of the plurality of faces is transmitted to a different output port.

The electronic device according to claim 1,
wherein the processor
generates merged content by merging at least two of the generated image contents for the plurality of faces.

A content generation method comprising:
receiving, in real time, a stream corresponding to a captured image;
performing a pre-processing operation on frames constituting the input stream, and generating a face ID and coordinate information included in the frames using the pre-processed frame images and a deep learning pipeline; and
generating image content for each detected face in real time using the face ID, the coordinate information, and the captured image,
wherein the deep learning pipeline:
for an N-th frame, detects a face included in the input captured image using a pre-trained face detection deep learning model, and extracts a face image corresponding to the detected face and coordinate information of the face image;
for an (N-1)-th frame, extracts a unique value of the face image;
for an (N-2)-th frame, compares the extracted unique value with the pre-stored unique value for each person; and
for an (N-3)-th frame, generates a person list including the face IDs and coordinate information included in the (N-3)-th frame.

(Deleted)

The method of claim 9,
wherein the pre-processing operation
comprises at least one of converting an image file format to a [Channel, Width, Height] RGB format and reducing an image resolution to a preset size.

The method of claim 9,
wherein the deep learning pipeline
inputs the input captured image to the pre-trained face detection deep learning model to obtain list information on face positions and a probability for each face position, filters out face positions having a probability lower than a preset value, and performs non-maximum suppression (NMS) to merge faces whose positions mutually overlap, thereby detecting the faces in the captured image.

The method of claim 9,
wherein the generating step
comprises comparing the immediately preceding detected coordinate information with the coordinate information, and generating the image content using the changed coordinate information when the movement of the coordinate information is equal to or greater than a preset value.

The method of claim 9,
further comprising transmitting the image content for each of the detected faces to a different output port.

The method of claim 9,
further comprising generating merged content by merging at least two of the generated image contents for the plurality of faces.

A computer-readable recording medium comprising a program for executing a content generation method,
wherein the content generation method comprises:
receiving, in real time, a stream corresponding to a captured image;
performing a pre-processing operation on frames constituting the input stream, and generating a face ID and coordinate information included in the frames using the pre-processed frame images and a deep learning pipeline; and
generating image content for each detected face in real time using the face ID, the coordinate information, and the captured image,
wherein the deep learning pipeline:
for an N-th frame, detects a face included in the input captured image using a pre-trained face detection deep learning model, and extracts a face image corresponding to the detected face and coordinate information of the face image;
for an (N-1)-th frame, extracts a unique value of the face image;
for an (N-2)-th frame, compares the extracted unique value with the pre-stored unique value for each person; and
for an (N-3)-th frame, generates a person list including the face IDs and coordinate information included in the (N-3)-th frame.

(Deleted)

The computer-readable recording medium of claim 17,
wherein the generating step
comprises comparing the immediately preceding detected coordinate information with the coordinate information, and generating the image content using the changed coordinate information when the movement of the coordinate information is equal to or greater than a preset value.

The computer-readable recording medium of claim 17,
wherein the content generation method
further comprises transmitting the image content for each of the plurality of faces to a different output port.