KR20210004195A

KR20210004195A - Integrated image generation apparatus of representative face using face feature embedding

Info

Publication number: KR20210004195A
Application number: KR1020190080225A
Authority: KR
Inventors: 박혜림; 조우진
Original assignee: 박혜림
Priority date: 2019-07-03
Filing date: 2019-07-03
Publication date: 2021-01-13

Abstract

The present invention relates to an integrated image output device of a representative face using face feature embedding. To this end, the integrated image output device comprises: an image segment generating module for receiving original image data, which is an original image desired to be reprocessed for each person, and dividing the original image data into a plurality of pieces through scene change detection to generate a plurality of image segments; a face clustering module for receiving the image segments from the image segment generating module, and generating representative face data which is a face image of a representative person included in the original image data by clustering a face of a person in each of the image segments; and an image integration module for receiving selected face data, which is information on an image of the person desired to be reprocessed by a user, and integrally merging the image segments including the selected face data to generate integrated image data.

Description

An integrated image generation apparatus of representative face using face feature embedding

The present invention relates to an integrated image output apparatus for a representative face using facial feature embedding.

Recently, demand has been growing not for consuming video content directly but for reprocessing it around individual people and then sharing and distributing it. For example, K-POP idol fans re-edit existing broadcast footage member by member to produce short video clips, and broadcasters themselves offer services such as per-person summaries and separately produced highlight videos.

Until now, however, most of this work has relied on individuals manually finding and editing the frames in which a specific person appears, which is very inefficient. Technologies such as face recognition/identification could be used for person-centered content reprocessing, but in most cases classifying a specific person requires a large amount of data on that person, so they are difficult to apply to video content in which many people appear.

(Patent Document 1) Republic of Korea Patent Publication No. 10-2019-0021130, Face image-based similar image detection method and device, Samsung Electronics Co., Ltd.

Accordingly, an object of the present invention is to provide a high-speed image extraction apparatus using a face clustering technique that, according to the user's selection, automatically edits and provides only the video in which the desired person appears, based on a face clustering technique built on a machine learning algorithm trained on a face dataset composed of a small number of people.

Hereinafter, specific means for achieving the object of the present invention will be described.

The object of the present invention may be achieved by providing a high-speed image extraction apparatus using a face clustering technique, comprising: an image segment generation module that receives original image data, which is the original video to be reprocessed for each person, and divides the original image data into a plurality of pieces through scene change detection to generate a plurality of image segments; a face clustering module that receives the plurality of image segments from the image segment generation module and clusters the faces of the people in each image segment to generate representative face data, which is a face image of a representative person included in the original image data; and an image integration module that receives, from a user client, selected face data, which is information on the image of the person the user wants reprocessed, and merges the image segments containing the selected face data (video merging) to generate integrated image data, wherein the apparatus automatically edits and outputs only the video in which the desired person appears, according to the user's selection, based on a face clustering technique built on a machine learning algorithm trained on a face dataset composed of a small number of people.

As described above, the present invention has the following effects.

First, according to an embodiment of the present invention, the process in which a person manually identified and edited the people appearing in every frame is provided as a semi-automatic service, improving the efficiency of various person-centered video reprocessing tasks.

Second, according to an embodiment of the present invention, a network trained on a small-scale person dataset can be used to accurately distinguish a large number of people and provide the result to the user.

Third, according to an embodiment of the present invention, it becomes easier for K-POP idol fans to re-edit existing broadcasts, self-shot videos, and the like around individual members and share them on social media, and easier for major broadcasters to distribute person-centered re-edits of program highlight and summary videos. In addition, in sports, it becomes easier to extract each player's performance from the full game footage and provide it to consumers player by player, and in long recordings such as CCTV footage, the people who appear can be distinguished individually to find the point at which a desired person (e.g., a criminal whose face has been made public) appears.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further the understanding of its technical idea; the present invention should therefore not be construed as limited to the matters shown in the drawings.
FIG. 1 is a schematic diagram showing a high-speed image extraction apparatus using a face clustering technique according to an embodiment of the present invention;
FIG. 2 is a schematic diagram showing the configuration of the face clustering module 20 according to an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the configuration of the clustering module 24 according to an embodiment of the present invention.

Hereinafter, embodiments that a person of ordinary skill in the art to which the present invention pertains can easily carry out will be described in detail with reference to the accompanying drawings. However, in describing the operating principle of the preferred embodiments in detail, a detailed description of a related known function or configuration is omitted where it is judged that it could unnecessarily obscure the subject matter of the present invention.

In addition, the same reference numerals are used throughout the drawings for parts having similar functions and operations. Throughout the specification, when a part is said to be connected to another part, this includes not only a direct connection but also an indirect connection with another element interposed between them. Likewise, stating that a part includes a component does not exclude other components, unless specifically stated otherwise, but means that other components may also be included.

High-speed image extraction apparatus using face clustering technique

FIG. 1 is a schematic diagram showing a high-speed image extraction apparatus using a face clustering technique according to an embodiment of the present invention. As shown in FIG. 1, the high-speed image extraction apparatus 1 using a face clustering technique according to an embodiment of the present invention may include an image segment generation module 10, a face clustering module 20, and an image integration module 30. The apparatus 1 may be configured to be processed by a processing module of a computing device such as a specific web server, a virtual server such as a cloud server, a smartphone, a tablet PC, or a desktop PC, and to be stored in the memory module of each device.

The image segment generation module 10 receives original image data 100, which is the original video to be reprocessed for each person, and divides the original image data 100 into a plurality of pieces through scene change detection to generate a plurality of image segments.
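The specification does not say which scene change detection method the image segment generation module 10 uses. As a minimal sketch only, assuming frames are available as flattened grayscale pixel lists and that a cut is declared when the intensity histograms of consecutive frames differ by more than a threshold, the segmentation step could look like the following. The names `histogram` and `split_into_segments`, the bin count, and the threshold are all hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[int]  # flattened grayscale frame, pixel values 0..255 (assumed format)


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized intensity histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = float(len(frame))
    return [c / total for c in counts]


def split_into_segments(frames: Sequence[Frame], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, one pair per segment."""
    if not frames:
        return []
    segments = []
    start = 0
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        # L1 distance between histograms lies in [0, 2]; a large jump suggests a cut.
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            segments.append((start, index))
            start = index
        previous = current
    segments.append((start, len(frames)))
    return segments


# Three dark frames followed by two bright frames: one cut is expected at index 3.
dark = [10] * 64
bright = [240] * 64
print(split_into_segments([dark, dark, dark, bright, bright]))  # [(0, 3), (3, 5)]
```

Each returned index pair corresponds to one image segment 110 that would be passed on to the face clustering module 20.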

The face clustering module 20 receives the image segments from the image segment generation module 10 and clusters the faces of the people in each image segment to generate representative face data 200, which is a face image of each representative person included in the entire original image data 100. The generated representative face data 200 may be transmitted via the web or an app to a user client such as a smartphone, tablet, desktop, or laptop. The face clustering module 20 according to an embodiment of the present invention may be implemented by face clustering based on a machine learning algorithm; it clusters the various faces detected throughout the video by person and provides the user with a representative picture of each person.

The image integration module 30 receives, from a user client, selected face data 300, which is information on the image of the person the user wants reprocessed, and merges the image segments containing the selected face data 300 (video merging) to generate integrated image data 310. The generated integrated image data 310 may be transmitted (including by streaming) via the web or an app to a user client such as a smartphone, tablet, desktop, or laptop.
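The merging step can be illustrated with a short sketch. It assumes, hypothetically, that each image segment is a frame range and that the face clustering stage has already recorded which person clusters appear in which segment; `merge_segments_for_person` and the data layout are illustrative and do not come from the specification.

```python
from typing import Dict, List, Set, Tuple

Segment = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def merge_segments_for_person(
    segments: List[Segment],
    people_in_segment: Dict[int, Set[int]],
    selected_person: int,
) -> List[Segment]:
    """Keep segments containing the selected person and join adjacent ones."""
    merged: List[Segment] = []
    for index, (start, end) in enumerate(segments):
        if selected_person not in people_in_segment.get(index, set()):
            continue
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)  # contiguous: extend the previous range
        else:
            merged.append((start, end))
    return merged


segments = [(0, 30), (30, 75), (75, 120), (120, 200)]
people = {0: {0, 1}, 1: {0}, 2: {1}, 3: {0, 2}}
print(merge_segments_for_person(segments, people, selected_person=0))  # [(0, 75), (120, 200)]
```

The resulting frame ranges would then be cut from the original video and concatenated in order to form the integrated image data 310.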

Regarding the specific configuration of the face clustering module 20, FIG. 2 is a schematic diagram showing the configuration of the face clustering module 20 according to an embodiment of the present invention. As shown in FIG. 2, the face clustering module 20 according to an embodiment of the present invention may include a face detection module 21, a landmark detection module 22, a standardization module 23, and a clustering module 24.

The face detection module 21 detects the face region in the received image segment 110 and generates face detection data (for example, a bounding box). The face detection module 21 according to an embodiment of the present invention may use a face detection algorithm obtained by fine-tuning YOLO, RCNN, Faster RCNN, or the like. Alternatively, it may use a face detection algorithm obtained by fine-tuning a network such as AlexNet pre-trained on ImageNet. It may also use a conventional computer vision algorithm, such as the Viola-Jones method, which applies boosting to Haar-like features.

The landmark detection module 22 detects the landmarks of the corresponding face based on the face detection data generated by the face detection module 21 and generates landmark data. The landmark detection module 22 according to an embodiment of the present invention may be built on a cascaded CNN-based architecture or an architecture that includes an autoencoder.

The standardization module 23 standardizes the input face detection data based on the landmark data generated by the landmark detection module 22 and generates standardized face data.
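The specification does not describe how the standardization is computed. One common approach, shown here purely as an assumption, is to derive a similarity transform (rotation, uniform scale, translation) that moves the two eye landmarks to fixed positions in the standardized frame. The target coordinates below are hypothetical values chosen to fit a 47x55 face image.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def similarity_from_eyes(
    left_eye: Point,
    right_eye: Point,
    target_left: Point = (15.0, 20.0),   # hypothetical eye position in the 47x55 frame
    target_right: Point = (32.0, 20.0),  # hypothetical eye position in the 47x55 frame
) -> Callable[[Point], Point]:
    """Return a function mapping source coordinates to the standardized frame."""
    p1, p2 = complex(*left_eye), complex(*right_eye)
    q1, q2 = complex(*target_left), complex(*target_right)
    if p1 == p2:
        raise ValueError("eye landmarks must be distinct")
    a = (q2 - q1) / (p2 - p1)  # rotation and uniform scale as one complex factor
    b = q1 - a * p1            # translation

    def transform(point: Point) -> Point:
        z = a * complex(*point) + b
        return (z.real, z.imag)

    return transform


transform = similarity_from_eyes(left_eye=(100.0, 120.0), right_eye=(134.0, 120.0))
print(transform((100.0, 120.0)))  # the left eye lands on target_left
print(transform((117.0, 140.0)))  # a point below the eyes, mapped into the same frame
```

Applying the same transform to every pixel coordinate of the detected face region would yield an aligned crop comparable across frames, which is the property the clustering module 24 relies on.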

The clustering module 24 clusters the various faces detected throughout the video by person, based on the standardized face data generated by the standardization module 23, and generates and outputs representative face data 200, which is a representative picture of each person.
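To make the grouping step concrete, the sketch below clusters face embedding vectors greedily by Euclidean distance and picks, for each cluster, the face closest to the centroid as its representative. The greedy rule, the threshold, and the function names are assumptions for illustration; the specification does not fix a particular clustering algorithm.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors: List[Vector]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_faces(embeddings: List[Vector], threshold: float = 0.5) -> List[List[int]]:
    """Greedy clustering: join the nearest cluster centroid if it is close enough."""
    clusters: List[List[int]] = []
    for index, vector in enumerate(embeddings):
        best, best_distance = None, threshold
        for cluster_id, members in enumerate(clusters):
            d = distance(vector, mean([embeddings[m] for m in members]))
            if d < best_distance:
                best, best_distance = cluster_id, d
        if best is None:
            clusters.append([index])  # no cluster is close enough: start a new person
        else:
            clusters[best].append(index)
    return clusters


def representative(embeddings: List[Vector], members: List[int]) -> int:
    """Index of the member closest to the cluster centroid."""
    centroid = mean([embeddings[m] for m in members])
    return min(members, key=lambda m: distance(embeddings[m], centroid))


# Toy 2-D embeddings: faces 0, 1, 3 belong to one person and faces 2, 4 to another.
faces = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.05, 0.05), (1.1, 0.9)]
clusters = cluster_faces(faces)
print(clusters)  # [[0, 1, 3], [2, 4]]
print(representative(faces, clusters[0]))  # 3
```

Because the grouping depends only on distances between embeddings, it needs no label for any person, which is why faces absent from the training data can still be separated.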

Regarding its specific configuration, the clustering module 24 according to an embodiment of the present invention may be configured in the form of supervised learning and may be built from various machine learning algorithms such as linear/logistic regression, support vector machines, multi-layer perceptrons, naive Bayesian classification, random forest classification, and artificial neural networks. For convenience of explanation, the following describes an example in which the clustering module 24 is configured as an artificial neural network. Hereinafter, the convolution layer may be referred to interchangeably as "CONV layer" or "Conv. layer", and the convolutional neural network as "ConvNet" or "CNN".

FIG. 3 is a schematic diagram showing the configuration of the clustering module 24 according to an embodiment of the present invention. As shown in FIG. 3, the clustering module 24 according to an embodiment of the present invention may be configured as an artificial neural network module including a convolution layer 241, a pooling layer 242, and a face clustering layer 243.

According to an embodiment of the present invention, the standardized face data 230 input to the clustering module 24 may have a width of 47, a height of 55, and RGB channels, in which case its size is [47x55x3]. A convolution filter (Conv. filter) is connected to a local region of the input standardized face data 230 and computes the dot product between that region and its own weights; the kernel size may be [4x4x3]. The resulting volume, the convolution layer (Conv. layer, 241), has a size such as [44x52x20]. The RELU layer applies an element-wise activation function such as max(0,x). It does not change the size of the volume ([44x52x20]) and produces an activation map. The pooling layer 242 downsamples along the width and height dimensions and outputs a reduced volume (activation map) such as [22x26x20]. A second convolution layer, second pooling layer, third convolution layer, third pooling layer, and fourth convolution layer of progressively greater depth then follow, and the face clustering layer 243, an output layer with n nodes, may be connected directly to the fourth convolution layer.
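The volume sizes quoted above can be checked with the standard output-size arithmetic for convolution and pooling. The sketch assumes 20 filters, stride 1, and no padding for the first convolution and a 2x2 pooling window with stride 2; these settings are not stated in the text but are the simplest ones that reproduce [44x52x20] and [22x26x20].

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (width, height, depth)


def conv_output(shape: Shape, kernel: int, filters: int, stride: int = 1, padding: int = 0) -> Shape:
    """Output size of a convolution with a square kernel."""
    width, height, _ = shape
    out_w = (width - kernel + 2 * padding) // stride + 1
    out_h = (height - kernel + 2 * padding) // stride + 1
    return (out_w, out_h, filters)


def pool_output(shape: Shape, window: int = 2, stride: int = 2) -> Shape:
    """Output size of a pooling layer; depth is unchanged."""
    width, height, depth = shape
    return ((width - window) // stride + 1, (height - window) // stride + 1, depth)


volume = (47, 55, 3)                                # standardized face data 230
volume = conv_output(volume, kernel=4, filters=20)  # assumed: 20 filters, stride 1, no padding
print(volume)  # (44, 52, 20)
volume = pool_output(volume)                        # assumed: 2x2 window, stride 2
print(volume)  # (22, 26, 20)
```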

In the clustering module 24 according to an embodiment of the present invention, unlike a typical ConvNet structure, the face clustering layer 243, which is the output layer, is connected directly to the last convolution layer. That is, the clustering module 24 uses a face identification network trained on a face dataset of about n people only up to the face feature embedding step, not its final identification result. Accordingly, even if a face not included in the training data appears in the target video, the subjects can still be separated into distinct persons A and B, although who each person is remains unknown, which makes the network applicable to the system of the present invention.

In a conventional ConvNet, n fully-connected (FC) layers follow the last convolution layer or the last pooling layer. The FC layer 105 computes class scores and outputs a volume (output layer, 106) of size [1x1x10]. The FC layer is connected to every element of the previous volume and is responsible for the final identification.

In this way, the ConvNet of the clustering module 24 according to an embodiment of the present invention passes the original image made up of pixel values (standardized face data 230) through each layer and embeds it as a specific vector, which makes it possible to generate face cluster data. Some layers have parameters and others do not. In particular, CONV layers are activation functions that depend not only on the input volume but also on weights and biases, whereas RELU/POOL layers are fixed functions. The parameters of the CONV layers are trained by gradient descent so that the class score for each image matches that image's label.

The parameters of a CONV layer consist of a set of learnable filters. Each filter is small along the width and height dimensions but spans the full depth of the input. During the forward pass, each filter is slid (more precisely, convolved) across the width and height of the input volume to produce a two-dimensional activation map. As the filter slides over the input, a dot product is computed between the filter and the input volume. Through this process, the ConvNet learns filters that activate in response to a specific pattern at a specific position in the input data. Stacking these activation maps along the depth dimension gives the output volume. Each element of the output volume therefore covers only a small region of the input, and neurons in the same activation map share the same parameters because they result from applying the same filter.
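The forward pass described above can be written out directly. This is a deliberately naive sketch for a single filter with stride 1 and no padding; the function name `convolve` and the nested-list data layout are illustrative choices, not taken from the specification.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [row][column][channel]


def convolve(volume: Volume, kernel: Volume, bias: float = 0.0) -> List[List[float]]:
    """Slide one filter over the input (stride 1, no padding) and return its activation map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(volume) - k_rows + 1
    out_cols = len(volume[0]) - k_cols + 1
    activation_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = bias
            for i in range(k_rows):
                for j in range(k_cols):
                    # The filter spans the full depth of the input volume.
                    for d, weight in enumerate(kernel[i][j]):
                        total += weight * volume[r + i][c + j][d]
            row.append(total)
        activation_map.append(row)
    return activation_map


# 3x3 input with one channel and a 2x2 filter that sums its window.
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
summing_filter = [[[1.0], [1.0]],
                  [[1.0], [1.0]]]
print(convolve(image, summing_filter))  # [[12.0, 16.0], [24.0, 28.0]]
```

Running the same function once per filter and stacking the resulting maps along the depth axis gives the output volume; every position in one map reuses the same weights, which is the parameter sharing the paragraph refers to.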

According to an embodiment of the present invention, because applying the chain rule in backpropagation gives rise to the vanishing gradient problem, in which the error is diluted in the earlier layers, ReLU may be used instead of the sigmoid function. The sigmoid function requires a computation for every value, whereas the ReLU function considerably reduces the amount of computation and thus improves computing speed. The ReLU function may also improve regularization.
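A small numerical illustration of the vanishing gradient effect follows. It is a simplification that multiplies only the activation derivatives across layers and ignores the weights, so it shows the trend rather than the exact gradient of any real network.

```python
import math
from typing import Callable


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid function; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad: Callable[[float], float], pre_activation: float, layers: int) -> float:
    """Product of per-layer activation derivatives, as the chain rule would multiply them."""
    result = 1.0
    for _ in range(layers):
        result *= grad(pre_activation)
    return result


print(chained_gradient(sigmoid_grad, 0.0, 10))  # 0.25 ** 10, about 9.5e-07
print(chained_gradient(relu_grad, 1.0, 10))     # 1.0
```

Even at the sigmoid's most favorable input, ten layers shrink the factor to below one millionth, while an active ReLU path passes the gradient through unchanged.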

In addition, in training the clustering module 24 according to an embodiment of the present invention, the standardized face data 230 input during the training session may be randomly cropped into a plurality of patches to diversify the input data, thereby improving accuracy in the inference session.
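Random cropping of a training image into patches might be sketched as follows. The patch size of 48x40 and the single-channel image are hypothetical simplifications; the specification gives neither the patch dimensions nor the number of patches.

```python
import random
from typing import List

Image = List[List[int]]  # rows of pixel values (single channel for brevity)


def random_patches(image: Image, patch_h: int, patch_w: int, count: int,
                   rng: random.Random) -> List[Image]:
    """Cut `count` patches of size patch_h x patch_w at random positions."""
    height, width = len(image), len(image[0])
    if patch_h > height or patch_w > width:
        raise ValueError("patch is larger than the image")
    patches = []
    for _ in range(count):
        top = rng.randint(0, height - patch_h)   # randint is inclusive at both ends
        left = rng.randint(0, width - patch_w)
        patches.append([row[left:left + patch_w] for row in image[top:top + patch_h]])
    return patches


# A 55-row by 47-column dummy face image, matching the [47x55] input size in the text.
face = [[(r * 47 + c) % 256 for c in range(47)] for r in range(55)]
patches = random_patches(face, patch_h=48, patch_w=40, count=5, rng=random.Random(0))
print(len(patches), len(patches[0]), len(patches[0][0]))  # 5 48 40
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible between training runs.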

In addition, in training the clustering module 24 according to an embodiment of the present invention, the softmax loss may be taken as the identification loss and a loss based on Euclidean distance as the verification loss, and accuracy may be improved by using a multi-task training session that combines the two.
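The two losses can be combined as in the sketch below. The softmax cross-entropy is the standard identification loss; for the verification loss, a contrastive form on the Euclidean distance is assumed here because the specification only says that Euclidean distance is used. The margin and the weighting factor are hypothetical.

```python
import math
from typing import Sequence


def identification_loss(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one sample."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[label]


def verification_loss(a: Sequence[float], b: Sequence[float], same: bool, margin: float = 1.0) -> float:
    """Contrastive loss on the Euclidean distance between two embeddings (assumed form)."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if same:
        return 0.5 * d * d                    # pull embeddings of the same person together
    return 0.5 * max(0.0, margin - d) ** 2    # push different people at least `margin` apart


def multitask_loss(logits_a: Sequence[float], label_a: int,
                   logits_b: Sequence[float], label_b: int,
                   emb_a: Sequence[float], emb_b: Sequence[float],
                   weight: float = 0.05) -> float:
    """Identification loss for both faces plus a weighted verification loss for the pair."""
    ident = identification_loss(logits_a, label_a) + identification_loss(logits_b, label_b)
    verif = verification_loss(emb_a, emb_b, same=(label_a == label_b))
    return ident + weight * verif


print(multitask_loss([2.0, 0.0], 0, [1.5, 0.2], 0, [0.1, 0.2], [0.2, 0.1]))
```

The verification term acts directly on the embeddings, so it shapes the feature space that the clustering step later relies on, while the identification term keeps the features discriminative across the training identities.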

As described above, those skilled in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

The features and advantages described in this specification are not all-inclusive; in particular, many additional features and advantages will be apparent to those skilled in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and may not have been selected to delineate or limit the subject matter of the invention.

The above description of embodiments of the present invention has been presented for purposes of illustration. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Those skilled in the art will understand that many modifications and variations are possible in light of the above disclosure.

Therefore, the scope of the present invention is limited not by the detailed description but by the claims of any application based on it. Accordingly, the disclosure of the embodiments of the present invention is illustrative and does not limit the scope of the present invention set forth in the following claims.

1: High-speed image extraction apparatus using face clustering technique
10: Image segment module
20: Face clustering module
21: Face detection module
22: Landmark detection module
23: Standardization module
24: Clustering module
30: Image integration module
100: Original image data
110: Image segment
200: Representative face data
230: Standardized face data
241: Convolution layer
242: Pooling layer
243: Face clustering layer
300: Selected face data
310: Integrated image data

Claims

An image segment generation module for receiving original image data, which is an original image desired to be reprocessed for each person, and generating a plurality of image segments by dividing the original image data into a plurality of pieces through scene change detection;
A face clustering module configured to generate representative face data, which is a face image of a representative person included in the original image data, by receiving a plurality of the image segments in the image segment generation module and clustering the faces of a person in each of the image segments; And
An image integration module configured to receive selected face data, which is information on an image of a person desired to be reprocessed by a user, from a user client, and to generate integrated image data by merging the image segments including the selected face data;
Including,
The face clustering module includes an artificial neural network including a convolution layer, a face clustering layer, which is an output layer of the artificial neural network, is directly connected to the last convolution layer, and the representative face data is generated using only the face feature embedding step, not the final identification result, of a face identification network trained on a face dataset for a plurality of people,
Characterized in that, based on a face clustering technique built on a machine learning algorithm trained on a face dataset consisting of a small number of people, only the video in which the desired person appears is automatically edited and output according to the user's selection.
An integrated image output device of representative faces using facial feature embedding.